TRANSCRIPT
www.idris.fr — Institut du développement et des ressources en informatique scientifique
Introduction to OpenACC and OpenMP GPU
Pierre-François Lavallée, Thibaut Véry
1 Introduction: Definitions; Short history; Architecture; Execution models; Porting strategy; PGI compiler; Profiling the code
2 Parallelism: Offloaded Execution
  - Add directives
  - Compute constructs: Serial, Kernels, Teams, Parallel
  - Work sharing: Parallel loop, Reductions, Coalescing
  - Routines
  - Data Management
  - Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 2/77
Introduction - Useful definitions
- Gang (OpenACC) / Teams (OpenMP): coarse-grain parallelism
- Worker (OpenACC): fine-grain parallelism
- Vector: group of threads executing the same instruction (SIMT)
- Thread: execution entity
- SIMT: Single Instruction Multiple Threads
- Device: accelerator on which execution can be offloaded (e.g. a GPU)
- Host: machine hosting one or more accelerators and in charge of execution control
- Kernel: piece of code that runs on an accelerator
- Execution thread: sequence of kernels to be executed on an accelerator
Introduction - Comparison of CPU/GPU cores
The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:
- Non-accelerated: 2×20 = 40 cores
- Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores
Introduction - The ways to GPU
Ordered by increasing programming effort and technical expertise:
- Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code; maximum performance
- Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient
- Programming languages (CUDA, OpenCL): complete rewriting, complex; non-portable; optimal performance
Introduction - Short history
OpenACC
- http://www.openacc.org
- Cray, NVidia, PGI, CAPS
- First standard: 1.0 (11/2011)
- Latest standard: 3.0 (11/2019)
- Main compilers: PGI; Cray (for Cray hardware); GCC (since 5.7)
OpenMP target
- http://www.openmp.org
- First standard: OpenMP 4.5 (11/2015)
- Latest standard: OpenMP 5.0 (11/2018)
- Main compilers: Cray (for Cray hardware); GCC (since 7); CLANG; IBM XL; PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay
[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100) — host (CPU) and devices (GPU) connected through PLX switches and the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s]
Introduction - Architecture CPU-GPU
[Figure: comparison of CPU and GPU architectures]
Introduction - Execution model: OpenACC
[Figure: timeline where the host (CPU) successively launches kernel1, kernel2 and kernel3 on the accelerator]
In OpenACC, execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model: OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
- Coarse grain: gang
- Fine grain: worker
- Vectorization: vector
- Sequential: seq
By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
- Gang-Redundant (GR): all gangs run the same instructions redundantly
- Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
- Worker-Single (WS): one worker is active in GP or GR mode
- Worker-Partitioned (WP): work is shared between the workers of a gang
- Vector-Single (VS): one vector channel is active
- Vector-Partitioned (VP): several vector channels are active
These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

  nbthreads = nbgangs × nbworkers × vectorlength

Important notes:
- No synchronization is possible between gangs.
- The compiler can decide to synchronize the threads of a gang (all or part of them).
- The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
- The number of gangs is limited to 2^31 − 1.
- The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024.
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
- The size of a worker (vectorlength) should be a multiple of 32.
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.
Introduction - Execution model
[Figure: number of active threads for different constructs, with num_gangs(2), num_workers(5), vector_length(8)]
- !$ACC parallel (starts gangs): 2 active threads (gang-redundant)
- !$ACC loop worker: 2 × 5 = 10 active threads
- !$ACC loop vector (nested in loop worker): 2 × 5 × 8 = 80 active threads
- !$ACC loop worker vector: 2 × 5 × 8 = 80 active threads
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:
1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU
[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize loops, optimize data]
PGI compiler
During this training course we will use PGI for the pieces of code containing OpenACC directives.
Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers
Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90 / pgfortran
Activate OpenACC:
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler will do implicit operations that you want to be aware of
Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta:
Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70
Memory management:
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified
Complete documentation: https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
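Putting the flags together, a typical compile line for a V100 partition might look like the following (the source and executable names are illustrative):

```shell
# Fortran source with OpenACC directives, targeting V100 (cc70),
# with compiler feedback about the generated kernels
pgfortran -acc -ta=tesla:cc70 -Minfo=accel -o gol gol.f90

# same idea in C, here with unified (managed) memory enabled
pgcc -acc -ta=tesla:cc70,managed -Minfo=accel -o prog prog.c
```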
PGI Compiler - Compiler information
Information available with -Minfo=accel:

  program loop
    integer :: a(10000)
    integer :: i
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
  end program loop

(exemples/loop.f90)

  program reduction
    integer :: a(10000)
    integer :: i
    integer :: sum=0
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo
  end program reduction

(exemples/reduction.f90)
PGI Compiler - Compiler information
Information available with -Minfo=accel:

  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

  !$ACC data create(a(1:10000))
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
  !$ACC end data

(exemples/reduction_data_region.f90)
Analysis - Code profiling
Here are 3 tools available to profile your code:
PGI_ACC_TIME:
- Command-line tool
- Gives basic information about: the time spent in kernels, the time spent in data transfers, how many times a kernel is executed, and the number of gangs, workers and the vector size mapped to hardware
nvprof:
- Command-line tool
- Options which can give you a fine view of your code
nvvp / pgprof / NSight Graphics Profiler:
- Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. In the report, the grid is the number of gangs and the block is the size of one gang ([vectorlength × nbworkers]).
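In practice (the executable name ./gol is illustrative):

```shell
# enable PGI's built-in profiling; timings are printed when the program exits
export PGI_ACC_TIME=1
./gol

# disable it again before using nvprof (the two are incompatible)
export PGI_ACC_TIME=0
```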
Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

  $ nvprof <options> executable_file <arguments of executable file>

  Type            Time(%)  Time      Calls  Avg       Min       Max       Name
  GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                  36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
- Only the GPU is profiled by default.
- The option "--cpu-profiling on" activates CPU profiling.
- The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

  $ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

  $ nvvp <options> prog.prof
Analysis - Graphical profiler: NVVP
[Figure: screenshot of the nvvp graphical interface]
2 Parallelism
Add directives
To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.
Fortran (OpenACC):

  !$ACC directive <clauses>
  !$ACC end directive

C/C++ (OpenACC):

  #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading
To run on the device, you need to specify one of the 3 following directives (called compute constructs):
serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.
kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.
parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default (gang-redundant, worker-single, vector-single); the programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).
Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.
Fortran:

  !$ACC serial
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end serial
Offloaded execution - Directive serial

  !$ACC serial
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                oworld(r-1,c)                +oworld(r+1,c)  +&
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end serial

(exemples/gol_serial.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).
Data: default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.
Fortran:

  !$ACC kernels
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.
Execution offloading - Kernels

  !$ACC kernels
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                oworld(r-1,c)                +oworld(r+1,c)  +&
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end kernels

(exemples/gol_kernels.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.
Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

  !$ACC parallel
  a = 2  ! Gang-redundant
  !$ACC loop  ! Work sharing
  do i=1,n
    ...
  enddo
  !$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive

  !$ACC parallel
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                oworld(r-1,c)                +oworld(r+1,c)  +&
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

(exemples/gol_parallel.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar), 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).
Nonetheless, the programmer may set those parameters inside the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

  program param
    integer :: a(1000)
    integer :: i
    !$ACC parallel num_gangs(10) num_workers(1) &
    !$ACC& vector_length(128)
    print *, "Hello, I am a gang"
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program

(exemples/parametres_paral.f90)

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.
Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies
Summary (work sharing / activation of new threads):
- loop gang: shares work
- loop worker: shares work and activates new threads
- loop vector: shares work and activates new threads
Work sharing - Loops
Clauses:
- Merge tightly nested loops: collapse(number-of-loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)
Notes:
- You might see the directive "do" used for compatibility.
Work sharing - Loops

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                oworld(r-1,c)                +oworld(r+1,c)  +&
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

(exemples/gol_parallel_loop.f90)
Work sharing - Loops

serial:

  !$acc serial
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end serial

(exemples/loop_serial.f90)
- Active threads: 1
- Number of operations: nx

Execution GR,WS,VS:

  !$acc parallel num_gangs(10)
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GRWSVS.f90)
- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 × nx

Execution GP,WS,VS:

  !$acc parallel num_gangs(10)
  !$acc loop gang
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GPWSVS.f90)
- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP,WP,VS:

  !$acc parallel num_gangs(10)
  !$acc loop gang worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GPWPVS.f90)
- Iterations are shared among the active workers of each gang
- Active threads: 10 × workers
- Number of operations: nx

Execution GP,WS,VP:

  !$acc parallel num_gangs(10)
  !$acc loop gang vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GPWSVP.f90)
- Iterations are shared among the threads of the active worker of all gangs
- Active threads: 10 × vectorlength
- Number of operations: nx

Execution GP,WP,VP:

  !$acc parallel num_gangs(10)
  !$acc loop gang worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GPWPVP.f90)
- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 × workers × vectorlength
- Number of operations: nx

Execution GR,WP,VS:

  !$acc parallel num_gangs(10)
  !$acc loop worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GRWPVS.f90)
- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 × workers
- Number of operations: 10 × nx

Execution GR,WS,VP:

  !$acc parallel num_gangs(10)
  !$acc loop vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GRWSVP.f90)
- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 × vectorlength
- Number of operations: 10 × nx

Execution GR,WP,VP:

  !$acc parallel num_gangs(10)
  !$acc loop worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel

(exemples/loop_GRWPVP.f90)
- Each gang is assigned all the iterations, and its grid of threads shares the work
- Active threads: 10 × workers × vectorlength
- Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race conditions. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop gang reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel

(exemples/loop_pb_sync.f90)
- Result: sum = 97845213 (wrong value)

  !$acc parallel
  !$acc loop worker vector
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel

(exemples/loop_pb_sync_corr.f90)
- Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.
For example, with the kernels directive:

  program fused
    integer :: a(1000), b(1000), i
    !$ACC kernels <clauses>
    !$ACC loop <clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
    !$ACC end kernels
    ! The construct above is equivalent to:
    !$ACC kernels loop <kernels and loop clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
  end program fused

(exemples/fused.f90)
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.
Reduction operators:
- + : sum (Fortran, C/C++)
- * : product (Fortran, C/C++)
- max : maximum (Fortran, C/C++)
- min : minimum (Fortran, C/C++)
- & : bitwise and (C/C++); iand (Fortran)
- | : bitwise or (C/C++); ior (Fortran)
- ieor : bitwise xor (Fortran)
- && : logical and (C/C++); .and. (Fortran)
- || : logical or (C/C++); .or. (Fortran)
Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                oworld(r-1,c)                +oworld(r+1,c)  +&
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    !$ACC loop reduction(+:cells)
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing
Principles:
- Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.
[Figure: examples of coalesced and non-coalesced access patterns]
Parallelism - Memory accesses and coalescing

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    !$acc loop vector
    do j=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel

(exemples/loop_nocoalescing.f90)

  !$acc parallel
  !$acc loop gang
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel

(exemples/loop_coalescing.f90)

  !$acc parallel
  !$acc loop gang vector
  do i=1,nx
    !$acc loop seq
    do j=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel

(exemples/loop_coalescing.f90)

Time: 43.9 ms without coalescing, 1.6 ms with coalescing.
The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing
[Figure: access patterns of threads T0..T7 — without coalescing: gang(i)/vector(j); with coalescing: gang(j)/vector(i) and gang-vector(i)/seq(j)]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).
Without ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).
Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions:

- Computation offloading: offloading regions are associated with a local data region (serial, parallel, kernels)
- Global region: an implicit data region is opened during the lifetime of the program; it is managed with the enter data and exit data directives
- Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block
- Data region associated with the programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure; to make data visible, use the declare directive

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.
! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.
// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified
- In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
- The shape must be specified when using a slice
- If the first index is omitted, it is taken to be the default of the language (C/C++: 0, Fortran: 1)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.
program para
  integer :: a(1000)
  integer :: i, l
  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a data region and a parallel region around the loop]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables listed in the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you still have to move the beginning of the data region before the first loop.
Data on device

Naive version: two successive parallel regions, each with its own implicit data region.

[Diagram: the four arrays A, B, C and D go through their copyin, copyout, copy and create clauses twice, once per parallel region]

- Transfers: 8
- Allocations: 2
Data on device

Optimize data transfers: the same two parallel regions, each still carrying its copyin, copyout, copy and create clauses. Let's add a data region.
Data on device

Optimize data transfers: with an enclosing data region carrying the copyin, copyout, copy and create clauses, A, B and C are now transferred at entry and exit of the data region.

- Transfers: 4
- Allocations: 1
Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device before the first region, update self after the second).

- Transfers: 6
- Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data
Important notes: one benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data are used. It is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare
Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        | Scope
module      | the program
function    | the function
subroutine  | the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Diagram: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). Entering the region transfers a and c H→D and allocates b and d on the device without transfer. A later update self(b) copies the device value of b back D→H in the middle of the region. Leaving the region transfers d D→H (copyout); host-side modifications made in the meantime are not propagated to the device without an update device]
Data on device - Important notes
bull Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
bull Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.
Directive update: the update directive is used to update data either on the host or on the device.
! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
bull computation and data transfers
bull kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: three time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. Kernel 1 and transfer async: Kernel 1 overlaps the data transfer. Everything async: Kernel 1, the data transfer and Kernel 2 all overlap]
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
bull The argument of the wait clause or directive is a list of integers representing execution-thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.
[Diagram: Kernel 1 runs on one queue while the data transfer runs on another; Kernel 2 waits for the transfer before starting]
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU execution
bull How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
bull To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60:autocompare)
bull Example of code that has a race condition
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)           +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34m27s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)           +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)           +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 | OpenMP 5.0 (directives/clauses and notes)

GPU offloading
- PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS: creates teams of threads that execute redundantly
- SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
- KERNELS: offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
- scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
- clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET (data movement defined from the host perspective)
- copyin(): copy H→D when entering the construct | map(to:)
- copyout(): copy D→H when leaving the construct | map(from:)
- copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
- create(): created on device, no transfers | map(alloc:)
- present(), delete()
Features: OpenACC 2.7 | OpenMP 5.0 (directives/clauses and notes)

Data region
- declare(): add data to the global data region
- data / end data: inside the same programming unit | TARGET DATA MAP(...) / END TARGET DATA: inside the same programming unit
- enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
- update self(): update D→H | UPDATE FROM: copy D→H inside a data region
- update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization
- kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
- loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Features: OpenACC 2.7 | OpenMP 5.0 (directives/clauses and notes)

- routines: routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism
- asynchronism: async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts

Contacts
bull Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
bull Thibaut Véry: thibaut.very@idris.fr
bull Rémi Lacroix: remi.lacroix@idris.fr
Introduction - The ways to GPU
Applications can reach the GPU in three ways, in increasing order of programming effort and technical expertise:

bull Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
bull Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
bull Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance
Introduction - Short history
OpenACC
bull http://www.openacc.org
bull Cray, NVidia, PGI, CAPS
bull First standard: 1.0 (11/2011)
bull Latest standard: 3.0 (11/2019)
bull Main compilers: PGI, Cray (for Cray hardware), GCC (since 5.7)

OpenMP target
bull http://www.openmp.org
bull First standard: OpenMP 4.5 (11/2015)
bull Latest standard: OpenMP 5.0 (11/2018)
bull Main compilers: Cray (for Cray hardware), GCC (since 7), CLANG, IBM XL, PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay

[Diagram: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU), the PLX switches and the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s]
Introduction - Architecture CPU-GPU
Introduction - Execution model OpenACC
[Diagram: execution controlled by the host; the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
bull Coarse grain: gang
bull Fine grain: worker
bull Vectorization: vector
bull Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
bull Gang-Redundant (GR): all gangs run the same instructions redundantly
bull Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
bull Worker-Single (WS): one worker is active in GP or GR mode
bull Worker-Partitioned (WP): work is shared between the workers of a gang
bull Vector-Single (VS): one vector channel is active
bull Vector-Partitioned (VP): several vector channels are active
These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread grid (worker x vector). One worker is actually a vector of vectorlength threads, so the total number of threads is:

nbthreads = nbgangs × nbworkers × vectorlength

Important notes:
bull No synchronization is possible between gangs
bull The compiler can decide to synchronize the threads of a gang (all or part of them)
bull The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
bull The number of gangs is limited to 2^31 - 1
bull The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
bull Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
bull The size of a worker (vectorlength) should be a multiple of 32
bull PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model
[Diagram: number of active threads at each point of regions launched with num_gangs(2) num_workers(5) vector_length(8): 2 threads in gang-redundant code after ACC parallel (starts gangs), 10 inside ACC loop worker, 80 inside ACC loop vector or ACC loop worker vector, and back to 2 outside the loops]
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Diagram: iterative porting cycle: pick a kernel, analyze, parallelize loops, optimize loops, optimize data]
PGI compiler
During this training course we will use PGI forthe pieces of code containing OpenACCdirectives
Information:
bull Company founded in 1989
bull Acquired by NVidia in 2013
bull Develops compilers, debuggers and profilers

Compilers:
bull Latest version: 19.10
bull Hands-on version: 19.10
bull C: pgcc
bull C++: pgc++
bull Fortran: pgf90, pgfortran

Activate OpenACC:
bull -acc: activates OpenACC support
bull -ta=<options>: OpenACC options
bull -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools:
bull nvprof: CPU/GPU profiler
bull nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: The memory location on the host is pinned. It might improve data transfers.
• managed: The memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information
Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler:
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s    500  5.4132ms     448ns  6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s    300  6.1914ms     384ns  9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   4.0968ms    100  40.968us  40.122us  41.725us  gol_52_gpu
                  0.01%   3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels, Teams, Parallel
   Work sharing
      Parallel loop, Reductions, Coalescing
   Routines
   Data Management
   Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
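As a minimal sketch of the C/C++ syntax shown above (the function name and array are ours, not from the course): a loop offloaded with #pragma acc kernels loop. Built with pgcc -acc the pragma offloads the loop; a compiler without OpenACC support ignores it and the loop simply runs on the host, which is exactly the "treated as comments" behavior described.

```c
#include <stddef.h>

/* With pgcc -acc the pragma offloads the loop to the device;
   without OpenACC support it is ignored and the loop runs on the host. */
void scale_array(double *a, size_t n)
{
    #pragma acc kernels loop copy(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = a[i] * 2.0;
}
```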
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
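The two notes above can be sketched in C (the function and array names are ours): a single kernels region containing two loop nests, which the compiler turns into two kernels executed one after the other, so the second nest can safely consume what the first produced.

```c
#include <stddef.h>

/* One kernels region, two loop nests: the compiler generates one
   kernel per nest and runs them consecutively, each with its own
   gang/worker/vector parameters. */
void init_then_accumulate(double *a, double *b, size_t n)
{
    #pragma acc kernels copyout(a[0:n]) copy(b[0:n])
    {
        for (size_t i = 0; i < n; ++i)   /* first kernel */
            a[i] = (double)i;
        for (size_t i = 0; i < n; ++i)   /* second kernel, runs after the first */
            b[i] += a[i];
    }
}
```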
Execution offloading - Kernels
!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region.
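The gang-redundant behavior can be sketched in C (the function name and variables are ours): the statement outside the loop directive is executed by every gang, while the loop iterations are shared among them.

```c
#include <stddef.h>

/* Code outside a loop directive is executed redundantly by every gang;
   only the iterations of the loop directive are shared among gangs. */
void fill_with_offset(double *a, size_t n, double x)
{
    #pragma acc parallel copyout(a[0:n])
    {
        double c = 2.0 * x;      /* gang-redundant: every gang computes c */
        #pragma acc loop         /* work sharing: iterations spread over gangs */
        for (size_t i = 0; i < n; ++i)
            a[i] = c + (double)i;
    }
}
```

Because c is declared inside the region it is private to each gang, so the redundant computation is harmless.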
Execution offloading - parallel directive
!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: The number of gangs
• num_workers: The number of workers
• vector_length: The vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause       | Work sharing (loop iterations) | Activation of new threads
loop gang    | X                              |
loop worker  | X                              | X
loop vector  | X                              | X
Work sharing - Loops
Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility.
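The collapse and private clauses can be sketched in C (the function and its arguments are ours): the two tightly nested loops are merged into one iteration space, and tmp is privatized so every iteration gets its own copy.

```c
#include <stddef.h>

/* collapse(2) merges the two tightly nested loops into a single
   iteration space of rows*cols; private(tmp) gives each iteration
   its own copy of tmp. */
void saxpy_2d(double *a, const double *b, size_t rows, size_t cols, double s)
{
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) \
                copy(a[0:rows*cols]) copyin(b[0:rows*cols])
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j) {
            tmp = s * b[i*cols + j];
            a[i*cols + j] += tmp;
        }
}
```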
Work sharing - Loops
!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector_length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector_length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector_length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector_length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 978452.13 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 1000000.00 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
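A minimal reduction sketch in C (the function name is ours): each thread accumulates into a private copy of sum, and the private copies are combined with + when the loop finishes.

```c
#include <stddef.h>

/* reduction(+:sum): each thread gets a private copy of sum initialized
   for +, and the partial sums are combined at the end of the loop. */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```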
Parallelism - Reductions example
!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / example without «coalescing»:
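The same principle can be sketched in C (the function is ours; note that C is row-major, so the fast index is the last one, the mirror image of the Fortran examples where the fast index is the first one): the vector loop runs over the contiguous index, so consecutive threads touch consecutive addresses.

```c
#include <stddef.h>

/* C stores arrays row-major, so the vector loop should run over the
   last index j: consecutive vector threads then access consecutive
   addresses a[i*n + j], a[i*n + j + 1], ... and the accesses coalesce. */
void fill_coalesced(double *a, size_t n)
{
    #pragma acc parallel loop gang copyout(a[0:n*n])
    for (size_t i = 0; i < n; ++i) {
        #pragma acc loop vector   /* j is the fast (contiguous) index */
        for (size_t j = 0; j < n; ++j)
            a[i*n + j] = 1.14;
    }
}
```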
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms): 43.9               | 1.6

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Diagrams (four animation steps): mapping of the threads T0–T7 of a worker onto the matrix for the three variants. Without coalescing — gang(i), vector(j): consecutive threads access elements separated by a whole column stride, so their memory accesses cannot be merged. With coalescing — gang(j), vector(i): consecutive threads access consecutive elements and the accesses are merged into a few memory transactions. gang-vector(i), seq(j): each thread walks sequentially along its own row of the matrix.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   integer :: i
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   integer :: i
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime: a data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement:
• copyin: The variable is copied H→D; the memory is allocated when entering the region.
• copyout: The variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout

No data movement:
• create: The memory is allocated when entering the region.
• present: The variable is already on the device.
• delete: Frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
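The movement clauses can be sketched in C (the function is ours): b is only read on the device, so copyin suffices; a is only written there, so copyout suffices; neither array needs to travel in the other direction.

```c
#include <stddef.h>

/* copyin(b): b is copied H->D at region entry, never copied back.
   copyout(a): a is allocated on the device and copied D->H at exit. */
void scale_into(double *a, const double *b, size_t n, double s)
{
    #pragma acc parallel loop copyout(a[0:n]) copyin(b[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = s * b[i];
}
```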
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its implicit data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization kernel also runs inside it.
Data on device
Naive version

[Diagram: two successive parallel regions, each with its own implicit data region; each one independently handles the variables A (copyin), B (copyout), C (copy) and D (create)]

• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers

[Diagram: the same two parallel regions still transfer A (copyin), B (copyout), C (copy) and allocate D (create) independently]

Let's add a data region.
Data on device
Optimize data transfers

[Diagram: a data region now encloses both parallel regions, with the clauses copyin(A), copyout(B), copy(C) and create(D) moved onto it]

A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocation: 1
Data on device
Optimize data transfers

[Diagram: with the data region in place, the data clauses on the inner parallel regions only check for presence; an update device on A precedes the first region and an update self on C follows the second one]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are actually done, use the update directive.

• Transfers: 6
• Allocation: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo
do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo
do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i = 1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Timeline diagram: the host opens a data region with copyin(a,c) create(b) copyout(d); kernels then modify the device copies while the host values stay unchanged; an update self(b) copies the device value of b back to the host during the region; when the region ends, only d is copied back to the host (copyout) and the device copies are released]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since data clauses on kernels and parallel constructs only check for presence when an enclosing data region already holds the variables, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with "Kernel 1 and transfer async", Kernel 1 overlaps with the data transfer; with "everything async", Kernel 1, the data transfer and Kernel 2 all overlap]

! b is initialized on host
do i = 1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels will likely need the completion of others. In that case, add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs on one queue while the data transfer runs on another; Kernel 2 waits for the transfer before starting]

! b is initialized on host
do i = 1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
  cells = 0
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c = 1, cols
    do r = 1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: GPU offloading
• OpenACC: PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) / OpenMP: TEAMS (creates teams of threads that execute redundantly)
• OpenACC: SERIAL (only 1 thread executes the code) / OpenMP: TARGET (GPU offloading, initial thread only)
• OpenACC: KERNELS (offloading + automatic parallelization) / OpenMP: does not exist

Features: implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables / OpenMP: scalar variables, non-scalar variables

Features: explicit transfers H→D or D→H inside explicit parallel regions
OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET (data movement defined from the host perspective)
• OpenACC: copyin() (copy H→D when entering the construct) / OpenMP: map(to:)
• OpenACC: copyout() (copy D→H when leaving the construct) / OpenMP: map(from:)
• OpenACC: copy() (copy H→D when entering and D→H when leaving) / OpenMP: map(tofrom:)
• OpenACC: create() (created on device, no transfers) / OpenMP: map(alloc:)
• OpenACC: present(), delete()
Features: data regions
• OpenACC: declare (adds data to the global data region)
• OpenACC: data / end data (inside the same programming unit) / OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC: enter data / exit data (opening and closing anywhere in the code) / OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC: update self() (update D→H) / OpenMP: UPDATE FROM() (copy D→H inside a data region)
• OpenACC: update device() (update H→D) / OpenMP: UPDATE TO() (copy H→D inside a data region)

Features: loop parallelization
• OpenACC: kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) / OpenMP: does not exist
• OpenACC: loop gang/worker/vector (shares iterations among gangs, workers and vectors) / OpenMP: distribute (shares iterations among TEAMS)
Features: routines
• OpenACC: routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Features: asynchronism
• OpenACC: async()/wait() (execute kernels/transfers asynchronously, with explicit synchronization) / OpenMP: nowait/depend (manage asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
  A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
  A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
  !$OMP PARALLEL DO SIMD
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
  !$acc loop worker
  do j = 1, nx
    !$acc loop vector
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
  !$OMP PARALLEL DO
  do j = 1, nx
    !$OMP SIMD
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - The ways to GPU
Three approaches, with increasing programming effort and technical expertise:
• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance
Introduction - Short history
OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay
[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 V100 GPUs). The host (CPU) and the devices (GPUs) are connected through PLX switches and the network; the link bandwidths shown include 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s]
Introduction - Architecture CPU-GPU
Introduction - Execution model OpenACC
[Diagram: execution controlled by the host; the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model
With !$ACC kernels num_gangs(2) num_workers(5) vector_length(8), the number of active threads depends on the construct:

Construct                        Active threads
!$ACC parallel (starts gangs)    2  (one per gang: GR/WS/VS mode)
!$ACC loop worker                10 (2 gangs x 5 workers)
!$ACC loop vector                80 (2 gangs x 5 workers x vector of 8)
!$ACC loop worker vector         80
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:
1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Cycle diagram: pick a kernel, analyze, parallelize loops, optimize data, optimize loops]
PGI compiler
During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i = 1, 10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
  !$ACC parallel loop
  do i = 1, 10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i = 1, 10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i = 1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i = 1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with the PGI compilers it is possible to get an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    4.0968ms  100    40.968us  40.122us  41.725us  gol_52_gpu
                0.01%    359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i = 1, n
  do j = 1, n
    ...
  enddo
  do j = 1, n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that theregion contains instructions to be offloaded onthe deviceEach loop nest will be treated as an independantkernel with its own parameters (number ofgangs workers and vector size)
Data: default behavior
• Arrays present inside the kernels region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.
Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels
Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive have their iterations spread among the gangs.
Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.
!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel
Important notes
• The number of gangs, the number of workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).
Nonetheless, the programmer can set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size
program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90
Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.
Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies
Clause         Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                              X
loop vector                 X                              X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility
Work sharing - Loops
!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx
Execution: GR, WS, VS (GR/GP = gang redundant/parallel, WS/WP = worker single/parallel, VS/VP = vector single/parallel)

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx
Execution: GP, WS, VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution: GP, WP, VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx
Execution: GP, WS, VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx
Execution: GP, WP, VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx
Execution: GR, WP, VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx
Execution: GR, WS, VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx
Execution: GR, WP, VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads shares the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)
!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for the combined construct are those of both constructs.
For example with kernels directive
program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels

   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction clause is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in table 1 is applied to get the final value of the variable.
Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Table 1: reduction operators
Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are also possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged, which optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• In a loop nest, the loop with vector parallelism should be the one with contiguous memory accesses
Example with coalescing / Example without coalescing
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90
!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90
!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90
                 Without coalescing   With coalescing
Time (ms)        43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

Figure: access patterns of threads T0 to T7 on the array
• Without coalescing: gang(i), vector(j)
• With coalescing: gang(j), vector(i)
• With coalescing: gang vector(i), seq(j)
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:
program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:
program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) have an associated local data region.

Global region
An implicit data region is opened for the lifetime of the program. It is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses
H: host, D: device; a variable is a scalar or an array.
Data movement
• copyin: the variable is copied H→D when entering the region; the memory is allocated on entry
• copyout: the variable is copied D→H when exiting the region; the memory is allocated on entry
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable
By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.
Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
Fortran
The array shape is specified in parentheses; you must specify the first and last index.
! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo

! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is specified in square brackets; you must specify the first index and the number of elements.
// We copy the array a by giving the first element and the size
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;

// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to that of the language (C/C++: 0, Fortran: 1)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.
program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90
Diagram: the parallel region is enclosed in its associated data region.
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.
program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90
Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal
Data on device - Local data regions
It is possible to open a data region inside a procedure, which makes the variables listed in its clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.
Data on device

Naive version: two successive parallel regions (each with its own implicit data region) use the variables A (copyin), B (copyout), C (copy) and D (create); every clause acts at each region.
• Transfers: 8
• Allocations: 2

To optimize the data transfers, let's add a data region enclosing both parallel regions and move the clauses onto it. A, B and C are then transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

Since the data clauses check for presence, you can use the present clause on the inner regions to make the code clearer. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included in a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90
Without update

The a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included in a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90
With update

The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included in a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that data transfers to the device can be managed in a different place from where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where the directive appears. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine
module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
Host and device time line for a region opened with copyin(a,c) create(b) copyout(d):
• On entry, a and c are copied H→D, and b and d are allocated on the device (host values: a=10, b=-4, c=42, d=0).
• Kernels update the device copies (a=10, b=26, c=42, d=158) without affecting the host copies.
• update self(b) copies b D→H during the region (host b becomes 26).
• Later host modifications (a=42, b=42) are not propagated to the device.
• On exit of the data region, d is copied D→H (host d becomes 158).
Data on device - Important notes
• Data transfers between the host and the device are costly; it is mandatory to minimize these transfers to achieve good performance
• Since data clauses can also appear on kernels and parallel, if an enclosing data region is opened with data (or with enter data) you should use update to avoid unexpected behaviors
Directive updateThe update directive is used to update data either on the host or on the device
! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, when they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.
Diagrams:
• Synchronous: kernel 1, the data transfer and kernel 2 execute one after the other.
• Kernel 1 and transfer async: kernel 1 and the data transfer overlap.
• Everything async: kernel 1, the data transfer and kernel 2 all run concurrently.
! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels are likely to need the completion of others. In that case, add a wait(execution-thread-number) clause to the directive that needs to wait. A wait directive is also available.
Important notes
• The argument of the wait clause or wait directive is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.
Diagram: kernel 1 and the data transfer run asynchronously; kernel 2 waits for the transfer to complete.
! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences between GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare)
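As a sketch, compiling and running with autocompare could look like this (flags taken from the slide; the source file name gol.f90 is only an illustration):

```shell
# Compile for a cc60 (P100) GPU with OpenACC and autocompare enabled
pgf90 -acc -ta=tesla:cc60:autocompare -Minfo=accel -o gol gol.f90
# At run time each offloaded kernel is also executed on the CPU;
# execution stops as soon as GPU and CPU results differ
./gol
```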
• Example of code that has a race condition
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...
exemples/opti/gol_opti1.f90
Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
exemples/opti/gol_opti2.f90
Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data
exemples/opti/gol_opti3.f90
Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading:
• OpenACC PARALLEL — GR/WS/VS mode: each gang executes the instructions redundantly ⇔ OpenMP TEAMS — creates teams of threads that execute redundantly
• OpenACC SERIAL — only 1 thread executes the code ⇔ OpenMP TARGET — GPU offloading, initial thread only
• OpenACC KERNELS — offloading + automatic parallelization ⇔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• OpenACC: scalar variables, non-scalar variables ⇔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ⇔ OpenMP: clauses on TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct ⇔ map(to:)
• copyout() — copy D→H when leaving the construct ⇔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ⇔ map(tofrom:)
• create() — created on the device, no transfers ⇔ map(alloc:)
• present(), delete()
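As an illustration of the copyin()/map(to:) and copyout()/map(from:) pairings in the table, the same transfers can be written in both models (a sketch; the arrays a, b and the size n are assumed):

```fortran
! OpenACC: b is copied H->D on entry, a is copied D->H on exit
!$ACC parallel loop copyin(b) copyout(a)
do i=1,n
   a(i) = 2*b(i)
enddo

! OpenMP 5.0 equivalent
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO map(to:b) map(from:a)
do i=1,n
   a(i) = 2*b(i)
enddo
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
```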
Features | OpenACC 2.7 | OpenMP 5.0

Data region:
• OpenACC declare() — add data to the global data region
• OpenACC data / end data — inside the same programming unit ⇔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• OpenACC enter data / exit data — opening and closing anywhere in the code ⇔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• OpenACC update self() — update D→H ⇔ OpenMP UPDATE FROM — copy D→H inside a data region
• OpenACC update device() — update H→D ⇔ OpenMP UPDATE TO — copy H→D inside a data region

Loop parallelization:
• OpenACC kernels — offloading + automatic parallelization of suitable loops, synchronization at the end of loops ⇔ does not exist in OpenMP
• OpenACC loop gang/worker/vector — share iterations among gangs, workers and vectors ⇔ OpenMP distribute — share iterations among TEAMS
Features | OpenACC 2.7 | OpenMP 5.0

Routines:
• OpenACC routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism:
• OpenACC async()/wait — execute kernels/transfers asynchronously, with explicit synchronization ⇔ OpenMP nowait/depend — manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel
exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel
exemples/loop2D.f90
OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET
exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE
exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel
exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE
exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
• Thibaut Véry — thibaut.very@idris.fr
• Rémi Lacroix — remi.lacroix@idris.fr
Introduction - The ways to GPU

Applications can be ported to GPU in three ways, listed by increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance
Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay

[Figure] Architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100): the host (CPU) and the devices (GPU) are connected through PLX switches and to the network; the diagram gives the bandwidth of each link (among them 900 GB/s for the GPU memory and 16 GB/s for the PCIe links).
Introduction - Architecture CPU-GPU
Introduction - Execution model OpenACC

[Figure] Execution controlled by the host: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
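The mode transitions can be sketched on a small Fortran example (the array names and bounds are illustrative; comments mark the mode that is active at each point):

```fortran
!$ACC parallel                  ! GR,WS,VS: gang leaders execute redundantly
a = 2                           ! executed by every gang
!$ACC loop gang                 ! GP,WS,VS: iterations split among gangs
do i=1,n
   !$ACC loop worker vector     ! GP,WP,VP: the full thread grid is active
   do j=1,n
      b(i,j) = a
   enddo
enddo
!$ACC end parallel
```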
Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model

[Figure] Active threads for a region compiled with num_gangs(2), num_workers(5), vector_length(8): a parallel construct starts 2 active threads (the gang leaders); a loop worker raises this to 10 threads (2 gangs x 5 workers); a nested loop vector activates the full grid of 80 threads (2 x 5 x 8); a loop worker vector goes directly from 2 to 80 active threads.
Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure] Porting cycle: pick a kernel → analyze → parallelize loops → optimize loops → optimize data.
PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler performs implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
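For instance, a V100 build with unified (managed) memory could be requested as follows (a sketch; prog.f90 is a placeholder file name):

```shell
# OpenACC build targeting V100, with CUDA unified memory:
# host and device allocations are migrated automatically by the runtime
pgf90 -acc -ta=tesla:cc70,managed -Minfo=accel -o prog prog.f90
```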
PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop
exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction
exemples/reduction.f90
PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data
exemples/reduction_data_region.f90
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length x nb_workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red
exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

[Figure] Screenshot of the nvvp interface.
1 Introduction
2 Parallelism
   Offloaded Execution: add directives; compute constructs (serial, kernels/teams, parallel)
   Work sharing: parallel loop, reductions, coalescing
   Routines
   Data Management
   Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1); only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated; one kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device; only one kernel is generated. The execution is redundant by default (gang-redundant, worker-single, vector-single); the programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial
exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels
exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! gang-redundant
!$ACC loop     ! work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel
exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is used to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang."
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program
exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism; this point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause       | Work sharing (loop iterations) | Activation of new threads
loop gang    | X                              |
loop worker  | X                              | X
loop vector  | X                              | X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(nb_loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
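A sketch combining these clauses on one loop nest (the array a, the scalar tmp and the bounds are illustrative):

```fortran
! collapse(2) merges the two tightly nested loops into one iteration space;
! tmp is private to each thread; total is reduced across all threads
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do i=1,n
   do j=1,m
      tmp = a(i,j)*coef
      total = total + tmp
   enddo
enddo
```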
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel
exemples/gol_parallel_loop.f90
Work sharing - Loops

serial:
!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial
exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS:
!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP,WS,VS:
!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP,WP,VS:
!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x nb_workers
• Number of operations: nx

Execution GP,WS,VP:
!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector_length
• Number of operations: nx

Execution GP,WP,VP:
!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x nb_workers x vector_length
• Number of operations: nx

Execution GR,WP,VS:
!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x nb_workers
• Number of operations: 10 x nx

Execution GR,WS,VP:
!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector_length
• Number of operations: 10 x nx

Execution GR,WP,VP:
!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x nb_workers x vector_length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. No such risk exists for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code that follows the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel
exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel
exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives with loop

It is quite common to open a parallel region just before a loop. In this case, the two directives can be merged into a kernels loop or parallel loop construct. The clauses available on the combined construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction clause is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is applied to the private copies to produce the final value of the variable.

Operations:
  +      Sum          Fortran/C(++)
  *      Product      Fortran/C(++)
  max    Maximum      Fortran/C(++)
  min    Minimum      Fortran/C(++)
  &      Bitwise and  C(++)
  |      Bitwise or   C(++)
  &&     Logical and  C(++)
  ||     Logical or   C(++)
  iand   Bitwise and  Fortran
  ior    Bitwise or   Fortran
  ieor   Bitwise xor  Fortran
  .and.  Logical and  Fortran
  .or.   Logical or   Fortran

Restrictions:
The variable must be a scalar of numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are also possible on arrays, but this is not yet implemented in PGI.
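As a minimal C sketch of a scalar "+" reduction (the function name and problem size are ours, not from the course), note that a compiler without OpenACC support simply ignores the pragma and runs the loop sequentially, producing the same value:

```c
#include <assert.h>

/* Sum of the first n integers with an OpenACC "+" reduction:
   each thread gets a private copy of "total", and the partial
   sums are combined with "+" at the end of the loop. */
long sum_first_integers(int n)
{
    long total = 0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}
```

With n = 1000 this yields 500500, whether the loop ran on the device or on the host.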
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged (coalesced); this optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop with vector parallelism should be the one with contiguous memory accesses.

Example with «coalescing» — Example with «no coalescing»
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_coalescing.f90

Time (ms): without coalescing 439, with coalescing 16.
The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Diagrams: access patterns of threads T0..T7 on a 2D Fortran array]
• Without coalescing — gang(i)/vector(j): the vector threads of a gang run along a row, so neighbouring threads touch memory locations that are nx elements apart.
• With coalescing — gang(j)/vector(i): the vector threads run down a column, so neighbouring threads touch consecutive memory locations.
• With coalescing — gang-vector(i)/seq(j): each thread walks its row sequentially while neighbouring threads stay on consecutive rows, so at every step of the j loop the accesses are again contiguous.
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, that procedure must be declared with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the procedure (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

Correct version, with the routine directive inside the called subroutine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: the offloading constructs (serial, parallel, kernels) each have an associated local data region.
• Global region: an implicit data region is open during the lifetime of the program; it is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region tied to a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lives as long as the procedure; to make data visible there, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which they appear.
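As a small C sketch (our own example, not from the course) of a local data region enclosing two compute regions, so the array is allocated and transferred once instead of twice:

```c
/* Hypothetical sketch: a "data" region makes the array present on
   the device for both parallel loops; "a" is copied back to the
   host only when the region ends. */
void init_then_increment(double *a, int n)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop present(a)
        for (int i = 0; i < n; ++i)
            a[i] = (double)i;          /* first kernel: initialize */

        #pragma acc parallel loop present(a)
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + 1.0;         /* second kernel: update */
    }   /* copyout: a is transferred D -> H here */
}
```

Compiled without OpenACC, the pragmas are ignored and the two loops run sequentially on the host with the same result.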
Data on device - Data clauses

H: Host, D: Device; "variable" means a scalar or an array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may still see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
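The movement clauses above can be combined on a single construct; a minimal C sketch (function and variable names are illustrative, not from the course):

```c
/* "b" is only read on the device (copyin), "c" is only produced
   there (copyout), and "tmp" is a device-only scratch array (create). */
void scale_shift(const double *b, double *c, double *tmp, int n)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(c[0:n]) create(tmp[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[i] = 2.0 * b[i];   /* intermediate value, never copied back */
        c[i]   = tmp[i] + 1.0;
    }
}
```

Without an OpenACC compiler the pragma is ignored and the loop runs on the host, writing the same values into c.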
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses, giving the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 (included) back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo
exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets, giving the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;
exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have an associated data region for the variables needed by the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have an associated data region for the variables needed by the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• The compute region is reached 100 times.
• The data region is entered and exited 100 times, so potentially 200 data transfers.

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in its clauses visible on the device. Use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• The compute region is reached 100 times.
• The data region is entered and exited once, so potentially only 2 data transfers.

Optimal? No: you should move the beginning of the data region before the first loop, so that the initialization also runs inside it.
Data on device

Naive version: two successive parallel regions, each with its own implicit data region, and each handling A (copyin), B (copyout), C (copy) and D (create) separately.

• Transfers: 8
• Allocations: 2
Data on device

Optimizing the data transfers: both parallel regions still carry their own clauses for A (copyin), B (copyout), C (copy) and D (create). Let's add an enclosing data region.
Data on device

With the enclosing data region, A, B and C are now transferred only at the entry and exit of the data region.

• Transfers: 4
• Allocations: 1
Data on device

The data clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions (A: copyin → present, B: copyout → present, C: copy → present, D: create → present). To make sure the updates are actually performed, use the update directive (e.g. update device for A, update self for C).

• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is updated on the host right after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.
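This slide has no code listing; as a hedged C sketch of the pattern (our own example), a host-side modification inside a data region followed by update device to refresh the device copy:

```c
/* Hypothetical sketch: the host modifies "b" while a data region is
   open, then "update device" pushes the new values H -> D before the
   next kernel uses them. */
void refresh_device_copy(double *b, int n)
{
    #pragma acc data copyout(b[0:n])
    {
        #pragma acc parallel loop present(b)
        for (int i = 0; i < n; ++i)
            b[i] = 0.0;                    /* device initializes b */

        #pragma acc update self(b[0:n])    /* D -> H: fetch current values */
        for (int i = 0; i < n; ++i)        /* host-side modification */
            b[i] += 1.0;
        #pragma acc update device(b[0:n])  /* H -> D: publish the change */

        #pragma acc parallel loop present(b)
        for (int i = 0; i < n; ++i)
            b[i] *= 2.0;                   /* device works on updated b */
    }
}
```

Whether the pragmas are honored or ignored, every element ends up as (0 + 1) × 2 = 2.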
Data on device - Global data regions: enter data / exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers can be managed at a different place in the code than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data matches the scope of the program unit where it appears. For example:

  Zone        Scope
  Module      the program
  Function    the function
  Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Diagram: time line of the host and device copies of a, b, c and d]
• Entering the region with copyin(a,c) create(b) copyout(d): a and c are copied H→D, b and d are only allocated on the device.
• update self(b): b is copied D→H while the region is still open, so the host sees the value computed on the device.
• Later host-side modifications of a and b do not reach the device copies.
• Exiting the data region: d is copied D→H (copyout).
Data on device - Important notes

• Data transfers between the host and the device are costly; minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel constructs, if a data region opened with data (or enter data) encloses those parallel regions, you should use update to avoid unexpected behavior.

The update directive updates data either on the host or on the device:

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernels, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition.
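The slide's code listing did not survive extraction; a minimal C sketch of the kind of bug autocompare catches (our own example, not the course's) is a loop-carried dependency that becomes a race once the loop is parallelized:

```c
/* Hypothetical sketch: iteration i reads a[i-1], which another
   iteration writes. Run in parallel on the GPU this is a race, so the
   result can differ from the sequential CPU run -- exactly the kind of
   discrepancy autocompare is designed to flag. */
void carried_dependency(double *a, int n)
{
    #pragma acc parallel loop   /* unsafe: a[i] depends on a[i-1] */
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + 1.0;
}
```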
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution increases because the code is executed sequentially and redundantly by all the gangs.
Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

The loops are now distributed among the gangs.
Optimization - Game of Life

Let's optimize the data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.
Features — OpenACC 2.7 vs OpenMP 5.0

GPU offloading:
• PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H:
• Both standards implicitly handle the scalar and non-scalar variables used in the construct.

Explicit transfers H→D or D→H inside explicit parallel regions:
• Clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
Features — OpenACC 2.7 vs OpenMP 5.0

Data regions:
• declare() — add data to the global data region
• data / end data (within the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (within the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)
Features — OpenACC 2.7 vs OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel
exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET
exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE
exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel
exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE
exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
• Thibaut Véry — thibaut.very@idris.fr
• Rémi Lacroix — remi.lacroix@idris.fr
Introduction - The ways to GPU

Programming effort and technical expertise increase as you go down this list:
- Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
- Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient
- Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable; optimal performance
Introduction - Short history

OpenACC:
- http://www.openacc.org
- Cray, NVidia, PGI, CAPS
- First standard: 1.0 (11/2011)
- Latest standard: 3.0 (11/2019)
- Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target:
- http://www.openmp.org
- First standard: OpenMP 4.5 (11/2015)
- Latest standard: OpenMP 5.0 (11/2018)
- Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI support announced
Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 V100 GPUs); the host (CPU) and the devices (GPU) are linked through PLX switches and the network (Réseau); link bandwidths shown: 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s, 16 GB/s]
Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures]
Introduction - Execution model: OpenACC

[Figure: the CPU launches kernel1, kernel2 and kernel3 on the accelerator; execution is controlled by the host]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.
Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
- Coarse grain: gang
- Fine grain: worker
- Vectorization: vector
- Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
- Gang-Redundant (GR): all gangs run the same instructions redundantly
- Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
- Worker-Single (WS): one worker is active in GP or GR mode
- Worker-Partitioned (WP): work is shared between the workers of a gang
- Vector-Single (VS): one vector channel is active
- Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs × nbworkers × vectorlength

Important notes:
- No synchronization is possible between gangs
- The compiler can decide to synchronize the threads of a gang (all or part of them)
- The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
- The number of gangs is limited to 2^31 - 1
- The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
- The size of a worker (vectorlength) should be a multiple of 32
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model

[Figure: number of active threads at each point of a region launched with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8): 2 active threads in gang-redundant code, 10 inside !$ACC loop worker, 80 inside !$ACC loop vector and !$ACC loop worker vector]
Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: iterative cycle: pick a kernel, analyze, parallelize loops, optimize data, optimize loops]
PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90/pgfortran

Activate OpenACC:
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI
PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified

Complete documentation: https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
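Putting the flags from the last two slides together, a compile line might look like this (a sketch, assuming a Fortran source file named prog.f90; the flags themselves are the ones from the slides):

```shell
# Compile for a V100 (cc70) with OpenACC enabled and compiler feedback
pgfortran -acc -ta=tesla:cc70 -Minfo=accel prog.f90 -o prog
```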
PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop
(exemples/loop.f90)

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction
(exemples/reduction.f90)
PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data
(exemples/reduction_data_region.f90)
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
- Command line tool
- Gives basic information about: time spent in kernels, time spent in data transfers, how many times a kernel is executed, the number of gangs, workers and vector size mapped to hardware

nvprof:
- Command line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
- Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vectorlength × nbworkers]).
Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions. Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red
(exemples/profils/gol_opti2.prof)

Important notes:
- Only the GPU is profiled by default
- Option "--cpu-profiling on" activates CPU profiling
- Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
- To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions. Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

[Screenshot: the NVVP graphical interface]
1 Introduction
2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels/Teams, Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default (Gang-Redundant, Worker-Single, Vector-Single). The programmer has to share the work manually.
Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial
(exemples/gol_serial.f90)

Test:
- Size: 1000×1000
- Generations: 100
- Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size)
- Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels
(exemples/gol_kernels.f90)

Test:
- Size: 1000×1000
- Generations: 100
- Execution time: 1.951 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
- Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel
(exemples/gol_parallel.f90)

Test:
- Size: 1000×1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang."
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program
(exemples/parametres_paral.f90)

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
- The optimal number of gangs is highly dependent on the architecture; use num_gangs with care
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Other clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do, kept for compatibility
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel
(exemples/gol_parallel_loop.f90)
Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial
(exemples/loop_serial.f90)
- Active threads: 1
- Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GRWSVS.f90)
- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GPWSVS.f90)
- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GPWPVS.f90)
- Iterations are shared among the active workers of each gang
- Active threads: 10 × nbworkers
- Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GPWSVP.f90)
- Iterations are shared among the threads of the active worker of all gangs
- Active threads: 10 × vectorlength
- Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GPWPVP.f90)
- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 × nbworkers × vectorlength
- Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GRWPVS.f90)
- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 × nbworkers
- Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GRWSVP.f90)
- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 × vectorlength
- Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel
(exemples/loop_GRWPVP.f90)
- Each gang is assigned all iterations and its grid of threads distributes the work
- Active threads: 10 × nbworkers × vectorlength
- Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel
(exemples/loop_pb_sync.f90)
- Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel
(exemples/loop_pb_sync_corr.f90)
- Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused
(exemples/fused.f90)
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions: the variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel
(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing / example without coalescing]
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel
(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel
(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel
(exemples/loop_coalescing.f90)

                 Without coalescing   With coalescing
Time (ms)        43.9                 1.6

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure (animated over several slides): memory access patterns of threads T0 to T7 on the array for the three versions; without coalescing (gang(i), vector(j)) each thread walks its own row with strided accesses; with coalescing (gang(j), vector(i)) consecutive threads access consecutive memory locations; with gang-vector(i), seq(j) the accesses are again strided across threads]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine
(exemples/routine_wrong.f90)

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine
(exemples/routine.f90)
Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of data regions:

Computation offloading: offloading regions are associated with a local data region (serial, parallel, kernels).

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime: a data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array
Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable
By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p (e.g. pcopyin) for OpenACC 2.0 compatibility.
Other clauses
• no_create
• deviceptr
• attach
• detach
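As an illustration (a hypothetical C function, not from the slides), the movement clauses and create can be combined on one data region: b is only read on the device (copyin), a is only written back (copyout), and tmp is device-only scratch (create):

```c
/* b: H->D only; a: D->H only; tmp: allocated on the device, never transferred */
void scale_shift(int n, float *a, const float *b, float *tmp)
{
    #pragma acc data copyin(b[0:n]) copyout(a[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * b[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = tmp[i] + 1.0f;
    }
}
```

Without an accelerator the pragmas are ignored and the function computes the same values on the host.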
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.
! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.
// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran the last index of an assumed-size dummy array must be specified
• In C/C++ the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)
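A small illustration of the last note (hypothetical snippet): in C, `a[:n]` is equivalent to `a[0:n]`:

```c
/* Omitting the first index: a[:n/2] starts at the language default, 0 in C */
void zero_first_half(int n, int *a)
{
    #pragma acc parallel loop copy(a[:n/2])
    for (int i = 0; i < n / 2; ++i)
        a[i] = 0;
}
```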
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data region
Parallel region
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device.
You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
No! You have to move the beginning of the data region before the first loop.
Data on device
Naive version: two successive parallel regions, each with its own implicit data region; each region applies its own clauses to the four arrays A (copyin), B (copyout), C (copy) and D (create).

• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: the same two parallel regions, each still carrying its own clauses for A (copyin), B (copyout), C (copy) and D (create). Let's add a data region enclosing both.
Data on device
Optimize data transfers: with the enclosing data region, A, B and C are now transferred at entry and exit of the data region only.

• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers: inside the enclosed parallel regions the clauses only check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for A, update self for C).

• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90
Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90
With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
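A C sketch of the same unstructured pattern (hypothetical helper names): allocation and release of the device copy happen in different functions than the compute, which is exactly what constructors/destructors need:

```c
#include <stdlib.h>

static double *vec;
static int s;

void vec_setup(int n)
{
    s = n;
    vec = malloc((size_t)n * sizeof *vec);
    /* allocate on the device, no transfer */
    #pragma acc enter data create(vec[0:s])
}

void vec_compute(void)
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
    /* copy results back while the device copy still exists */
    #pragma acc update self(vec[0:s])
}

void vec_teardown(void)
{
    #pragma acc exit data delete(vec[0:s])
    free(vec);
}
```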
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Module → scope: the program
Zone: Function → scope: the function
Zone: Subroutine → scope: the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Time line diagram, host vs device]
• Entering the region: copyin(a,c) create(b) copyout(d) — a and c are transferred to the device, b and d are allocated there
• Kernels then modify the values on the device while the host copies become stale
• update self(b) copies b D→H during the region
• Exiting the region: only d is copied back (copyout(d))
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.
Directive updateThe update directive is used to update data either on the host or on the device
! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
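The pattern can be sketched in C (hypothetical function): inside a data region, an update self is what makes the host copy usable before the region ends:

```c
#define N 1000

static int a[N];

void fill_and_sync(void)
{
    #pragma acc data copyout(a)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = i;

        /* Without this update, the host copy of a would still be stale here */
        #pragma acc update self(a[0:N])
        /* the host can now read a[42] == 42 before the region ends */
    }
}
```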
Parallelism - Asynchronism
By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three overlap.]
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
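In C the three independent execution threads of the example might be written as follows (a sketch; it assumes a, b and c are already present on the device, e.g. through an enclosing data region or managed memory, so the update is legal):

```c
#define N 1000

static float a[N], b[N], c[N];

void three_streams(void)
{
    int i;
    for (i = 0; i < N; ++i)   /* b is initialized on the host */
        b[i] = (float)i;

    #pragma acc parallel loop async(1)          /* execution thread 1 */
    for (i = 0; i < N; ++i)
        a[i] = 42.0f;

    #pragma acc update device(b[0:N]) async(2)  /* transfer overlaps thread 1 */

    #pragma acc parallel loop async(3)          /* independent kernel, thread 3 */
    for (i = 0; i < N; ++i)
        c[i] = 11.0f;

    #pragma acc wait                            /* join all execution threads */
}
```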
Parallelism - Asynchronism - wait
To have correct results it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Diagram: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer before starting.]
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
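The dependency of the example can be sketched in C (hypothetical function; as above, b is assumed to be present on the device): the kernel that reads b carries a wait(2) on the transfer's execution thread:

```c
void transfer_then_increment(int n, int *b)
{
    int i;
    #pragma acc update device(b[0:n]) async(2)

    /* this kernel reads b, so it must wait for execution thread 2 */
    #pragma acc parallel loop wait(2)
    for (i = 0; i < n; ++i)
        b[i] = b[i] + 1;
}
```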
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped
Debugging - PGI autocompare
• To activate this feature just add autocompare to the compilation (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ! ... (listing truncated on the slide)

exemples/optigol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/optigol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features | OpenACC 2.7 (Directives/Clauses, Notes) | OpenMP 5.0 (Directives/Clauses, Notes)

GPU offloading
• PARALLEL — GR/WS/VS mode; each gang executes instructions redundantly ↔ TEAMS — create teams of threads, execute redundantly
• SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization ↔ (does not exist in OpenMP)

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present(), delete()
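The copyin/copyout ↔ map(to:)/map(from:) correspondence can be sketched on one loop (hypothetical functions; both versions compute the same thing):

```c
/* OpenACC version: copyin / copyout */
void scale_acc(int n, const float *in, float *out)
{
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}

/* OpenMP 5.0 version: map(to:) / map(from:) */
void scale_omp(int n, const float *in, float *out)
{
    #pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}
```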
Features | OpenACC 2.7 (Directives/Clauses, Notes) | OpenMP 5.0 (Directives/Clauses, Notes)

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM() — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO() — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ (does not exist in OpenMP)
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS
Features | OpenACC 2.7 (Directives/Clauses, Notes) | OpenMP 5.0 (Directives/Clauses, Notes)

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend — manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Comparison of CPUGPU cores
The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:
• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores
Introduction - The ways to GPU
Applications

Three ways to the GPU, by increasing programming effort and technical expertise:
• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance
Introduction - Short history
OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay
[Diagram: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The host (CPU) is connected to the devices (GPU) through PLX switches and to the network; bandwidths shown on the diagram: 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s, 12.4 GB/s.]
Introduction - Architecture CPU-GPU
Introduction - Execution model OpenACC
Execution controlled by the host

[Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model
NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
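These limits show up in the clauses that request an explicit grid. A sketch (hypothetical function) that respects them, keeping the vector length at a multiple of 32 and the grid within the safe size:

```c
/* 2 gangs x 5 workers x vector length 32 => 2*5*32 = 320 threads in total;
   the thread-grid size is 5*32 = 160 <= 256, and 32 is a multiple of 32. */
void iota(int n, int *a)
{
    #pragma acc parallel loop num_gangs(2) num_workers(5) vector_length(32)
    for (int i = 0; i < n; ++i)
        a[i] = i;
}
```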
Introduction - Execution model
!$ACC kernels num_gangs(2) num_workers(5) vector_length(8)

[Diagram: number of active threads for each construct with this configuration]
• !$ACC parallel (starts gangs): 2 active threads
• !$ACC loop worker: 10 active threads
• !$ACC loop vector: 80 active threads
• !$ACC parallel (starts gangs): 2 active threads
• !$ACC loop worker vector: 80 active threads
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Cycle diagram: Pick a kernel → Analyze → Parallelize loops → Optimize data → Optimize loops]
PGI compiler
During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
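Putting these options together, a typical compilation line for the V100 partition might look like this (illustrative file and program names):

```shell
# -acc enables OpenACC, -ta targets V100 GPUs (cc70),
# -Minfo=accel reports what the compiler offloaded
pgfortran -acc -ta=tesla:cc70 -Minfo=accel -o prog prog.f90
```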
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90
program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90
To avoid CPU-GPU communication a data region is added
!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1.
In the report, the grid is the number of gangs and the block is the size of one gang ([vectorlength x nbworkers]).
Analysis - Code profiling nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvprof ltoptionsgt executable_file ltarguments of executable filegt
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof
Important notes:
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f if you want to overwrite a file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels, Teams, Parallel
  Work sharing
    Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>
...

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.
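As a complement to the Fortran examples, here is a minimal C sketch of the pragma syntax above. The function name and data clauses are illustrative; with a compiler that does not enable OpenACC, the pragma is simply ignored and the loop runs serially, producing the same result.

```c
#include <stddef.h>

/* Offload a simple array operation; clauses declare the data movement.
   Without OpenACC support the pragma is ignored and the loop runs on the host. */
void scale_array(double *a, const double *b, size_t n)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}
```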
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang!"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility
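A short C sketch of the collapse clause, for illustration (function name is hypothetical): the two tightly nested loops are merged into a single iteration space of rows × cols iterations, which is then shared at gang level. Compiled without OpenACC, the pragma is ignored and the result is identical.

```c
#include <stddef.h>

/* collapse(2) merges the two nested loops into one iteration space. */
void init_matrix(double *a, size_t rows, size_t cols)
{
    #pragma acc parallel loop collapse(2) copyout(a[0:rows*cols])
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            a[i * cols + j] = (double)(i + j);
}
```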
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: somme = 9784.5213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: somme = 10000.000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
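A minimal C sketch of a sum reduction (function name is illustrative): each thread gets a private copy of s, and the partial sums are combined with + at the end of the loop. Compiled without OpenACC, the pragma is ignored and the serial loop gives the same total.

```c
#include <stddef.h>

/* reduction(+:s) privatizes s per thread and combines the partial sums. */
double array_sum(const double *a, size_t n)
{
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```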
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

[Figures: example with coalescing / example without coalescing]
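The same principle can be sketched in C; note that C arrays are row-major, so contiguity is on the last index, the opposite of the Fortran examples below. The function name is illustrative, and without an OpenACC compiler the pragmas are ignored.

```c
/* C is row-major: a[i*n + j] and a[i*n + j + 1] are contiguous, so the
   vector loop should run over j for coalesced accesses. */
void fill_matrix(double *a, int n)
{
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)   /* consecutive threads touch consecutive elements */
            a[i * n + j] = 1.14;
    }
}
```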
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 onto array elements for gang(i)/vector(j) (without coalescing: threads access strided locations), gang(j)/vector(i) (with coalescing: threads access contiguous locations) and gang-vector(i)/seq(j)]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
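The C counterpart of the routine directive, as a sketch (names are illustrative): the routine pragma is placed at the declaration of the called function, with the parallelism level (here seq). Compiled without OpenACC, the pragmas are ignored and the loop runs serially.

```c
/* Declare a device-callable version of the function, executed sequentially
   by each thread that calls it. */
#pragma acc routine seq
static int doubled(int x) { return 2 * x; }

void fill_doubled(int *a, int n)
{
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = doubled(i);
}
```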
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
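The clauses above can be combined on one construct. Here is a hedged C sketch (a SAXPY-style kernel, names illustrative): x is only read, so copyin is enough; y is read and written, so it needs copy. Without OpenACC, the pragma is ignored and the result is unchanged.

```c
/* copyin(x): H->D only; copy(y): H->D at entry and D->H at exit. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```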
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region to before the first loop, so that the first parallel region is also covered by it.
Data on device

Naive version: two successive parallel regions, each with its own implicit data region. The arrays A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each region.
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers: enclose both parallel regions in a single data region. A, B and C are now transferred at the entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes:
One benefit of the enter data and exit data directives is that data transfers to the device can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Diagram: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device without transfer. An update self(b) in the middle of the region copies the device value of b back to the host. When the region ends, only d is copied D→H; the host copies of a, b and c are left unchanged.]
Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers;
- kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: three timelines. Synchronous: kernel 1, the data transfer and kernel 2 run one after the other. "Kernel 1 and transfer async": kernel 1 overlaps with the data transfer, kernel 2 runs afterwards. "Everything is async": kernel 1, the data transfer and kernel 2 all overlap.]
! b is initialized on the host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
- The argument of the wait clause (or wait directive) is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.
[Diagram: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer to complete before starting.]
! b is initialized on the host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

The loop iterations are distributed among the gangs.
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

The H→D and D→H transfers have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
- PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
- SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
- KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
- scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
- clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
- copyin() (copy H→D when entering the construct) ↔ map(to:)
- copyout() (copy D→H when leaving the construct) ↔ map(from:)
- copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
- create() (created on the device, no transfers) ↔ map(alloc:)
- present(), delete() ↔ —
Features: OpenACC 2.7 ↔ OpenMP 5.0

Data regions
- declare() (adds data to the global data region) ↔ —
- data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
- enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
- update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
- kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ does not exist in OpenMP
- loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)
Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
- routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism) ↔ —

Asynchronism
- async() / wait (execute kernels and transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manage asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Introduction - The ways to GPU

Applications can reach the GPU through three approaches, ordered by increasing programming effort and technical expertise:

- Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
- Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
- Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance
Introduction - Short history

OpenACC
- http://www.openacc.org
- Cray, NVidia, PGI, CAPS
- First standard: 1.0 (11/2011)
- Latest standard: 3.0 (11/2019)
- Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
- http://www.openmp.org
- First standard: OpenMP 4.5 (11/2015)
- Latest standard: OpenMP 5.0 (11/2018)
- Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay
PLX PLX
900 GBs
16 GBs
128 GBs
25 GBs
124 GBs
1 co
nve
rged
no
de
on
Je
an Z
ay
Ho
st (CP
U)
Device
(GP
U)
Reacuteseau
Architecture of a converged node on Jean Zay (40 cores 192 GB of memory 4 GPU V100)
208 GBs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 8 77
Introduction - Architecture CPU-GPU
Introduction - Execution model OpenACC

Execution controlled by the host

[Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC, the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
- Coarse grain: gang
- Fine grain: worker
- Vectorization: vector
- Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
- Gang-Redundant (GR): all gangs run the same instructions redundantly
- Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
- Worker-Single (WS): one worker is active, in GP or GR mode
- Worker-Partitioned (WP): work is shared between the workers of a gang
- Vector-Single (VS): one vector channel is active
- Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread grid (worker x vector). One worker is actually a vector of vectorlength threads, so the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
- No synchronization is possible between gangs.
- The compiler can decide to synchronize the threads of a gang (all or part of them).
- The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77
Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
- The number of gangs is limited to 2^31 - 1.
- The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024.
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
- The size of a worker (vectorlength) should be a multiple of 32.
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.
Introduction - Execution model

[Diagram: number of active threads for regions created with num_gangs(2), num_workers(5), vector_length(8). "!$ACC parallel" starts the gangs with 2 active threads (one per gang); "!$ACC loop worker" raises the count to 10; "!$ACC loop vector" raises it to 80; between the loops the count falls back to 2. With the combined "!$ACC loop worker vector", the count goes directly from 2 to 80.]
Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything runs on the GPU

[Diagram: iterative cycle: analyze → pick a kernel → parallelize loops → optimize loops → optimize data.]
PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI
PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
- Command-line tool
- Gives basic information about:
  - the time spent in kernels
  - the time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
- Command-line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
- Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).
Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
- Only the GPU is profiled by default.
- The option "--cpu-profiling on" activates CPU profiling.
- The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel.
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels/Teams, Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-Redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. The instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region that are not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
$ACC s e r i a ldo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end s e r i a l
exemplesgol_serialf90
Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Arrays present inside the kernels region that are not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
- The parameters of the generated parallel regions (gangs, workers, vector size) are independent.
- Loop nests are executed consecutively.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77
Execution offloading - Kernels
$ACC kerne lsdo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end kerne ls
exemplesgol_kernelsf90
Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests annotated with a loop directive have their iterations spread among gangs.

Data: default behavior
- Arrays present inside the region that are not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! gang-redundant
!$ACC loop     ! work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important note
- The number of gangs, the number of workers and the vector size are constant inside the region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the behavior expected by the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size; the number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - LoopsLoops are at the heart of OpenACC parallelism The loop directive is responsible for sharing thework (ie the iterations of the associated loop)It could also activate another level of parallelism This point (automatic nesting) is a criticaldifference between OpenACC and OpenMP-GPU for which the creation of threads relies on theprogrammer
Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies
              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)
Notes
• You might also see the do directive, kept for compatibility
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx
Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 9784521.3 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 10000000.0 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

[Figures: example with «coalescing» and example with «no coalescing»]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure, built up over several slides: thread-to-memory mapping for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads T0..T7 touch memory locations separated by a stride, so their accesses cannot be merged. With coalescing, gang(j)/vector(i): threads T0..T7 touch consecutive memory locations and their accesses are merged. gang-vector(i)/seq(j): each thread walks its own row sequentially.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in an implicit data region.]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
Naive version
[Diagram: two parallel regions, each with its own implicit data region; each region performs its own copyin, copyout, copy and create for the variables A, B, C and D.]

• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers
[Diagram: the two parallel regions (each with its implicit data region) still perform their own copyin, copyout, copy and create for A, B, C and D.]

Let's add a data region.
Data on device
Optimize data transfers
[Diagram: an enclosing data region now carries the copyin, copyout, copy and create clauses; the two inner parallel regions no longer transfer anything.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1
Data on device
Optimize data transfers
[Diagram: inside the enclosing data region, the inner parallel regions use present for A, B, C and D; update device and update self are used where the host and device copies must be synchronized.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Figure: time line of a data region on host and device. At entry, copyin(a,c) create(b) copyout(d) copies a and c to the device and allocates b and d there. Kernels then modify the device copies while the host values stay unchanged. An update self(b) copies b D→H in the middle of the region. At region exit, copyout(d) copies d back to the host.]
Data on device - Important notes
• Data transfers between the host and the device are costly; it is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines. Synchronous: Kernel 1, data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps the data transfer. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread-number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution-thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the data transfer before starting; Kernel 1 runs concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
3 GPU Debugging
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
4 Annex: Optimization example
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90 (listing truncated on the slide)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode; each gang executes instructions redundantly) | OpenMP TEAMS (create teams of threads, execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) | OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) | no OpenMP equivalent

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, with data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) | map(to:)
• copyout() (copy D→H when leaving the construct) | map(from:)
• copy() (copy H→D when entering and D→H when leaving) | map(tofrom:)
• create() (created on device, no transfers) | map(alloc:)
• present(), delete()
Features: OpenACC 2.7 | OpenMP 5.0

Data region
• declare() (add data to the global data region) |
• data / end data (inside the same programming unit) | TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) | TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) | UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) | UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) | no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) | distribute (share iterations among TEAMS)
Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async() / wait (execute kernels/transfers asynchronously, explicit sync) ↔ nowait / depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)
!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop2D.f90)
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

(exemples/loop3D.f90)
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Comparison of CPUGPU cores
The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:
• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 NVidia V100: 32×80×4 = 10240 cores
Introduction - The ways to GPU
Applications can reach the GPU through three approaches, ordered by increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance
Introduction - Short history
OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay
[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU), the network, and the bandwidths of the various links (900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s, 12.4 GB/s).]
Introduction - Architecture CPU-GPU
Introduction - Execution model OpenACC
Execution is controlled by the host.

[Figure: timeline showing the CPU successively launching kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active, in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs × nbworkers × vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
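As a worked illustration of the formula above, here is a hedged sketch (the clause values and array name are illustrative, not from the slides):

```fortran
program threads_count
  implicit none
  integer, parameter :: n = 128000
  integer :: a(n), i
  ! Hypothetical launch parameters: 1000 gangs, 4 workers per gang,
  ! each worker being a vector of 32 threads.
  !$ACC parallel loop num_gangs(1000) num_workers(4) vector_length(32)
  do i = 1, n
    a(i) = i
  end do
  ! nbthreads = nbgangs * nbworkers * vectorlength
  !           = 1000 * 4 * 32 = 128000 threads
end program threads_count
```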
Introduction - Execution model
NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2³¹ − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
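The restrictions above can be respected explicitly; a hedged sketch with illustrative values (not taken from the slides):

```fortran
program p100_limits
  implicit none
  integer, parameter :: n = 100000
  real(8) :: a(n)
  integer :: i
  ! vector_length is a multiple of 32, and the thread-grid size
  ! num_workers * vector_length = 8 * 32 = 256 stays within the
  ! conservative limit of 256 mentioned above.
  !$ACC parallel num_workers(8) vector_length(32)
  !$ACC loop gang worker vector
  do i = 1, n
    a(i) = 1.0_8
  end do
  !$ACC end parallel
end program p100_limits
```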
Introduction - Execution model
[Figure: number of active threads in regions launched with num_gangs(2), num_workers(5), vector_length(8): 2 threads in gang-redundant code ("ACC parallel" or "ACC kernels", which start the gangs), 10 inside an "ACC loop worker", and 80 inside an "ACC loop worker" combined with an "ACC loop vector", or inside an "ACC loop worker vector".]
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: optimization cycle: pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]
PGI compiler
During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation. The compiler does implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation: https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information
Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)
PGI Compiler - Compiler information
Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to get an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength × nbworkers]).
Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  92.907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  3.5920ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels, Teams, Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for Fortran and C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  ! (same Game of Life loop nests as in exemples/gol_serial.f90:
  !  copy world to oworld, compute the neighbours and update world,
  !  count the cells alive)
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  ! (same Game of Life loop nests as in exemples/gol_serial.f90)
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (with acc=noautopar)
• Execution time: 50.67 s (with acc=autopar)

Without any loop directive, the sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification (by default, acc=autopar, PGI parallelizes the loops automatically).
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause      | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive "do" for compatibility
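A hedged sketch of the collapse and private clauses (array names, sizes and the tmp variable are illustrative, not from the slides):

```fortran
program collapse_private
  implicit none
  integer, parameter :: n = 512, m = 256
  real(8) :: a(n,m), b(n,m)
  real(8) :: tmp
  integer :: i, j
  b = 1.0_8
  ! collapse(2) merges the two tightly nested loops into a single
  ! iteration space of n*m iterations shared among gangs/workers/vectors;
  ! tmp is privatized so that each thread works on its own copy.
  !$ACC parallel loop collapse(2) private(tmp)
  do j = 1, m
    do i = 1, n
      tmp = 2.0_8 * b(i,j)
      a(i,j) = tmp + 1.0_8
    end do
  end do
end program collapse_private
```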
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)
Work sharing - Loops
serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

• Iterations are shared among the active workers of each gang
• Active threads: 10 × nbworkers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vectorlength
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × nbworkers × vectorlength
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × nbworkers
• Number of operations: 10 × nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vectorlength
• Number of operations: 10 × nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × nbworkers × vectorlength
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

• Result: somme = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both merged constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  ! (neighbour computation and update of world,
  !  as in exemples/gol_serial.f90)
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing
Principles:
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: examples of memory access patterns with and without coalescing.]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

          | Without coalescing | With coalescing
Time (ms) | 43.9               | 1.6

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three versions. Without coalescing (gang(i)/vector(j)), the threads T0..T7 of a worker access elements of the same row, i.e. strided memory locations. With coalescing (gang(j)/vector(i)), threads T0..T7 access consecutive elements of a column, which are contiguous in memory in Fortran. With gang-vector(i)/seq(j), each thread walks its own row sequentially.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
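As a hedged illustration of how these clauses combine (the variables n, b, res and tmp are made up for this sketch, not taken from the course examples):

```fortran
! Sketch: b is only read on the device (copyin), res is only written
! back to the host (copyout), tmp lives exclusively on the device (create).
!$ACC data copyin(b(1:n)) copyout(res(1:n)) create(tmp(1:n))
!$ACC parallel loop
do i = 1, n
   tmp(i) = 2 * b(i)      ! tmp never leaves the device
   res(i) = tmp(i) + 1    ! res is copied D->H when the region ends
enddo
!$ACC end data
```

Only two transfers occur (b in, res out); tmp causes an allocation but no traffic.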
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90
Notes:
• Compute region reached 100 times
• Data region entered and exited 100 times (so potentially 200 data transfers)

Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes:
• Compute region reached 100 times
• Data region entered and exited 1 time (so potentially 2 data transfers)

Optimal?
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
(Same slide repeated: the code of exemples/parallel_data_single.f90 shown above.)
Notes:
• Compute region reached 100 times
• Data region entered and exited 1 time (so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
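A hedged sketch of that final placement (assuming, as in the example, that a is only written on the device, so a single copyout suffices):

```fortran
! Sketch: one data region enclosing both compute regions.
! a is allocated on the device, filled there, and copied out once at the end.
!$ACC data copyout(a(1:1000))
!$ACC parallel loop
do i = 1, 1000
   a(i) = i
enddo
do l = 1, 100
   !$ACC parallel loop
   do i = 1, 1000
      a(i) = a(i) + 1
   enddo
enddo
!$ACC end data
```

With this placement the first loop needs no copy-in at all, so only one D to H transfer remains.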
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version: two successive parallel regions (each with its implicit data region) operate on the variables A, B, C and D, with the clauses copyin(A), copyout(B), copy(C) and create(D) repeated on each region.
• Transfers: 8
• Allocations: 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers: the same two parallel regions (each with its implicit data region) still carry copyin(A), copyout(B), copy(C) and create(D). Let's add a data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers: an enclosing data region now carries copyin(A), copyout(B), copy(C) and create(D) for both parallel regions.

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers: inside the data region, the parallel regions use the present clause for A, B, C and D, with update device and update self directives where fresh values are needed.

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
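No listing accompanies this slide; a minimal hedged sketch mirroring the self example (the array b and size n are made up for this illustration):

```fortran
! Sketch: b is modified on the host inside a data region, then pushed
! to the device with update device before the kernel reuses it.
!$ACC data create(b(1:n))
b(:) = 1                       ! host-side initialization
!$ACC update device(b(1:n))    ! H -> D transfer of the fresh values
!$ACC parallel loop
do i = 1, n
   b(i) = b(i) + 1             ! the device now sees the host values
enddo
!$ACC end data
```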
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notes: one benefit of the enter data and exit data directives is managing the data transfers to the device at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d)]
• On entry, a and c are copied H→D; b and d are allocated on the device.
• During the region, update self(b) copies the device value of b back to the host.
• On exit of the data region, d is copied D→H and the device memory is released.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel executions if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases the argument of async is optional.
It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, Data transfer, Kernel 2 in sequence; with async, Kernel 1 and the transfer overlap; with everything async, all three overlap]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait.
A wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the data transfer before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition:
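The slide's screenshot is not reproduced in this transcript; a hedged sketch of the kind of race autocompare catches (a reduction variable updated concurrently, with made-up names) could be:

```fortran
! Sketch: sum is updated by concurrent iterations without a
! reduction(+:sum) clause -> race condition on the GPU.
! The GPU result then differs from the CPU reference, and a build
! made with -ta=tesla:cc60:autocompare stops at the first mismatch.
!$ACC parallel loop
do i = 1, n
   sum = sum + a(i)
enddo
```

Adding reduction(+:sum) to the directive removes the race and the autocompare report.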
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading:
• OpenACC PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | OpenMP TEAMS: creates teams of threads that execute redundantly
• OpenACC SERIAL: only 1 thread executes the code | OpenMP TARGET: GPU offloading, initial thread only
• OpenACC KERNELS: offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables (both models)

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on device, no transfers | map(alloc:)
• present(), delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features | OpenACC 2.7 | OpenMP 5.0

Data region:
• declare(): add data to the global data region
• data / end data: inside the same programming unit | TARGET DATA MAP(to/from:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization:
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features | OpenACC 2.7 | OpenMP 5.0

• Routines: routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism
• Asynchronism: async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend: manages asynchronism and task dependencies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
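The translation table above also maps reductions; as a hedged sketch (not one of the course examples, variable names assumed), the reduction loop from the PGI section would translate as:

```fortran
! OpenACC version
!$acc parallel loop reduction(+:sum)
do i = 1, n
   sum = sum + a(i)
end do

! OpenMP 5.0 equivalent (sketch): the reduction clause carries over,
! and sum must be mapped tofrom so the host sees the result.
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD reduction(+:sum) map(tofrom:sum)
do i = 1, n
   sum = sum + a(i)
end do
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD
```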
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
Introduction - The ways to GPU
[Diagram: programming effort and technical expertise increase from libraries to directives to programming languages]

Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA):
• Minimum change in the code
• Maximum performance

Directives (OpenACC, OpenMP 5.0):
• Portable, simple, low intrusiveness
• Still efficient

Programming languages (CUDA, OpenCL):
• Complete rewriting, complex
• Non-portable
• Optimal performance
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 6 77
Introduction - Short history
OpenACC
• http://www.openacc.org/
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org/
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 7 77
Introduction - Architecture of a converged node on Jean Zay
[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100); the host (CPU) and devices (GPU) are linked through PLX switches; indicative bandwidths: 900 GB/s (GPU memory), 128 GB/s (host memory), 124 GB/s, 208 GB/s, 25 GB/s, 16 GB/s (PCIe/network)]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 8 77
Introduction - Architecture CPU-GPU
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 9 77
Introduction - Execution model OpenACC
Execution controlled by the host

[Figure: the CPU launches kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 10 77
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
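Within those limits, the grid can be forced explicitly; a hedged sketch (values chosen to respect the 1024-thread grid limit and the multiple-of-32 worker size; the loop body is illustrative):

```fortran
! Sketch: 8 workers x 32-wide vectors = 256 threads per gang (<= 1024),
! and the vector length is a multiple of 32 as recommended.
!$ACC parallel num_gangs(1024) num_workers(8) vector_length(32)
!$ACC loop gang worker vector
do i = 1, n
   a(i) = 2 * i
enddo
!$ACC end parallel
```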
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77
Introduction - Execution model
[Figure: number of active threads at each construct, for !$ACC kernels num_gangs(2) num_workers(5) vector_length(8): 2 active threads in gang-redundant mode after !$ACC parallel (starts gangs), 10 when a loop is split over workers (!$ACC loop worker), 80 when split over workers and vectors (!$ACC loop vector or !$ACC loop worker vector)]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Cycle: Pick a kernel, Analyze, Parallelize loops, Optimize loops, Optimize data]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77
PGI compiler
During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler does implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77
PGI Compiler - -ta options
Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77
PGI Compiler - Compiler information
Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77
PGI Compiler - Compiler information
Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
• Graphical interface for nvprof
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1.
The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).
Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvprof <options> executable_file <arguments of executable file>
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:   53.51%  2.70662s    500  5.4132ms    448ns  6.8519ms  [CUDA memcpy HtoD]
                  36.72%  1.85742s    300  6.1914ms    384ns  9.2907ms  [CUDA memcpy DtoH]
                   6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                   2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                   0.08%  4.0968ms    100  40.968us  40.122us  41.725us  gol_52_gpu
                   0.01%  359.27us    100  3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof
Important notes:
- Only the GPU is profiled by default
- The option "--cpu-profiling on" activates CPU profiling
- The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
- To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvvp <options> executable <--args arguments of executable>
or if you want to open a profile generated with nvprof
$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
    Offloaded Execution
        Add directives
        Compute constructs: Serial, Kernels, Teams, Parallel
        Work sharing: Parallel loop, Reductions, Coalescing
        Routines
        Data Management
        Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default (gang-redundant, worker-single, vector-single); the programmer has to share the work manually.
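As a sketch of the difference, here is the same loop offloaded with both constructs in C/OpenACC. This is an illustration, not a course example; the function names `with_kernels` and `with_parallel` are invented. Compiled without OpenACC support, the pragmas are ignored and the loops simply run sequentially with identical results.

```c
#include <assert.h>

#define N 1000

/* With "kernels" the compiler decides where parallelism goes. */
void with_kernels(int *a) {
    #pragma acc kernels
    for (int i = 0; i < N; ++i)
        a[i] = i;               /* one kernel generated for this loop nest */
}

/* With "parallel loop" the programmer shares the iterations explicitly. */
void with_parallel(int *a) {
    #pragma acc parallel loop   /* programmer-managed work sharing */
    for (int i = 0; i < N; ++i)
        a[i] = 2 * i;
}
```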
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)               +oworld(r+1,c)  +&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 1236.40 s
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size)
- Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)               +oworld(r+1,c)  +&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among the gangs.

Data: default behavior
- Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)               +oworld(r+1,c)  +&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 1204.67 s (noautopar)
- Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
- The optimal number of gangs is highly dependent on the architecture: use num_gangs with care
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.
Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                               X
loop vector                X                               X
Work sharing - Loops
Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do instead of loop, kept for compatibility
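As an illustration of the collapse clause, here is a minimal C/OpenACC sketch (the `scale` function is hypothetical, not a course example). collapse(2) merges the two tightly nested loops into a single iteration space of R*C, giving the runtime more iterations to distribute; without an OpenACC compiler the pragma is ignored and the nest runs sequentially.

```c
#include <assert.h>

#define R 100
#define C 200

/* Scale every element of an R x C matrix; the two loops are collapsed
 * into one R*C iteration space for work sharing. */
void scale(double m[R][C], double factor) {
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < R; ++i)
        for (int j = 0; j < C; ++j)
            m[i][j] *= factor;
}
```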
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)               +oworld(r+1,c)  +&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by the gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x #workers
- Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the worker of all gangs
- Active threads: 10 x vector length
- Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x #workers x vector length
- Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among the workers
- Active threads: 10 x #workers
- Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared on the threads of the active worker
- Active threads: 10 x vector length
- Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations and the grid of threads distributes the work
- Active threads: 10 x #workers x vector length
- Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

- Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

- Result: sum = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels

  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to the private copies to get the final value of the variable.

Reduction operators:

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
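As a minimal sketch of a "+" reduction in C/OpenACC (illustrative function, not from the course): each thread gets a private copy of `sum`, and the partial results are combined at the end of the loop. Sequential execution, with the pragma ignored, gives the same value.

```c
#include <assert.h>

/* Sum of squares 1^2 + 2^2 + ... + n^2 with an OpenACC "+" reduction.
 * "sum" is privatized per thread and combined at the end of the loop. */
long sum_of_squares(int n) {
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += (long)i * i;
    return sum;
}
```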
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)               +oworld(r+1,c)  +&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles:
- Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
- For loop nests, the loop which has vector parallelism should have contiguous access to memory

Example with "coalescing" / Example with "no coalescing"
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          Without coalescing   With coalescing
Tps (ms)  439                  16
The memory coalescing version is 27 times faster
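The same reasoning applies in C, with the index order reversed: C arrays are row-major (the opposite of Fortran), so for coalesced access the vector-parallel loop should run over the last index. A hedged sketch (illustrative names, timings depend on the device; pragmas are ignored without an OpenACC compiler):

```c
#include <assert.h>

#define N 512

static double a[N][N];   /* file scope keeps the array off the stack */

/* Coalesced pattern for row-major C: gang over rows i, vector over the
 * last index j, so consecutive threads touch adjacent addresses
 * a[i][j], a[i][j+1], ... */
void fill_coalesced(void) {
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}

/* Accessor for checking results. */
double element(int i, int j) { return a[i][j]; }
```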
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0-T7 onto the elements of a 2D array, shown for the three loop configurations. Without coalescing — gang(i), vector(j) — consecutive threads access elements that are far apart in memory. With coalescing — gang(j), vector(i) — consecutive threads access contiguous elements (Fortran column order); the gang-vector(i), seq(j) variant also gives contiguous access.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated to programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach
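A hedged C/OpenACC sketch putting the main clauses together (the `saxpy_plus_one` function is illustrative, not a course example): `in` is only read on the device (copyin), `out` is only written (copyout), and `tmp` never leaves the device (create). Compiled without OpenACC the pragmas are ignored, the arrays stay in host memory, and the result is identical.

```c
#include <assert.h>

#define N 1000

/* out[i] = alpha*in[i] + 1, staged through a device-only temporary. */
void saxpy_plus_one(const float *in, float *out, float alpha) {
    float tmp[N];
    #pragma acc data copyin(in[0:N]) copyout(out[0:N]) create(tmp[0:N])
    {
        #pragma acc parallel loop   /* tmp is written on the device only */
        for (int i = 0; i < N; ++i)
            tmp[i] = alpha * in[i];
        #pragma acc parallel loop   /* out is copied back at region exit */
        for (int i = 0; i < N; ++i)
            out[i] = tmp[i] + 1.0f;
    }
}
```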
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified
- In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
- The shape must be specified when using a slice
- If the first index is omitted, it is set to the default of the language (C/C++: 0, Fortran: 1)
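The C/C++ restriction can be sketched as follows (illustrative `make_iota` function, not from the course): for a dynamically allocated array the compiler cannot know the size, so the `[first:count]` shape is mandatory in the clause. Without an OpenACC compiler the pragma is ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a vector of n doubles and fill it with 0, 1, 2, ...
 * copyout(v[0:n]) gives the shape explicitly: first index 0, n elements. */
double *make_iota(int n) {
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])  /* shape is mandatory here */
    for (int i = 0; i < n; ++i)
        v[i] = (double)i;
    return v;
}
```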
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region carries its own implicit data region.
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure; the variables listed in its clauses become visible on the device. Use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You still have to move the beginning of the data region before the first loop.
Data on device
Naive version: two successive parallel regions, each with its own implicit data region. The variables A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each region.
- Transfers: 8
- Allocations: 2
Optimize data transfers: let's add a data region enclosing both parallel regions.
With the enclosing data region, A, B and C are now transferred at the entry and exit of the data region, and D is allocated once.
- Transfers: 4
- Allocation: 1
The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
- Transfers: 6
- Allocation: 1
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

Without update:

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

The a array is not initialized on the host before the end of the data region.
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0.
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90
With update
The a array is initialized on the host after theupdate directive
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in the device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.
program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine
module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Timeline figure: state of a, b, c, d on the host and on the device across a data region opened with copyin(a,c) create(b) copyout(d). The host sets a=10, b=-4, c=42, d=0; at entry the device receives a and c, b is created uninitialized and d is only allocated. Kernels on the device produce a=10, b=26, c=42, d=158. An update self(b) copies b (D→H) during the region; at the exit of the region, copyout(d) brings d=158 back to the host.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, ie one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Timeline figure, three scenarios: fully synchronous (kernel 1, data transfer, kernel 2 run one after the other); kernel 1 and the transfer overlapped with async; everything asynchronous (kernel 1, the transfer and kernel 2 all overlap)]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread-number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait() or the wait directive is a list of integers identifying execution threads. For example, wait(1,2) waits until the completion of execution threads 1 and 2.
[Timeline figure: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer before starting]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 / OpenMP 5.0 (directives, clauses and notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads, executed redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 / OpenMP 5.0 (directives, clauses and notes)

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)
Features: OpenACC 2.7 / OpenMP 5.0 (directives, clauses and notes)

• routines: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)
Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU) connected through PLX switches, and the network (Réseau). Bandwidths shown on the figure: 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s, 16 GB/s]
Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures]
Introduction - Execution model OpenACC

[Figure: timeline showing the CPU launching kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads, so the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength
Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (ie nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model

[Figure: number of active threads in a region created with num_gangs(2), num_workers(5), vector_length(8). Entering !$ACC parallel or !$ACC kernels starts the gangs: 2 active threads (the gang leaders). !$ACC loop worker activates the workers: 10 active threads. !$ACC loop vector activates the vector channels: 80 active threads. !$ACC loop worker vector activates both at once: 80 active threads. After each loop the count drops back to 2.]
Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: optimization cycle: pick a kernel, analyze, parallelize loops, optimize data, optimize loops]
PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler does implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum = 0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1), num_workers(1), vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - LoopsLoops are at the heart of OpenACC parallelism The loop directive is responsible for sharing thework (ie the iterations of the associated loop)It could also activate another level of parallelism This point (automatic nesting) is a criticaldifference between OpenACC and OpenMP-GPU for which the creation of threads relies on theprogrammer
Clausesbull Level of parallelism
minus gang the iterations of the subsequent loop are distributed block-wise among gangs (going from GRto GP)
minus worker within gangs the workerrsquos threads are activated (going from WS to WP) and the iterations areshared between those threads
minus vector each worker activates its SIMT threads (going from VS to VP) and the work is shared betweenthose threads
minus seq the iterations are executed sequentially on the acceleratorminus auto The compiler analyses the subsequent loop and decides which mode is the most suitable to
respect dependancies
Work sharing (loopiterations)
Activation of newthreads
loop gang Xloop worker X Xloop vector X X
Work sharing - Loops
Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)
Notes:
• You might see the directive do, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel
exemples/gol_parallel_loop.f90
Work sharing - Loops

Execution: serial

!$acc serial
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end serial
exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution: GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution: GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution: GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution: GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution: GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution: GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution: GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution: GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code located after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel
exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel
exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.
For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
end program fused
exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in table 1 is executed to get the final value of the variable.
Operations:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Table 1: Reduction operators
Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel
exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example without «coalescing»
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) | 439                | 16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: mapping of the threads T0–T7 of a vector onto the elements of A for the three variants. With gang(i)/vector(j) (without coalescing), consecutive threads access elements that are far apart in memory. With gang(j)/vector(i) (with coalescing), consecutive threads access contiguous elements. With gang-vector(i)/seq(j), each thread walks sequentially through its own row.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine
exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine
exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo
exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;
exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para
exemples/parallel_data.f90

(The parallel region implicitly opens a data region around it.)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para
exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data
exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Data on device

[Figure: two successive parallel regions using four arrays — A (copyin), B (copyout), C (copy), D (create).]

Naive version: each parallel region carries its own data clauses, so each region opens its own data region with its own transfers and allocations.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the data clauses check for data presence, you can use the present clause on the parallel regions to make the code clearer. To make sure the updates between the two regions are done, use the update directive (update device, update self).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)
exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)
exemples/update_dir.f90

With update:
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata
exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module
exemples/declare.f90
Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b is allocated on the device. The device then computes new values of b and d. update self(b) copies b back D→H while the region is still open; host-side modifications made afterwards are not visible on the device. At the end of the region, only d is copied back D→H (copyout).]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)
exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution (kernel 1, data transfer, kernel 2 in sequence), "kernel 1 and transfer async" (kernel 1 overlaps the data transfer), and "everything is async" (kernel 1, data transfer and kernel 2 all overlap).]

! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
  c(i) = 1.1
enddo
exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the data transfer to complete.]

! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 1.1
enddo
exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data
exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features | OpenACC 2.7 (directives/clauses: notes) | OpenMP 5.0 (directives/clauses: notes)
GPU offloading | PARALLEL: GR/WS/VS mode, each gang executes instructions redundantly | TEAMS: creates teams of threads that execute redundantly
 | SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
 | KERNELS: offloading + automatic parallelization | X: does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables | scalar variables, non-scalar variables
Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET: data movement defined from the host perspective
 | copyin(): copy H→D when entering the construct | map(to:)
 | copyout(): copy D→H when leaving the construct | map(from:)
 | copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
 | create(): created on device, no transfers | map(alloc:)
 | present(), delete() |
Features | OpenACC 2.7 (directives/clauses: notes) | OpenMP 5.0 (directives/clauses: notes)
Data region | declare(): adds data to the global data region |
 | data / end data: inside the same programming unit | TARGET DATA MAP(to/from:) / END TARGET DATA: inside the same programming unit
 | enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
 | update self(): update D→H | UPDATE FROM(): copy D→H inside a data region
 | update device(): update H→D | UPDATE TO(): copy H→D inside a data region
Loop parallelization | kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | X: does not exist in OpenMP
 | loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Features | OpenACC 2.7 (directives/clauses: notes) | OpenMP 5.0 (directives/clauses: notes)
Routines | routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism |
Asynchronism | async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend: manage asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1D_OMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2D_OMP.f90)
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3D_OMP.f90)
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 V100 GPUs), showing the host (CPU), the devices (GPUs) and the network, with the bandwidths of the different links (900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s, 16 GB/s)]
Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures]
Introduction - Execution model OpenACC

[Figure: execution controlled by the host — the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.
Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (ie nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model

[Figure: number of active threads at each point of a region opened with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8) — 2 active threads in the gang-redundant parts (one per gang), 10 inside an !$ACC loop worker (gangs × workers), 80 inside an !$ACC loop vector or !$ACC loop worker vector (gangs × workers × vector length)]
Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops]
PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation. The compiler will do implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

(exemples/loop.f90)

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

(exemples/reduction.f90)
PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler:
• Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength × nbworkers]).
Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  6.8519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  9.2907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  359.27us    100  3.5920us  3.3920us  4.0640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler: NVVP

[Figure: screenshot of the nvvp graphical interface]
2 Parallelism
Offloaded Execution: add directives; compute constructs (serial, kernels/teams, parallel); work sharing (parallel loop, reductions, coalescing); routines; data management; asynchronism
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran/OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++/OpenACC:

#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
   a = 2          ! Gang-redundant
   !$ACC loop     ! Work sharing
   do i=1,n
      ...
   enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (acc=noautopar)
• Execution time: 50.67 s (acc=autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (ie the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do; it is kept for compatibility
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)
Work sharing - Loops

Execution serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

(exemples/fused.f90)
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing / example with no coalescing]
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0..T7 on a 2D array — without coalescing (gang(i), vector(j)) each thread walks its own row with a stride; with coalescing (gang(j), vector(i)) consecutive threads access consecutive elements; the gang-vector(i), seq(j) variant also produces contiguous accesses]
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)
With coalescing
gang(j)vector(i)
gang-vector(i)seq(j)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
With coalescing
gang(j)vector(i)T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
T1
T1
X
X
T2
T2
X
X
T3
T3
X
X
T4
T4
X
X
T5
T5
X
X
T6
T6
X
X
T7
T7
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
X
X
T1
T1
X
X
X
X
T2
T2
X
X
X
X
T3
T3
X
X
X
X
T4
T4
X
X
X
X
T5
T5
X
X
X
X
T6
T6
X
X
X
X
T7
T7
X
X
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with a routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

(exemples/routine_wrong.f90)
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with a routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

(exemples/routine.f90)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on devices by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible use the declare directive.
Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and the last index.
! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.
// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictions
• In Fortran the last index of an assumed-size dummy array must be specified
• In C/C++ the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.
(Same code as on the previous slide: exemples/parallel_data_single.f90)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version
[Diagram: two successive parallel regions, each with its own implicit data region; the arrays A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each parallel region]
• Transfers: 8
• Allocations: 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers
[Diagram: the two parallel regions (each with its implicit data region) still handle A (copyin), B (copyout), C (copy) and D (create) separately]
Let's add a data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers
[Diagram: a data region now encloses the two parallel regions; the clauses copyin, copyout, copy and create are moved to the data region]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers
[Diagram: the data region keeps the copyin/copyout/copy/create clauses; inside the parallel regions the variables are marked present, with update device for A and update self for C where refreshed values are needed]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))   ! the update directive is commented out here
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host / Device
[Diagram: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device without transfer. Kernels then modify the device copies while the host copies keep their old values. update self(b) copies b D→H. When the data region ends, d is copied back D→H (copyout) and the device memory is freed]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: three time lines. Synchronous: kernel 1, data transfer and kernel 2 run one after the other. "Kernel 1 and transfer async": kernel 1 overlaps the data transfer. "Everything is async": kernel 1, the data transfer and kernel 2 all overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example wait(1,2) says to wait until completion of execution threads 1 and 2.
[Diagram: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer to complete before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel loop
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC parallel (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ OpenMP teams (creates teams of threads that execute redundantly)
• OpenACC serial (only 1 thread executes the code) ↔ OpenMP target (GPU offloading, initial thread only)
• OpenACC kernels (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on parallel/serial/kernels ↔ OpenMP: clauses on target (data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region
• OpenACC declare(): add data to the global data region
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP target data map(to:/from:) / end target data (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP target enter data / target exit data (opening and closing anywhere in the code)
• OpenACC update self(): update D→H ↔ OpenMP update from: copy D→H inside a data region
• OpenACC update device(): update H→D ↔ OpenMP update to: copy H→D inside a data region

Loop parallelization
• OpenACC kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ OpenMP distribute: share iterations among teams
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Routines
• OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• OpenACC async()/wait: execute kernels and transfers asynchronously, explicit sync ↔ OpenMP nowait/depend: manages asynchronism and task dependencies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
Introduction - Architecture CPU-GPU
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 9 77
Introduction - Execution model OpenACC
Execution controlled by the host
Launch kernel1
Launch kernel2
Launch kernel3
CPU
Acc
In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 10 77
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs groups of 32 threads are formed)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77
Introduction - Execution model
[Diagram: active threads for !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). "parallel" starts the gangs with 2 active threads (one per gang, GR/WS/VS); "loop worker" activates the workers (10 threads); "loop vector" or "loop worker vector" activates the full grid (80 threads); the count drops back to 10 and then 2 as the partitioned loops end]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelize)
3. Optimize data transfers and loops (Optimize)
4. Repeat steps 1 to 3 until everything is on the GPU
Optimize data
Analyze
Parallelizeloops
Optimize loops
Pick a kernel
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77
PGI compiler
During this training course we will use PGI compilers for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation. The compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77
PGI Compiler - -ta options
Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77
PGI Compiler - Compiler information
Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77
PGI Compiler - Compiler information
Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvvp ltoptionsgt executable lt--args arguments of executablegt
or if you want to open a profile generated with nvprof
$ nvvp ltoptionsgt progprof
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77
Analysis - Graphical profiler NVVP
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77
1 Introduction
2 ParallelismOffloaded Execution
Add directivesCompute constructsSerialKernelsTeamsParallel
Work sharingParallel loopReductionsCoalescing
RoutinesRoutines
Data ManagementAsynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.
Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-single
- Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2           ! Gang-redundant
!$ACC loop      ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
- The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              | Work sharing (loop iterations) | Activation of new threads
loop gang     |               X                |
loop worker   |               X                |            X
loop vector   |               X                |            X
Work sharing - Loops

Clauses
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes
- You might also encounter the do directive, kept for compatibility.
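To illustrate how these clauses combine, here is a hedged C sketch (a hypothetical matrix_sum routine, not one of the course examples): the two loops of the nest are merged with collapse(2) and the partial sums are combined with a reduction clause.

```c
#include <assert.h>

/* Sum of an n x m matrix. collapse(2) merges the loop nest into one
   iteration space; reduction(+:total) gives each thread a private
   copy of total and combines them at the end. Without an OpenACC
   compiler the pragma is ignored and the nest runs sequentially,
   producing the same result. */
double matrix_sum(int n, int m, const double a[n][m])
{
    double total = 0.0;
    #pragma acc parallel loop collapse(2) reduction(+:total)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            total += a[i][j];
    return total;
}
```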
Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
- Active threads: 1
- Number of operations: nx

Execution GR-WS-VS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 × nx

Execution GP-WS-VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP-WP-VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
- Iterations are shared among the active workers of each gang
- Active threads: 10 × number of workers
- Number of operations: nx

Execution GP-WS-VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
- Iterations are shared among the threads of one worker of each gang
- Active threads: 10 × vector length
- Number of operations: nx

Execution GP-WP-VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 × number of workers × vector length
- Number of operations: nx

Execution GR-WP-VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 × number of workers
- Number of operations: 10 × nx

Execution GR-WS-VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 × vector length
- Number of operations: 10 × nx

Execution GR-WP-VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
- Each gang is assigned all the iterations and its grid of threads distributes the work
- Active threads: 10 × number of workers × vector length
- Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
- Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
- Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: an example with coalescing and an example without coalescing]
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) |        43.9        |       1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figures: thread access patterns for the three variants above, Fortran arrays being column-major. Without coalescing, gang(i)/vector(j): the vector threads T0..T7 of a worker each walk along a row, so consecutive threads access memory locations a full column apart and the accesses cannot be merged. With coalescing, gang(j)/vector(i): threads T0..T7 access consecutive memory locations within a column and the accesses are merged. With gang-vector(i)/seq(j): each thread walks sequentially through its row.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
- copyin: the variable is copied H→D; the memory is allocated when entering the region.
- copyout: the variable is copied D→H; the memory is allocated when entering the region.
- copy: copyin + copyout.

No data movement
- create: the memory is allocated when entering the region.
- present: the variable is already on the device.
- delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
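As a hedged C sketch of how these clauses combine (a hypothetical routine, not one of the course examples): an input array is only read on the device (copyin), the result is only written (copyout), and a scratch buffer never needs to exist on the host side of the region (create).

```c
#include <assert.h>

/* Square each input element into out[], via a device-only scratch
   buffer tmp[]:
   - in  is only read on the device  -> copyin
   - out is only written             -> copyout
   - tmp needs no transfer at all    -> create
   Without OpenACC support the pragmas are ignored and everything
   runs directly on host memory, with the same result. */
void squares(int n, const double *in, double *out, double *tmp)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop present(in, tmp)
        for (int i = 0; i < n; ++i)
            tmp[i] = in[i] * in[i];

        #pragma acc parallel loop present(tmp, out)
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i];
    }
}
```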
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
- The shape must be specified when using a slice.
- If the first index is omitted, it is taken to be the default of the language (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its associated data region.]
Data on device - Parallel regions

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region before the first loop, so that the first parallel region also works on data already present on the device.
Data on device - Optimizing data transfers

[Figures: two parallel regions (each with its implicit data region) using four arrays A (copyin), B (copyout), C (copy) and D (create).]

- Naive version: each parallel region performs its own transfers and allocations: 8 transfers, 2 allocations.
- Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region only: 4 transfers, 1 allocation.
- The clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device at entry of the second region, update self after the first): 6 transfers, 1 allocation.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self

With update (exemples/update_dir.f90): the same code with the update lines active:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

The a array is updated on the host as soon as the update directive executes.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
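This slide has no listing, so here is a hedged C sketch (a hypothetical routine, not one of the course examples): the host changes a scalar between two kernels inside one data region, and update device pushes the new value H→D before the second kernel reads it.

```c
#include <assert.h>

/* `scale` is changed on the host between two kernels inside one data
   region; `update device` pushes the new value H->D before the second
   kernel. With the pragmas ignored (no OpenACC compiler) host memory
   is used directly and the result is the same. */
void scaled_fill(int n, double *a)
{
    double scale = 1.0;
    #pragma acc data copyout(a[0:n]) copyin(scale)
    {
        #pragma acc parallel loop present(a, scale)
        for (int i = 0; i < n; ++i)
            a[i] = scale;          /* a[i] = 1.0 */

        scale = 2.0;               /* changed on the host only */
        #pragma acc update device(scale)

        #pragma acc parallel loop present(a, scale)
        for (int i = 0; i < n; ++i)
            a[i] += scale;         /* a[i] = 3.0 */
    }
}
```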
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: host/device time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D, b and d are only allocated on the device. The device then computes new values while the host copies are unchanged. An update self(b) in the middle of the region copies b D→H. At the end of the data region, only d is copied back D→H.]
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers;
- kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with Kernel 1 and the transfer async, the two overlap; with everything async, all three overlap.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
- The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Diagram: Kernel 1 and the data transfer run concurrently on execution threads 1 and 2; Kernel 2 waits for the transfer to complete.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
Debugging - PGI autocompare
- To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare).
- Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90
Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90
Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90
Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 / OpenMP 5.0 (directives/clauses and notes)

GPU offloading
- OpenACC PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
- OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
- OpenACC KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfer H→D or D→H
- OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
- OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
- copyin() (copy H→D when entering the construct) ↔ map(to:)
- copyout() (copy D→H when leaving the construct) ↔ map(from:)
- copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
- create() (created on the device, no transfers) ↔ map(alloc:)
- present(), delete()
Features: OpenACC 2.7 / OpenMP 5.0 (directives/clauses and notes)

Data region
- OpenACC declare (add data to the global data region)
- OpenACC data ... end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) ... END TARGET DATA (inside the same programming unit)
- OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
- OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
- OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ no OpenMP equivalent
- OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)
Features: OpenACC 2.7 / OpenMP 5.0 (directives/clauses and notes)

Routines
- OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
- OpenACC async()/wait (execute kernels and transfers asynchronously, with explicit synchronization) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts
Contacts
- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Execution model OpenACC
[Diagram: the host (CPU) successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU): kernels, data transfers and memory allocations are managed by the host.
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
- coarse grain: gang
- fine grain: worker
- vectorization: vector
- sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
- Gang-Redundant (GR): all gangs run the same instructions redundantly
- Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
- Worker-Single (WS): one worker is active, in GP or GR mode
- Worker-Partitioned (WP): work is shared between the workers of a gang
- Vector-Single (VS): one vector channel is active
- Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

  nbthreads = nbgangs * nbworkers * vectorlength
Important notes
- No synchronization is possible between gangs.
- The compiler can decide to synchronize the threads of a gang (all or part of them).
- The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).
Introduction - Execution model
NVidia P100 restrictions on gang/worker/vector:
- The number of gangs is limited to 2^31 - 1.
- The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024.
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
- The size of a worker (vectorlength) should be a multiple of 32.
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.
Introduction - Execution model
[Diagram: number of active threads for !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). A parallel construct starts the gangs with only the gang leaders active (2 threads); a loop worker activates the workers of each gang (10 threads); a loop vector or loop worker vector activates the full thread-grid (80 threads); after each loop the count drops back to 2.]
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Diagram: cycle — analyze, pick a kernel, parallelize loops, optimize data, optimize loops.]
PGI compiler
During this training course we will use PGI compilers for the pieces of code containing OpenACC directives.

Information
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code
PGI_ACC_TIME
- Command line tool
- Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
- Command line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
- Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. In the report, the grid is the number of gangs and the block is the size of one gang ([vectorlength x nbworkers]).
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   4.0968ms  100    40.968us  40.122us  41.725us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
- Only the GPU is profiled by default.
- Option "--cpu-profiling on" activates CPU profiling.
- Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel.
- To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1); only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated; one kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device; only one kernel is generated; the execution is redundant by default (gang-redundant, worker-single, vector-single); the programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s
Execution offloading - Kernels: in the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
- Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
- The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause       Work sharing (loop iterations)  Activation of new threads
loop gang    X
loop worker  X                               X
loop vector  X                               X
Work sharing - Loops
Clauses
- Merge tightly nested loops: collapse(number of loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes
- You might see the directive do, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 * nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 * nbworkers
- Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the active worker of all gangs
- Active threads: 10 * vectorlength
- Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 * nbworkers * vectorlength
- Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 * nbworkers
- Number of operations: 10 * nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 * vectorlength
- Number of operations: 10 * nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations, and its grid of threads distributes the work
- Active threads: 10 * nbworkers * vectorlength
- Number of operations: 10 * nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)
!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)
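For comparison, the same reduction pattern can be sketched in C. This is an illustrative example, not from the course material (the function name `sum_reduction` is invented); compiled without OpenACC support, the pragma is ignored and the loop simply runs serially, which still yields the correct sum:

```c
/* Sum of an array with an OpenACC reduction clause.
   Each gang/worker/vector lane gets a private copy of `sum`;
   the partial sums are combined with `+` at the end of the loop. */
double sum_reduction(const double *a, int n)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

Without the reduction clause, all threads would update the same `sum` concurrently and the result would be wrong, exactly as in the gang example above.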
Pierre-François Lavallée, Thibaut Véry 41 / 77
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:
program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-François Lavallée, Thibaut Véry 42 / 77
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to the private copies to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-François Lavallée, Thibaut Véry 43 / 77
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44 / 77
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop with vector parallelism should have contiguous access to memory.

Example with «coalescing» — Example with «no coalescing»

Pierre-François Lavallée, Thibaut Véry 45 / 77
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90
            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
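The same principle can be sketched in C, keeping in mind that C arrays are row-major, so the contiguous index is the last one (the opposite of Fortran). This is an illustrative example, not from the course material; the function names and the size N are invented, and without an OpenACC compiler both versions run serially:

```c
enum { N = 64 };

/* Strided access: the inner (vector) loop runs over the first index i,
   so consecutive threads would touch elements N doubles apart
   in row-major C — no coalescing possible. */
void init_strided(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int j = 0; j < N; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            a[i][j] = 1.14;
    }
}

/* Contiguous access: the inner (vector) loop runs over the last index j,
   which is contiguous in row-major C — accesses can be coalesced. */
void init_coalesced(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```

Both functions produce the same array; only the memory access pattern of the vector loop differs, which is what drives the performance gap measured above.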
Pierre-François Lavallée, Thibaut Véry 46 / 77
Parallelism - Memory accesses and coalescing

[Figure: access pattern of threads T0..T7 of a worker on a 2D Fortran array, for the three loop mappings]
• Without coalescing — gang(i), vector(j): each thread of the worker walks along a row, so consecutive threads access memory locations separated by a large stride; the accesses cannot be merged.
• With coalescing — gang(j), vector(i): consecutive threads access consecutive elements of a column (contiguous in Fortran); the accesses are merged.
• With coalescing — gang vector(i), seq(j): consecutive threads handle consecutive values of i while each thread loops sequentially over j; the accesses are also merged.

Pierre-François Lavallée, Thibaut Véry 47 / 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
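As an illustration of the main movement clauses (a hypothetical C example, not from the course material; the function name `scale` is invented): b is only read on the device, a is only written, and tmp is device-only scratch data.

```c
/* b is only read on the device   -> copyin  (H→D at entry)
   a is only written on the device -> copyout (D→H at exit)
   tmp is scratch data             -> create  (allocated on the device,
                                               never transferred) */
void scale(double *a, const double *b, int n)
{
    double tmp[n];  /* C99 variable-length array used as scratch */
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n]) create(tmp[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[i] = 2.0 * b[i];
        a[i]   = tmp[i] + 1.0;
    }
}
```

Choosing the weakest sufficient clause (copyin instead of copy, create instead of copyin) avoids transfers that the kernel does not actually need.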
Pierre-François Lavallée, Thibaut Véry 50 / 77
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198 included
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51 / 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region implicitly opens a data region]

Pierre-François Lavallée, Thibaut Véry 52 / 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77
Data on device

[Figure: data transfers for arrays A (copyin), B (copyout), C (copy) and D (create) across two successive parallel regions]

• Naive version: each parallel region carries its own implicit data region, so the transfers for A, B and C and the allocation of D are repeated for every region. Transfers: 8, allocations: 2.
• Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred only at entry and exit of the data region. Transfers: 4, allocations: 1.
• The clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H). Transfers: 6, allocations: 1.

Pierre-François Lavallée, Thibaut Véry 54 / 77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that data transfers to/from the device can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
Data on device - Global data regions declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is declared. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77
Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d)]

• Entering the region: a and c are copied H→D; b and d are allocated on the device without any transfer.
• update self(b): the device value of b is copied D→H while the region is still open.
• Until the next update or the end of the region, modifications on one side are not visible on the other.
• Exiting the region: d is copied D→H; the device copies of a, b and c are discarded without transfer.

Pierre-François Lavallée, Thibaut Véry 59 / 77
Data on device - Important notes
• Data transfers between the host and the device are costly; minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines — synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other; Kernel 1 and transfer async: Kernel 1 overlaps with the data transfer; everything async: Kernel 1, the data transfer and Kernel 2 all overlap]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels may need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that has to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: Kernel 1 runs asynchronously while the data transfer proceeds; Kernel 2 waits for the transfer to complete]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 63 / 77
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the target options at compilation (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 66 / 77
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution increases because, without work-sharing directives, the loops are executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 70 / 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region
• OpenACC declare (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vector lanes) ↔ OpenMP distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

• Routines: OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• Asynchronism: OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 76 / 77
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77
Introduction - Execution model OpenACC
OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.
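The distinction between the redundant and partitioned modes can be seen in a small C sketch (an illustrative example, not from the course material; the function name `modes` is invented). Compiled without OpenACC support, the pragmas are ignored and the code runs serially:

```c
/* Gang-Redundant vs Gang-Partitioned (illustrative sketch).
   Inside `parallel` but outside any `loop gang`, every gang executes
   the statements redundantly (GR); a `loop gang` switches to GP mode
   and distributes the iterations across the gangs. */
void modes(double *a, const double *b, int n)
{
    #pragma acc parallel num_gangs(4)
    {
        double factor = 2.0;      /* GR: every gang computes this redundantly */
        #pragma acc loop gang     /* GP: iterations shared between the gangs  */
        for (int i = 0; i < n; ++i)
            a[i] = factor * b[i];
    }
}
```

Redundant execution is harmless for cheap scalar work like `factor`, but any substantial computation left outside a work-sharing loop is repeated by every gang.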
Pierre-François Lavallée, Thibaut Véry 11 / 77
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads, so the total number of threads is:

nbthreads = nbgangs × nbworkers × vectorlength

Important notes
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Pierre-François Lavallée, Thibaut Véry 12 / 77
Introduction - Execution model

NVIDIA P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 − 1.
• The thread-grid size (i.e. nb_workers × vector_length) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vector_length) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.
Introduction - Execution model

[Figure: number of active threads in a region launched with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). Outside any loop directive the gangs run in GR,WS,VS mode (2 active threads); !$ACC loop worker activates 2 × 5 = 10 threads; !$ACC loop vector or !$ACC loop worker vector activates 2 × 5 × 8 = 80 threads.]
Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:
1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelize)
3. Optimize data transfers and loops (Optimize)
4. Repeat steps 1 to 3 until everything runs on the GPU

[Figure: porting cycle — Pick a kernel → Analyze → Parallelize loops → Optimize loops → Optimize data]
PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVIDIA in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90 / pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation. The compiler performs implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVIDIA hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information

Information available with -Minfo=accel

  program loop
    integer :: a(10000)
    integer :: i
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
  end program loop

(exemples/loop.f90)

  program reduction
    integer :: a(10000)
    integer :: i
    integer :: sum=0
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo
  end program reduction

(exemples/reduction.f90)
PGI Compiler - Compiler information

Information available with -Minfo=accel

  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

  !$ACC data create(a(1:10000))
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
  !$ACC end data

(exemples/reduction_data_region.f90)
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length × nb_workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

  $ nvprof <options> executable_file <arguments of executable file>

  Type            Time(%)  Time      Calls  Avg       Min       Max       Name
  GPU activities:  53.51%  2.70662s    500  5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default.
• The option "--cpu-profiling on" activates CPU profiling.
• The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

  $ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

  $ nvvp <options> prog.prof

[Screenshot: the NVVP graphical interface]
1. Introduction
2. Parallelism
   - Offloaded Execution
     - Add directives
     - Compute constructs: serial, kernels, teams, parallel
     - Work sharing: parallel loop, reductions, coalescing
     - Routines
     - Data management
     - Asynchronism
3. GPU Debugging
4. Annex: Optimization example
5. Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6. Contacts
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):
  !$ACC directive <clauses>
  ...
  !$ACC end directive

C/C++ (OpenACC):
  #pragma acc directive <clauses>
  { ... }

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
  !$ACC serial
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end serial
Offloaded execution - Directive serial

  !$ACC serial
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                oworld(r-1,c  )             +oworld(r+1,c  )+&
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
  !$ACC kernels
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels

(Same Game of Life code as before, enclosed in !$ACC kernels ... !$ACC end kernels)

(exemples/gol_kernels.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

  !$ACC parallel
  a = 2            ! gang-redundant
  !$ACC loop       ! work sharing
  do i=1,n
    ...
  enddo
  !$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region.
Execution offloading - parallel directive

(Same Game of Life code as before, enclosed in !$ACC parallel ... !$ACC end parallel, without any loop directive)

(exemples/gol_parallel.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (-acc=noautopar)
• Execution time: 5.067 s (-acc=autopar, default)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification (by default, PGI auto-parallelizes the loops inside parallel regions).
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

  program param
    integer :: a(1000)
    integer :: i
    !$ACC parallel num_gangs(10) num_workers(1) &
    !$ACC& vector_length(128)
    print *, "Hello! I am a gang"
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

  Clause       | Work sharing (loop iterations) | Activation of new threads
  loop gang    | X                              |
  loop worker  | X                              | X
  loop vector  | X                              | X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility.
Work sharing - Loops

(Same Game of Life code as before, inside a parallel region; a !$ACC loop directive is added so that the iterations of the loop nest are shared among the gangs:)

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    ...
  enddo
  !$ACC end parallel

(exemples/gol_parallel_loop.f90)
Work sharing - Loops

serial
  !$acc serial
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end serial
(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS
  !$acc parallel num_gangs(10)
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS
  !$acc parallel num_gangs(10)
  !$acc loop gang
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP,WP,VS
  !$acc parallel num_gangs(10)
  !$acc loop gang worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP
  !$acc parallel num_gangs(10)
  !$acc loop gang vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector_length
• Number of operations: nx

Execution GP,WP,VP
  !$acc parallel num_gangs(10)
  !$acc loop gang worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector_length
• Number of operations: nx

Execution GR,WP,VS
  !$acc parallel num_gangs(10)
  !$acc loop worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP
  !$acc parallel num_gangs(10)
  !$acc loop vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector_length
• Number of operations: 10 × nx

Execution GR,WP,VP
  !$acc parallel num_gangs(10)
  !$acc loop worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
(exemples/loop_GRWPVP.f90)
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector_length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop gang reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel
(exemples/loop_pb_sync.f90)
• Result: somme = 97845213 (wrong value)

  !$acc parallel
  !$acc loop worker vector
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel
(exemples/loop_pb_sync_corr.f90)
• Result: somme = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

  program fused
    integer :: a(1000), b(1000), i
    !$ACC kernels <clauses>
    !$ACC loop <clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
    !$ACC end kernels
    ! The construct above is equivalent to:
    !$ACC kernels loop <kernels and loop clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
  end program fused

(exemples/fused.f90)
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is executed to get the final value of the variable.

Reduction operators:

  Operation | Effect      | Language(s)
  +         | Sum         | Fortran, C(++)
  *         | Product     | Fortran, C(++)
  max       | Maximum     | Fortran, C(++)
  min       | Minimum     | Fortran, C(++)
  &         | Bitwise and | C(++)
  |         | Bitwise or  | C(++)
  &&        | Logical and | C(++)
  ||        | Logical or  | C(++)
  iand      | Bitwise and | Fortran
  ior       | Bitwise or  | Fortran
  ieor      | Bitwise xor | Fortran
  .and.     | Logical and | Fortran
  .or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

(Same Game of Life code as before, inside a parallel region; the cell-counting loop now carries a reduction clause:)

  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells

(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]
Parallelism - Memory accesses and coalescing
$acc p a r a l l e l10 $acc loop gang
do i =1 nx $acc loop vec to rdo j =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop_nocoalescingf90
$acc p a r a l l e l20 $acc loop gang
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_825 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
$acc p a r a l l e l30 $acc loop gang vec to r
do i =1 nx $acc loop seqdo j =1 nx
A( i j )=114_835 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
Without Coalescing With CoalescingTps (ms) 439 16
The memory coalescing version is 27 times faster
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
Parallelism - Memory accesses and coalescing

[Figure, built up over four slides: mapping of the threads T0-T7 of a worker onto a 2D array A(i,j). Without coalescing (gang(i), vector(j)), consecutive threads access memory locations separated by a large stride. With coalescing (gang(j), vector(i)), consecutive threads access consecutive memory locations; the gang-vector(i) with seq(j) variant also yields contiguous accesses.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel
    !$ACC loop
    do i=1,s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

(exemples/routine_wrong.f90)
Parallelism - Routines

Correct version:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel copyout(array)
    !$ACC loop
    do i=1,s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

(exemples/routine.f90)
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p_ for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

  ! We copy a 2D array to the GPU for matrix a.
  ! In Fortran we can omit the shape: copy(a)
  !$ACC parallel loop copy(a(1:1000,1:1000))
  do i=1,1000
    do j=1,1000
      a(i,j) = 0
    enddo
  enddo
  ! We copy columns 100 to 199 included back to the host
  !$ACC parallel loop copy(a(:,100:199))
  do i=1,1000
    do j=100,199
      a(i,j) = 42
    enddo
  enddo

(exemples/arrayshape.f90)

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

  // We copy the array a by giving the first element and the size of the array
  #pragma acc parallel loop copy(a[0:1000][0:1000])
  for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
      a[i][j] = 0;
  // Copy columns 99 to 198
  #pragma acc parallel loop copy(a[0:1000][99:100])
  for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
      a[i][j] = 42;

(exemples/arrayshape.c)

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))   ! data region + parallel region
    !$ACC loop
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program para

(exemples/parallel_data.f90)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para
(exemples/parallel_data_multi.f90)
Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)
Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data
(exemples/parallel_data_single.f90)
Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)
Optimal?
Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
Naive version: two successive parallel regions (each with its implicit data region) handle the arrays A (copyin), B (copyout), C (copy) and D (create); every clause is honored at each region boundary.
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: the same two parallel regions still move A (copyin), B (copyout), C (copy) and D (create) independently. Let's add a data region.
Data on device
Optimize data transfers: with a data region enclosing the two parallel regions, A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1
Data on device
Optimize data transfers: inside the data region, the data clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done (update device at entry, update self at exit of a compute region), use the update directive.
• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)
(exemples/update_dir_comment.f90)

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)
(exemples/update_dir.f90)

With update: the a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata
(exemples/enter_data.f90)
Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: module → scope: the program
Zone: function → scope: the function
Zone: subroutine → scope: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module
(exemples/declare.f90)
Data on device - Time line

Diagram: with copyin(a,c) create(b) copyout(d), a, b, c and d get device copies at entry of the data region (only a and c are transferred H→D); an update self(b) during the region copies b D→H; at exit of the region only d is copied D→H (copyout), so host values modified later on the device but not updated stay stale on the host.
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)
(exemples/update.f90)
Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Diagram: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, those two overlap; with everything async, all three overlap.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo
(exemples/async_slides.f90)
Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

Diagram: Kernel 2 waits for the transfer while Kernel 1 runs asynchronously.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo
(exemples/async_wait_slides.f90)
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      ...
(exemples/opti/gol_opti1.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
(exemples/opti/gol_opti2.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data
(exemples/opti/gol_opti3.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• OpenACC PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Data region:
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)
Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Routines:
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• OpenACC async()/wait (execute kernels and transfers asynchronously, with explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel
(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1, nx
!$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
(exemples/loop2D.f90)

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET
(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
!$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE
(exemples/loop2DOMP.f90)
OpenACC:

!$acc parallel
!$acc loop gang
do k=1, nx
!$acc loop worker
  do j=1, nx
!$acc loop vector
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel
(exemples/loop3D.f90)

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
!$OMP PARALLEL DO
  do j=1, nx
!$OMP SIMD
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE
(exemples/loop3DOMP.f90)
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Execution model
The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
Introduction - Execution model
NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32
Introduction - Execution model
Diagram: inside "!$ACC kernels num_gangs(2) num_workers(5) vector_length(8)", the number of active threads varies with the enclosing directives: 2 while only the parallel construct is active (one thread per gang), 10 inside "loop worker", 80 inside "loop vector" or "loop worker vector", and back to 2 between the loops.
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

Diagram: the cycle "pick a kernel → analyze → parallelize loops → optimize loops → optimize data" is repeated.
PGI compiler
During this training course we will use PGI compilers for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90 / pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm
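As a sketch, hypothetical compile lines combining the options above (source and output file names are placeholders):

```shell
# V100 target, with compiler feedback on the accelerator code:
pgfortran -acc -ta=tesla:cc70 -Minfo=accel prog.f90 -o prog

# P100 target with unified (managed) memory:
pgcc -acc -ta=tesla:cc60,managed -Minfo=accel prog.c -o prog
```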
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop
(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
!$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction
(exemples/reduction.f90)
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data
(exemples/reduction_data_region.f90)
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about: time spent in kernels, time spent in data transfers, how many times a kernel is executed, and the number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it ispossible to have an automatic profiling of thecode by setting the following environmentvariable PGI_ACC_TIME=1The grid is the number of gangs The block is thesize of one gang ([vectorlength times nbworkers])
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red
(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
Offloaded Execution
Add directives
Compute constructs: Serial, Kernels/Teams, Parallel
Work sharing: Parallel loop, Reductions, Coalescing
Routines
Data Management
Asynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):
!$ACC directive <clauses>
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading
To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1, generations
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial
(exemples/gol_serial.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1, generations
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels
(exemples/gol_kernels.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1, generations
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel
(exemples/gol_parallel.f90)

acc=noautopar
Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size
program param
  integer :: a(1000)
  integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90
Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.
Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies
Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility
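As a sketch of how these clauses combine, here is a C version (names are hypothetical; the `#pragma acc` lines are simply ignored by a compiler without OpenACC support, so the code also runs serially on the host):

```c
/* Sum of an n x m matrix: the two tightly nested loops are merged with
 * collapse(2), and the per-thread partial sums are combined at the end
 * of the loop with a reduction(+:...) clause. */
double matrix_sum(int n, int m, double a[n][m])
{
    double sum = 0.0;
    #pragma acc parallel loop collapse(2) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            sum += a[i][j];
        }
    }
    return sum;
}
```

With a 2x3 matrix containing 1..6, `matrix_sum(2, 3, a)` returns 21.0.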
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77
Work sharing - Loops
!$ACC parallel
  do g=1, generations
!$ACC loop
    do r=1, rows
      do c=1, cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1, rows
      do c=1, cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1, rows
      do c=1, cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, ":", cells
  enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77
Work sharing - Loops
serial
!$acc serial
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx
Execution GRWSVS
!$acc parallel num_gangs(10)
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx
Execution GPWSVS
!$acc parallel num_gangs(10)
!$acc loop gang
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77
Execution GPWPVS
!$acc parallel num_gangs(10)
!$acc loop gang worker
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx
Execution GPWSVP
!$acc parallel num_gangs(10)
!$acc loop gang vector
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx
Execution GPWPVP
!$acc parallel num_gangs(10)
!$acc loop gang worker vector
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx
Execution GRWPVS
!$acc parallel num_gangs(10)
!$acc loop worker
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx
Execution GRWSVP
!$acc parallel num_gangs(10)
!$acc loop vector
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx
Execution GRWPVP
!$acc parallel num_gangs(10)
!$acc loop worker vector
  do i=1,nx
    A(i) = B(i)*i
  end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
  do i=1,nx
    A(i) = 1.0_8
  end do
!$acc loop gang reduction(+:somme)
  do i=nx,1,-1
    somme = somme + A(i)
  end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)
!$acc parallel
!$acc loop worker vector
  do i=1,nx
    A(i) = 1.0_8
  end do
!$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
    somme = somme + A(i)
  end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.
For example with kernels directive
program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed in the table below is applied to obtain the final value of the variable.
Operations
Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran
Reduction operators
Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions on arrays are possible, but the implementation is lacking in PGI (for the moment).
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
Parallelism - Reductions example
!$ACC parallel
  do g=1, generations
!$ACC loop
    do r=1, rows
      do c=1, cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1, rows
      do c=1, cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
!$ACC loop reduction(+:cells)
    do r=1, rows
      do c=1, cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, ":", cells
  enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77
Parallelism - Memory accesses and coalescing
Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.
[Figure: example memory access patterns with «coalescing» and with «no coalescing»]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77
Parallelism - Memory accesses and coalescing
! Without coalescing
!$acc parallel
!$acc loop gang
  do i=1,nx
!$acc loop vector
    do j=1,nx
      A(i,j) = 1.14_8
    end do
  end do
!$acc end parallel

exemples/loop_nocoalescing.f90

! With coalescing
!$acc parallel
!$acc loop gang
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j) = 1.14_8
    end do
  end do
!$acc end parallel

! With coalescing
!$acc parallel
!$acc loop gang vector
  do i=1,nx
!$acc loop seq
    do j=1,nx
      A(i,j) = 1.14_8
    end do
  end do
!$acc end parallel

exemples/loop_coalescing.f90
            Without coalescing   With coalescing
Time (ms)   439                  16
The memory coalescing version is 27 times faster
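The same principle applies in C, with one twist: C arrays are row-major, so the contiguous index is the last one, the opposite of the Fortran examples above. A sketch (hypothetical initialization routine; the pragmas are ignored by a compiler without OpenACC support, so the code also runs serially):

```c
/* In C (row-major), consecutive j values are adjacent in memory, so the
 * vector loop should run over j: threads of a worker then touch
 * a[i][j], a[i][j+1], ... and their accesses coalesce. */
void init_coalesced(int n, double a[n][n])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j) {
            a[i][j] = 1.14;
        }
    }
}
```

Swapping the roles (vector on i) would give the strided, non-coalesced pattern measured above.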
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)
With coalescing
gang(j)vector(i)
gang-vector(i)seq(j)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level used inside the function (seq, gang, worker, vector).
Without !$ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Parallelism - Routines
Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
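A minimal C sketch combining two of these clauses (hypothetical arrays; with a non-OpenACC compiler the pragmas are ignored and the loop runs on the host, which keeps the example verifiable):

```c
/* b is only read on the device (copyin), a is only written back to the
 * host (copyout). Memory for both is allocated on the device when the
 * data region is entered, and a is transferred D->H when it ends. */
void scale_on_device(int n, float a[n], const float b[n])
{
    #pragma acc data copyin(b[0:n]) copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            a[i] = 2.0f * b[i];
        }
    }
}
```

Using copy(b[0:n]) instead would add a useless D→H transfer of b at region exit.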
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90
[Diagram: the parallel region is enclosed in its own local data region.]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
  do l=1,100
!$ACC parallel
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal!
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Optimal? No! You have to move the beginning of the data region before the first loop.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version
[Diagram: two parallel regions, each with its own implicit data region, each applying copyin, copyout, copy and create clauses to the arrays A, B, C and D.]
• Transfers: 8
• Allocations: 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers
[Diagram: the same two parallel regions (with their data regions), each still applying copyin, copyout, copy and create to A, B, C and D.]
Let's add a data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers
[Diagram: an enclosing data region now carries the copyin, copyout, copy and create clauses; the two inner parallel regions reuse the data.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers
[Diagram: the enclosing data region keeps the copyin, copyout, copy and create clauses; the inner parallel regions use present clauses, with update device (H→D) and update self (D→H) between them.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
  a(42) = 42
!$ACC end serial
  print *, "before end data", a(42)
!$ACC end data
  print *, "after end data", a(42)

exemples/update_dir_comment.f90
Without update: the a array is not initialized on the host before the end of the data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update self
!$ACC data copyout(a)
!$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  print *, "before update self", a(42)
!$ACC update self(a(42:42))
  print *, "after update self", a(42)
!$ACC serial
  a(42) = 42
!$ACC end serial
  print *, "before end data", a(42)
!$ACC end data
  print *, "after end data", a(42)

exemples/update_dir.f90
With update: the a array is initialized on the host after the update directive.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data is used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
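The same idea can be sketched in C, where enter data / exit data bracket the lifetime of a heap buffer so that a routine compiled elsewhere can assume the data is already present (function names and the 1.2 fill value mirror the Fortran example above; without an OpenACC compiler the pragmas are ignored and the code still runs on the host):

```c
#include <stdlib.h>

static void compute(int n, double *vec)
{
    /* Assumes vec was made present by a previous enter data. */
    #pragma acc parallel loop present(vec[0:n])
    for (int i = 0; i < n; ++i) {
        vec[i] = 1.2;
    }
}

double first_element(int n)
{
    double *vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])   /* device allocation, no transfer */
    compute(n, vec);
    #pragma acc update self(vec[0:n])         /* bring results back D->H */
    double r = vec[0];
    #pragma acc exit data delete(vec[0:n])    /* free the device copy */
    free(vec);
    return r;
}
```

The create/delete pair never transfers data; only the explicit update does.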
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
[Diagram: host and device time lines for a region opened with copyin(a,c) create(b) copyout(d). The values of a and c are transferred H→D at entry, update self(b) copies the device value of b back to the host during the region, and d is copied D→H when the region exits.]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: time lines for Kernel 1, a data transfer and Kernel 2 — fully synchronous; Kernel 1 and the transfer asynchronous; everything asynchronous.]
! b is initialized on host
  do i=1,s
    b(i) = i
  enddo
!$ACC parallel loop async(1)
  do i=1,s
    a(i) = 42
  enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
  do i=1,s
    c(i) = 1
  enddo

exemples/async_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Diagram: Kernel 2 waits for the data transfer to complete before starting.]
! b is initialized on host
  do i=1,s
    b(i) = i
  enddo
!$ACC parallel loop async(1)
  do i=1,s
    a(i) = 42
  enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
  do i=1,s
    b(i) = b(i) + 1
  enddo

exemples/async_wait_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example: code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/opti/gol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are now distributed among gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)   +                 oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data
exemples/opti/gol_opti3.f90
Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s
The Host→Device and Device→Host transfers have been optimized: they now happen once, outside the generation loop.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
- PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
- SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
- KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H:
- OpenACC: scalar variables, non-scalar variables
- OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (data movement defined from the host perspective):
- clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET
- copyin() (copy H→D when entering the construct) ↔ map(to:)
- copyout() (copy D→H when leaving the construct) ↔ map(from:)
- copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
- create() (allocated on the device, no transfers) ↔ map(alloc:)
- present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data regions:
- declare() (adds data to the global data region)
- data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
- enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
- update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
- kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
- loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines:
- routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism:
- async() / wait (executes kernels/transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts
- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Execution model
NVidia P100 restrictions on Gang/Worker/Vector:
- The number of gangs is limited to 2^31 - 1.
- The thread-grid size (i.e. num_workers x vector_length) is limited to 1024.
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
- The size of a worker (vector_length) should be a multiple of 32.
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77
Introduction - Execution model

[Figure: number of active threads for each construct with num_gangs(2), num_workers(5), vector_length(8): "ACC parallel" starts the gangs (2 active threads); "ACC loop worker" activates the workers (10); "ACC loop vector" activates the full grid (80); "ACC loop worker vector" goes directly from 2 to 80 active threads; "ACC kernels num_gangs(2) num_workers(5) vector_length(8)" behaves likewise.]
Introduction - Porting strategy
When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU
[Figure: iterative porting cycle - pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]
PGI compiler
During this training course, we will use the PGI compilers for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC:
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of.

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: the memory location on the host is pinned. It might improve data transfers.
- managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
!$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
- Command-line tool
- Gives basic information about:
  - the time spent in kernels
  - the time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs and workers and the vector size mapped to hardware

nvprof
- Command-line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
- Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with the PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length x num_workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvprof <options> executable_file <arguments of executable file>
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof
Important notes:
- Only the GPU is profiled by default.
- The option "--cpu-profiling on" activates CPU profiling.
- The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give, respectively, the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvvp <options> executable <--args arguments of executable>
or if you want to open a profile generated with nvprof
$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels, Teams, Parallel
    Work sharing: Parallel loop, Reductions, Coalescing
    Routines
    Data Management
    Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.
Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>
Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.
Execution offloading
To run on the device, you need to specify one of the 3 following directives (called compute constructs):
serial
The code is run by one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.
kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.
parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region that are not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.
Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)   +                 oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s
Execution offloading - Kernels: in the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Arrays present inside the kernels region that are not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)   +                 oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
- Arrays present inside the region that are not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
  do i=1,n
    ...
  enddo
!$ACC end parallel

Important notes:
- The number of gangs and workers and the vector size are constant inside the region.
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)   +                 oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello! I am a gang."
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU offloading, in which the creation of threads relies on the programmer.

Clauses
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops
Clauses
- Merge tightly nested loops: collapse(n)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes
- You might see the directive do instead of loop, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)   +                 oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by the gang leaders
- Active threads: 10
- Number of operations: 10 * nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 * workers
- Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the worker of all gangs
- Active threads: 10 * vector length
- Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 * workers * vector length
- Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 * workers
- Number of operations: 10 * nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 * vector length
- Number of operations: 10 * nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations and its grid of threads distributes the work
- Active threads: 10 * workers * vector length
- Number of operations: 10 * nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

- Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

- Result: somme = 100000000 (right value)
Parallelism - Merging kernels/parallel directives with loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)   +                 oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
- Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- In loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with coalescing / example without coalescing:
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing
[Figure: thread-to-iteration mapping for the three variants - gang(i)/vector(j) without coalescing, gang(j)/vector(i) with coalescing, and gang-vector(i)/seq(j) - showing which array elements threads T0..T7 touch at each step.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions:

- Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
- Global region: an implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.
- Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
- Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach

Pierre-François Lavallée, Thibaut Véry 50/77
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and the last index.

! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Shape of arrays
Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified
- In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
- The shape must be specified when using a slice
- If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53/77
Data on device - Local data regions
Optimal? No! You have to move the beginning of the data region to before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77
Data on device
Naive version: two successive parallel regions, each with its own implicit data region. In each region, array A uses copyin, B uses copyout, C uses copy and D uses create.
- Transfers: 8
- Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device
Optimize data transfers: the same two parallel regions (each with its implicit data region) still carry their own copyin (A), copyout (B), copy (C) and create (D) clauses. Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device
Optimize data transfers: a data region now encloses the two parallel regions and carries the copyin (A), copyout (B), copy (C) and create (D) clauses.
A, B and C are now transferred at entry and exit of the data region:
- Transfers: 4
- Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device
Optimize data transfers: the enclosing data region keeps the copyin (A), copyout (B), copy (C) and create (D) clauses; inside it, the parallel regions use present for A, B, C and D, with an update device where a host copy changes and an update self where a device result is needed on the host.
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
- Transfers: 6
- Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC&              copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
!  print *, "before update self: ", a(42)
!!$ACC update self(a(42:42))
!  print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC&              copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
Data on device - Global data regions enter data exit data
Important notes: one benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77
Data on device - Global data regions declare
Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77
Data on device - Time line
[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D while b and d are allocated on the device; the device then updates its copies; update self(b) copies b back D→H during the region; at exit of the data region, copyout(d) copies d back D→H, while the device values of a, b and c are discarded.]

Pierre-François Lavallée, Thibaut Véry 59/77
Data on device - Important notes
- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77
Parallelism - Asynchronism
By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", Kernel 1 and the data transfer overlap; with everything async, Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on the host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
- The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the data transfer to complete before starting, while Kernel 1 runs concurrently.]

! b is initialized on the host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 63/77
Debugging - PGI autocompare
- Automatic detection of result differences when comparing GPU and CPU execution
- How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77
Debugging - PGI autocompare
- To activate this feature, just add autocompare to the target options at compilation (-ta=tesla:cc60,autocompare)
- Example of a code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 66/77
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )               +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/optigol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )               +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )               +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 70/77
Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
- PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
- SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
- KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
- scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
- clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
- copyin() (copy H→D when entering the construct) ↔ map(to:)
- copyout() (copy D→H when leaving the construct) ↔ map(from:)
- copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
- create() (created on the device, no transfers) ↔ map(alloc:)
- present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77
Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses, with notes)

Data regions:
- declare() (adds data to the global data region)
- data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)
- enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
- update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
- kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
- loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77
Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses, with notes)

Routines:
- routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism:
- async()/wait (executes kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 76/77
Contacts
- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
Introduction - Execution model
[Figure: number of active threads for each construct. With ACC parallel (which starts 2 gangs): 2 active threads; adding ACC loop worker (5 workers): 10; adding ACC loop vector (vector length 8): 80. With ACC parallel plus a combined loop worker vector, the count goes from 2 to 80. With ACC kernels num_gangs(2) num_workers(5) vector_length(8): up to 2 x 5 x 8 = 80 active threads.]

Pierre-François Lavallée, Thibaut Véry 14/77
Introduction - Porting strategy
When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle: Pick a kernel → Analyze → Parallelize loops → Optimize loops → Optimize data]

Pierre-François Lavallée, Thibaut Véry 15/77
PGI compiler
During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC:
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI

Pierre-François Lavallée, Thibaut Véry 16/77
PGI Compiler - -ta options
Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified

Complete documentation: https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-François Lavallée, Thibaut Véry 17/77
PGI Compiler - Compiler information
Information available with -Minfo=accel.

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-François Lavallée, Thibaut Véry 18/77
PGI Compiler - Compiler information
Information available with -Minfo=accel.

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-François Lavallée, Thibaut Véry 19/77
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME:
- Command line tool
- Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof:
- Command line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
- Graphical interfaces for nvprof

Pierre-François Lavallée, Thibaut Véry 20/77
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length x number of workers]).

Pierre-François Lavallée, Thibaut Véry 21/77
Analysis - Code profiling: nvprof

Nvidia provides a command line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  92.907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
- Only the GPU is profiled by default
- The option "--cpu-profiling on" activates CPU profiling
- The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
- To get a file readable by nvvp, specify "-o filename" (add -f in case you want to overwrite a file)

Pierre-François Lavallée, Thibaut Véry 22/77
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof, nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-François Lavallée, Thibaut Véry 23/77
Analysis - Graphical profiler NVVP
Pierre-François Lavallée, Thibaut Véry 24/77
1 Introduction
2 ParallelismOffloaded Execution
Add directivesCompute constructsSerialKernelsTeamsParallel
Work sharingParallel loopReductionsCoalescing
RoutinesRoutines
Data ManagementAsynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 25/77
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.

Pierre-François Lavallée, Thibaut Véry 26/77
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-single
- Vector-single
The programmer has to share the work manually.

Pierre-François Lavallée, Thibaut Véry 27/77
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. The instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Pierre-François Lavallée, Thibaut Véry 28/77
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )               +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 1236.40 s

Pierre-François Lavallée, Thibaut Véry 29/77
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).
Data: Default behavior
• Arrays present inside the kernels region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.
Fortran
!$ACC kernels
do i = 1, n
  do j = 1, n
    ...
  enddo
  do j = 1, n
    ...
  enddo
enddo
!$ACC end kernels
Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels
!$ACC kernels
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels
(exemples/gol_kernels.f90)
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.
Data: Default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.
!$ACC parallel
a = 2              ! Gang-redundant
!$ACC loop         ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel
Important notes
• The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive
!$ACC parallel
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel
(exemples/gol_parallel.f90)
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)
The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).
Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size
program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang."
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program
(exemples/parametres_paral.f90)
Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU offloading, for which the creation of threads relies on the programmer.
Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies
Clause       | Work sharing (loop iterations) | Activation of new threads
loop gang    | X                              |
loop worker  | X                              | X
loop vector  | X                              | X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)
Notes
• You might also see the directive do, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g = 1, generations
  !$ACC loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel
(exemples/gol_parallel_loop.f90)
Work sharing - Loops
serial
!$acc serial
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end serial
(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx
Execution GR/WS/VS
!$acc parallel num_gangs(10)
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx
Execution GP/WS/VS
!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS
!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx
Execution GP/WS/VP
!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx
Execution GP/WP/VP
!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx
Execution GR/WP/VS
!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx
Execution GR/WS/VP
!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx
Execution GR/WP/VP
!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
  A(i) = B(i) * i
end do
!$acc end parallel
(exemples/loop_GRWPVP.f90)
• Each gang is assigned all the iterations; its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i = 1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel
(exemples/loop_pb_sync.f90)
• Result: sum = 97845213 (wrong value)
!$acc parallel
!$acc loop worker vector
do i = 1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel
(exemples/loop_pb_sync_corr.f90)
• Result: sum = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.
Example with the kernels directive:
program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i = 1, 1000
    a(i) = b(i) * 2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i = 1, 1000
    a(i) = b(i) * 2
  enddo
end program fused
(exemples/fused.f90)
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.
Operations
Operation | Effect       | Language(s)
+         | Sum          | Fortran, C(++)
*         | Product      | Fortran, C(++)
max       | Maximum      | Fortran, C(++)
min       | Minimum      | Fortran, C(++)
&         | Bitwise and  | C(++)
|         | Bitwise or   | C(++)
&&        | Logical and  | C(++)
||        | Logical or   | C(++)
iand      | Bitwise and  | Fortran
ior       | Bitwise or   | Fortran
ieor      | Bitwise xor  | Fortran
.and.     | Logical and  | Fortran
.or.      | Logical or   | Fortran
Reduction operators
Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g = 1, generations
  !$ACC loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel
(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop with vector parallelism should have contiguous access to memory.
[Figures: example with coalescing; example without coalescing]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i = 1, nx
  !$acc loop vector
  do j = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i = 1, nx
  !$acc loop seq
  do j = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
(exemples/loop_coalescing.f90)
Time (ms): without coalescing 439, with coalescing 16
The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing
[Figure: thread-to-element mapping on array A. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), or gang-vector(i) with seq(j).]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i = 1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j = 1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine
(exemples/routine_wrong.f90)
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i = 1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j = 1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine
(exemples/routine.f90)
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions:
• Computation offloading: offloading regions (serial, parallel, kernels) have an associated local data region.
• Global region: an implicit data region is opened for the lifetime of the program. It is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of the procedure. To make data visible, use the declare directive.
Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array
Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.
No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.
By default the clauses check whether the variable is already present on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.
Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.
Fortran
The array shape is specified in parentheses; you must give the first and last index.
! We copy a 2D array (matrix a) on the GPU; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i = 1, 1000
  do j = 1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 (included)
!$ACC parallel loop copy(a(:,100:199))
do i = 1, 1000
  do j = 100, 199
    a(i,j) = 42
  enddo
enddo
(exemples/arrayshape.f90)
C/C++
The array shape is specified in square brackets; you must give the first index and the number of elements.
// We copy array a, giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;
(exemples/arrayshape.c)
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.
Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.
program para
  integer :: a(1000)
  integer :: i, l
  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para
(exemples/parallel_data.f90)
[Diagram: the parallel region implicitly opens a data region around it]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.
program para
  integer :: a(1000)
  integer :: i, l
  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
  do l = 1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i = 1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para
(exemples/parallel_data_multi.f90)
Notes
• Compute region reached 100 times.
• Data region reached 100 times (entry and exit, so potentially 200 data transfers).
Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure; the variables listed in its clauses become visible on the device. Use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i = 1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l = 1, 100
  !$ACC parallel
  !$ACC loop
  do i = 1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data
(exemples/parallel_data_single.f90)
Notes
• Compute region reached 100 times.
• Data region reached 1 time (entry and exit, so potentially 2 data transfers).
Optimal? No: you should also move the beginning of the data region before the first loop.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version: two successive parallel regions, each with its own implicit data region, each transferring A (copyin), B (copyout), C (copy) and allocating D (create).
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: the two parallel regions still transfer A, B, C and D individually. Let's add an enclosing data region.
Data on device
With the enclosing data region, A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1
Data on device
The data clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.
self (or host): the listed variables are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo
do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)
(exemples/update_dir_comment.f90)
Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.
self (or host): the listed variables are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo
do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)
(exemples/update_dir.f90)
With update: the a array is updated on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.
device: the listed variables are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that data transfers to the device can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i = 1, s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata
(exemples/enter_data.f90)
Data on device - Global data regions declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:
Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module
(exemples/declare.f90)
Data on device - Time line
[Diagram: host and device copies of a, b, c, d along the execution.
At region entry, copyin(a,c) create(b) copyout(d) transfers a and c to the device and allocates b and d there.
update self(b) copies the device value of b back to the host during the region.
At region exit, copyout(d) copies d back to the host; the other device copies are discarded.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data (or with enter data) encloses the parallel regions, you should use update to avoid unexpected behaviors.
Directive update
The update directive is used to update data either on the host or on the device.
! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)
(exemples/update.f90)
Parallelism - Asynchronism
By default only one execution thread (queue) is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.
Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify different numbers inside the clause to create several execution threads.
[Diagram: timelines for kernel 1, the data transfer and kernel 2 — first fully synchronous, then with kernel 1 and the transfer asynchronous, then with everything asynchronous.]
! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 1
enddo
(exemples/async_slides.f90)
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels are likely to need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.
Important notes
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) waits until completion of execution threads 1 and 2.
[Diagram: kernel 2 waits for the data transfer before starting; kernel 1 runs concurrently.]
! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 1
enddo
(exemples/async_wait_slides.f90)
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the -ta compilation option (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
  do g=1,generations
    cells = 0
    !$ACC parallel
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1,rows
      do c=1,cols
        ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

  do g=1,generations
    cells = 0
    !$ACC parallel loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    !$ACC parallel loop
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    !$ACC end parallel
    print *, "Cells alive at generation ", g, cells
  enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among the gangs.
Optimization - Game of Life

Let's optimize the data transfers.

  cells = 0
  !$ACC data copy(oworld, world)
  do g=1,generations
    cells = 0
    !$ACC parallel loop
    do c=1,cols
      do r=1,rows
        oworld(r,c) = world(r,c)
      enddo
    enddo
    !$ACC parallel loop
    do c=1,cols
      do r=1,rows
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do c=1,cols
      do r=1,rows
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.
OpenACC 2.7 / OpenMP 5.0 translation table (features, directives/clauses, notes)

GPU offloading
• OpenACC: PARALLEL (GR,WS,VS mode; each gang executes the instructions redundantly). OpenMP: TEAMS (creates teams of threads that execute redundantly).
• OpenACC: SERIAL (only 1 thread executes the code). OpenMP: TARGET (GPU offloading, initial thread only).
• OpenACC: KERNELS (offloading + automatic parallelization). OpenMP: does not exist in OpenMP.

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables. OpenMP: scalar variables, non-scalar variables.

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS. OpenMP: clauses on TARGET (data movement defined from the host perspective).
• OpenACC: copyin(), copy H→D when entering the construct. OpenMP: map(to:).
• OpenACC: copyout(), copy D→H when leaving the construct. OpenMP: map(from:).
• OpenACC: copy(), copy H→D when entering and D→H when leaving. OpenMP: map(tofrom:).
• OpenACC: create(), created on the device, no transfers. OpenMP: map(alloc:).
• OpenACC: present(), delete().
Data region
• OpenACC: declare(), adds data to the global data region.
• OpenACC: data / end data, inside the same programming unit. OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA, inside the same programming unit.
• OpenACC: enter data / exit data, opening and closing anywhere in the code. OpenMP: TARGET ENTER DATA / TARGET EXIT DATA, opening and closing anywhere in the code.
• OpenACC: update self(), update D→H. OpenMP: UPDATE FROM, copy D→H inside a data region.
• OpenACC: update device(), update H→D. OpenMP: UPDATE TO, copy H→D inside a data region.

Loop parallelization
• OpenACC: kernels, offloading + automatic parallelization of suitable loops, synchronization at the end of the loops. OpenMP: does not exist in OpenMP.
• OpenACC: loop gang/worker/vector, shares the iterations among gangs, workers and vectors. OpenMP: distribute, shares the iterations among TEAMS.
Routines
• OpenACC: routine [seq|vector|worker|gang], compiles a device routine with the specified level of parallelism.

Asynchronism
• OpenACC: async()/wait, executes kernels and transfers asynchronously, with explicit synchronization. OpenMP: nowait/depend, manages asynchronism and task dependencies.
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

  !$acc parallel
  !$acc loop gang worker vector
  do i=1,nx
    A(i) = 1.14_8*i
  end do
  !$acc end parallel

exemples/loop1D.f90

  !$acc parallel
  !$acc loop gang worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j) = 1.14_8
    end do
  end do
  !$acc end parallel

exemples/loop2D.f90

OpenMP:

  !$OMP TARGET
  !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
  do i=1,nx
    A(i) = 1.14_8*i
  end do
  !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
  !$OMP END TARGET

exemples/loop1DOMP.f90

  !$OMP TARGET TEAMS DISTRIBUTE
  do j=1,nx
    !$OMP PARALLEL DO SIMD
    do i=1,nx
      A(i,j) = 1.14_8
    end do
    !$OMP END PARALLEL DO SIMD
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

  !$acc parallel
  !$acc loop gang
  do k=1,nx
    !$acc loop worker
    do j=1,nx
      !$acc loop vector
      do i=1,nx
        A(i,j,k) = 1.14_8
      end do
    end do
  end do
  !$acc end parallel

exemples/loop3D.f90

OpenMP:

  !$OMP TARGET TEAMS DISTRIBUTE
  do k=1,nx
    !$OMP PARALLEL DO
    do j=1,nx
      !$OMP SIMD
      do i=1,nx
        A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
    end do
    !$OMP END PARALLEL DO
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelize)
3. Optimize data transfers and loops (Optimize)
4. Repeat steps 1 to 3 until everything runs on the GPU
[Diagram: iterative porting cycle: pick a kernel, analyze, parallelize loops, optimize data, optimize loops]
PGI compiler

During this training course, we will use the PGI compilers for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90/pgfortran

Activating OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! It displays information about the compilation; the compiler performs implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: GUI for nvprof
PGI Compiler: -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation: https://www.pgroup.com/resources/docs/1910/x86/pvf-user-guide/index.htm#ta
PGI Compiler: compiler information

Information available with -Minfo=accel:

  program loop
    integer :: a(10000)
    integer :: i
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
  end program loop

exemples/loop.f90

  program reduction
    integer :: a(10000)
    integer :: i
    integer :: sum=0
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo
  end program reduction

exemples/reduction.f90
PGI Compiler: compiler information

Information available with -Minfo=accel:

  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

  !$ACC data create(a(1:10000))
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
  !$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  - the time spent in kernels
  - the time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine-grained view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with the PGI compilers, it is possible to get an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length x number of workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

  Type             Time(%)  Time      Calls  Avg       Min       Max       Name
  GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                   36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                   6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                   2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                   0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                   0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default.
• The option "--cpu-profiling on" activates CPU profiling.
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give, respectively, the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel.
• To get a file readable by nvvp, specify "-o filename" (add -f if you want to overwrite an existing file).
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler: NVVP

[Screenshot of the NVVP timeline interface]
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC):

  !$ACC directive <clauses>
  ...
  !$ACC end directive

C/C++ (OpenACC):

  #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code is run by one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region that are not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

  !$ACC serial
  do i=1,n
    do j=1,n
      ...
    enddo

    do j=1,n
      ...
    enddo
  enddo
  !$ACC end serial
Offloaded execution - Directive serial

  !$ACC serial
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: in the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers, and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

  !$ACC kernels
  do i=1,n
    do j=1,n
      ...
    enddo

    do j=1,n
      ...
    enddo
  enddo
  !$ACC end kernels

Important notes:
• The parameters of the generated parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels

  !$ACC kernels
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among the gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

  !$ACC parallel
  a = 2          ! Gang-redundant
  !$ACC loop     ! Work sharing
  do i=1,n
    ...
  enddo
  !$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region.
Execution offloading - parallel directive

  !$ACC parallel
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar), 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is used to reproduce the behavior expected by the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size; the number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

  program param
    integer :: a(1000)
    integer :: i

    !$ACC parallel num_gangs(10) num_workers(1) &
    !$ACC& vector_length(128)
    print *, "Hello, I am a gang"
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism; this point (automatic nesting) is a critical difference from OpenMP GPU offloading, where the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  - worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

Clause       | Work sharing (loop iterations) | Activation of new threads
loop gang    | X                              |
loop worker  | X                              | X
loop vector  | X                              | X
Work sharing - Loops

Clauses (continued):
• Merge tightly nested loops: collapse(<number of loops>)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(<variable list>)
• Perform a reduction: reduction(<operation>:<variable list>)

Notes:
• You might see the directive do, kept for compatibility.
Work sharing - Loops

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

Execution: serial

  !$acc serial
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution: GR,WS,VS

  !$acc parallel num_gangs(10)
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution: GP,WS,VS

  !$acc parallel num_gangs(10)
  !$acc loop gang
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution: GP,WP,VS

  !$acc parallel num_gangs(10)
  !$acc loop gang worker
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWPVS.f90
• The iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution: GP,WS,VP

  !$acc parallel num_gangs(10)
  !$acc loop gang vector
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWSVP.f90
• The iterations are shared among the threads of the active worker of each gang
• Active threads: 10 x vector length
• Number of operations: nx

Execution: GP,WP,VP

  !$acc parallel num_gangs(10)
  !$acc loop gang worker vector
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWPVP.f90
• The iterations are shared on the whole grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution: GR,WP,VS

  !$acc parallel num_gangs(10)
  !$acc loop worker
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution: GR,WS,VP

  !$acc parallel num_gangs(10)
  !$acc loop vector
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared among the threads of its active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution: GR,WP,VP

  !$acc parallel num_gangs(10)
  !$acc loop worker vector
  do i=1,nx
    A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations, which are shared on its grid of threads
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at the gang level, especially at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code located after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    A(i) = 1.0_8
  end do
  !$acc loop gang reduction(+:somme)
  do i=nx,1,-1
    somme = somme + A(i)
  end do
  !$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 9784521.3 (wrong value)

  !$acc parallel
  !$acc loop worker vector
  do i=1,nx
    A(i) = 1.0_8
  end do
  !$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
    somme = somme + A(i)
  end do
  !$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 10000000.0 (right value)
Parallelism - Merging the kernels/parallel and loop directives

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

  program fused
    integer :: a(1000), b(1000), i
    !$ACC kernels <clauses>
    !$ACC loop <clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
    !$ACC end kernels
    ! The construct above is equivalent to:
    !$ACC kernels loop <kernels and loop clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
  end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions: the variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)+oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        end if
      enddo
    enddo
    cells = 0
    !$ACC loop reduction(+:cells)
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

[Figures: example with coalescing; example without coalescing]
Parallelism - Memory accesses and coalescing
$acc p a r a l l e l10 $acc loop gang
do i =1 nx $acc loop vec to rdo j =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop_nocoalescingf90
$acc p a r a l l e l20 $acc loop gang
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_825 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
$acc p a r a l l e l30 $acc loop gang vec to r
do i =1 nx $acc loop seqdo j =1 nx
A( i j )=114_835 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
Without Coalescing With CoalescingTps (ms) 439 16
The memory coalescing version is 27 times faster
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
Parallelism - Memory accesses and coalescing

[Diagram, built up over several animation slides: memory access patterns for the three variants. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker touch non-contiguous memory locations. With coalescing, gang(j)/vector(i): the threads T0..T7 access consecutive locations, as does the gang-vector(i)/seq(j) variant.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: host, D: device; "variable" means a scalar or an array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region implicitly opens a data region around itself]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?
Data on device - Local data regions
Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.
Data on device - Optimizing data transfers

[Diagram, shown over three animation steps: two successive parallel regions (each with its own implicit data region) using the variables A (copyin), B (copyout), C (copy) and D (create)]

Naive version: each parallel region transfers its own variables at entry and exit.
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: scope
• module: the program
• function: the function
• subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Diagram: host and device copies of a, b, c and d along the time line of a data region opened with copyin(a,c) create(b) copyout(d)]
• At entry, a and c are copied H→D; b and d are allocated on the device without any transfer.
• update self(b) copies the device value of b back to the host during the region.
• At exit, d is copied D→H (copyout); the host values of a and c are left untouched.
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases the argument of async is optional.
It is possible to specify a number inside the clause to create several execution threads.

[Diagram: execution time lines — synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait.
A wait directive is also available.

Important notes
• The argument of the wait clause (or of the wait directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the completion of the transfer before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature just add autocompare to the compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL — GR-WS-VS mode: each gang executes the instructions redundantly
  ↔ OpenMP TEAMS — creates teams of threads that execute redundantly
• OpenACC SERIAL — only 1 thread executes the code
  ↔ OpenMP TARGET — GPU offloading, initial thread only
• OpenACC KERNELS — offloading + automatic parallelization
  ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables
  ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS
  ↔ OpenMP: clauses on TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present(), delete()
Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit
  ↔ TARGET DATA MAP(to/from:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code
  ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops
  ↔ does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors
  ↔ distribute — share iterations among TEAMS
Routines
• routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism
• async() / wait — execute kernels/transfers asynchronously, with explicit synchronization
  ↔ nowait / depend — manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Contacts
• Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
• Thibaut Véry — thibaut.very@idris.fr
• Rémi Lacroix — remi.lacroix@idris.fr
PGI compiler
During this training course we will use PGI compilers for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation. The compiler performs implicit operations that you want to be aware of.

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI
PGI Compiler - -ta options
Important options for -ta
Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interfaces for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  9.2907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file).
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between Fortran and C/C++.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77
Offloaded execution - Directive serial

    !$ACC serial
    do g=1,generations
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )               +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       cells = 0
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

    !$ACC kernels
    do i=1,n
       do j=1,n
          ...
       enddo
       do j=1,n
          ...
       enddo
    enddo
    !$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.
Execution offloading - Kernels

    !$ACC kernels
    do g=1,generations
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )               +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       cells = 0
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

    !$ACC parallel
    a = 2          ! Gang-redundant
    !$ACC loop     ! Work sharing
    do i=1,n
       ...
    enddo
    !$ACC end parallel

Important note:
- The number of gangs, the number of workers and the vector size are constant inside the region.
Execution offloading - parallel directive

    !$ACC parallel
    do g=1,generations
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )               +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       cells = 0
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar), 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

    program param
       integer :: a(1000)
       integer :: i
       !$ACC parallel num_gangs(10) num_workers(1) &
       !$ACC& vector_length(128)
       print *, "Hello, I am a gang"
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel
    end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses — level of parallelism:
- gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
- worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
- vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
- seq: the iterations are executed sequentially on the accelerator
- auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

    Clause      | Work sharing (loop iterations) | Activation of new threads
    loop gang   | X                              |
    loop worker | X                              | X
    loop vector | X                              | X
Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do instead of loop, kept for compatibility.
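A short sketch of the collapse and private clauses, here in C syntax (the function name is illustrative). collapse(2) merges the two loops into one iteration space, so n*n iterations rather than n are distributed over the threads; tmp is privatized so each thread gets its own copy.

```c
/* Fills an n-by-n matrix stored row-major in a flat array. */
void fill_matrix(int n, double *a)
{
    double tmp;
    /* collapse(2): distribute the fused i,j iteration space;
     * private(tmp): one copy of tmp per thread, avoiding a race */
    #pragma acc parallel loop collapse(2) private(tmp) copyout(a[0:n*n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            tmp = (double)i * n + j;
            a[i * n + j] = tmp;
        }
}
```

Without collapse, only the outer loop's n iterations would be shared, which can underuse the device when n is small.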
Work sharing - Loops

    !$ACC parallel
    do g=1,generations
       !$ACC loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )               +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       cells = 0
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial:

    !$acc serial
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end serial

exemples/loop_serial.f90
- Active threads: 1
- Number of operations: nx

Execution GR/WS/VS:

    !$acc parallel num_gangs(10)
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GRWSVS.f90
- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 × nx

Execution GP/WS/VS:

    !$acc parallel num_gangs(10)
    !$acc loop gang
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GPWSVS.f90
- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP/WP/VS:

    !$acc parallel num_gangs(10)
    !$acc loop gang worker
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GPWPVS.f90
- Iterations are shared among the active workers of each gang
- Active threads: 10 × workers
- Number of operations: nx

Execution GP/WS/VP:

    !$acc parallel num_gangs(10)
    !$acc loop gang vector
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GPWSVP.f90
- Iterations are shared among the threads of the worker of all gangs
- Active threads: 10 × vector length
- Number of operations: nx

Execution GP/WP/VP:

    !$acc parallel num_gangs(10)
    !$acc loop gang worker vector
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GPWPVP.f90
- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 × workers × vector length
- Number of operations: nx

Execution GR/WP/VS:

    !$acc parallel num_gangs(10)
    !$acc loop worker
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GRWPVS.f90
- Each gang is assigned all the iterations, which are shared among the workers
- Active threads: 10 × workers
- Number of operations: 10 × nx

Execution GR/WS/VP:

    !$acc parallel num_gangs(10)
    !$acc loop vector
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GRWSVP.f90
- Each gang is assigned all the iterations, which are shared on the threads of the active worker
- Active threads: 10 × vector length
- Number of operations: 10 × nx

Execution GR/WP/VP:

    !$acc parallel num_gangs(10)
    !$acc loop worker vector
    do i=1,nx
       A(i) = B(i)*i
    end do
    !$acc end parallel

exemples/loop_GRWPVP.f90
- Each gang is assigned all iterations and the grid of threads distributes the work
- Active threads: 10 × workers × vector length
- Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

    !$acc parallel
    !$acc loop gang
    do i=1,nx
       A(i) = 1.0_8
    end do
    !$acc loop gang reduction(+:somme)
    do i=nx,1,-1
       somme = somme + A(i)
    end do
    !$acc end parallel

exemples/loop_pb_sync.f90
- Result: sum = 97845213 (wrong value)

    !$acc parallel
    !$acc loop worker vector
    do i=1,nx
       A(i) = 1.0_8
    end do
    !$acc loop worker vector reduction(+:somme)
    do i=nx,1,-1
       somme = somme + A(i)
    end do
    !$acc end parallel

exemples/loop_pb_sync_corr.f90
- Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

    program fused
       integer :: a(1000), b(1000), i
       !$ACC kernels <clauses>
       !$ACC loop <clauses>
       do i=1,1000
          a(i) = b(i)*2
       enddo
       !$ACC end kernels
       ! The construct above is equivalent to:
       !$ACC kernels loop <kernels and loop clauses>
       do i=1,1000
          a(i) = b(i)*2
       enddo
    end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction clause is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators:

    Operation | Effect      | Language(s)
    +         | Sum         | Fortran, C(++)
    *         | Product     | Fortran, C(++)
    max       | Maximum     | Fortran, C(++)
    min       | Minimum     | Fortran, C(++)
    &         | Bitwise and | C(++)
    |         | Bitwise or  | C(++)
    &&        | Logical and | C(++)
    ||        | Logical or  | C(++)
    iand      | Bitwise and | Fortran
    ior       | Bitwise or  | Fortran
    ieor      | Bitwise xor | Fortran
    .and.     | Logical and | Fortran
    .or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
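The mechanism can be sketched in C syntax (illustrative function name), mirroring the "cells" count of the Game of Life example: each thread accumulates into a private copy of the variable and the partial results are combined with the + operator at the end of the loop.

```c
/* Counts live cells in a flattened world array.
 * reduction(+:cells): each thread gets a private cells initialized
 * to 0; the private partial sums are added together at loop end. */
long count_alive(int n, const int *world)
{
    long cells = 0;
    #pragma acc parallel loop reduction(+:cells) copyin(world[0:n])
    for (int i = 0; i < n; ++i)
        cells += world[i];
    return cells;
}
```

Without the reduction clause, all threads would update the same shared cells variable and race with each other, as in the loop_pb_sync.f90 example above.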
Parallelism - Reductions: example

    !$ACC parallel
    do g=1,generations
       !$ACC loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )               +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       cells = 0
       !$ACC loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example access patterns with and without coalescing.]
Parallelism - Memory accesses and coalescing

    !$acc parallel
    !$acc loop gang
    do i=1,nx
       !$acc loop vector
       do j=1,nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop_nocoalescing.f90

    !$acc parallel
    !$acc loop gang
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop_coalescing.f90

    !$acc parallel
    !$acc loop gang vector
    do i=1,nx
       !$acc loop seq
       do j=1,nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop_coalescing.f90

               | Without coalescing | With coalescing
    Time (ms)  | 43.9               | 1.6

The memory-coalescing version is 27 times faster.
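The Fortran examples above exploit column-major layout: A(i,j) and A(i+1,j) are contiguous, so the vector loop runs over i. C arrays are row-major, so the roles are transposed — the vector loop should run over the last index. A sketch under that assumption (illustrative function name):

```c
/* Row-major C layout: a[i][j] and a[i][j+1] are contiguous, so the
 * innermost (vector) loop should iterate over j for coalesced access.
 * This is the transposed situation of the column-major Fortran slides. */
void init_coalesced(int n, double *a)
{
    #pragma acc parallel loop gang copyout(a[0:n*n])
    for (int i = 0; i < n; ++i) {
        /* consecutive vector threads touch consecutive words a[i*n+j] */
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
            a[i * n + j] = 1.14;
    }
}
```

Swapping the gang and vector loops here would reproduce the non-coalesced access pattern of loop_nocoalescing.f90.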
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto the array elements for the three variants — gang(i)/vector(j) (without coalescing, each thread strides across memory), gang(j)/vector(i) (with coalescing, consecutive threads touch consecutive elements), and gang-vector(i)/seq(j).]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

    program routine
       integer :: s = 10000
       integer, allocatable :: array(:,:)
       allocate(array(s,s))
       !$ACC parallel
       !$ACC loop
       do i=1,s
          call fill(array(:,:), s, i)
       enddo
       !$ACC end parallel
       print *, array(1,1:10)
    contains
       subroutine fill(array, s, i)
          integer, intent(out) :: array(:,:)
          integer, intent(in)  :: s, i
          integer :: j
          do j=1,s
             array(i,j) = 2
          enddo
       end subroutine fill
    end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

    program routine
       integer :: s = 10000
       integer, allocatable :: array(:,:)
       allocate(array(s,s))
       !$ACC parallel copyout(array)
       !$ACC loop
       do i=1,s
          call fill(array(:,:), s, i)
       enddo
       !$ACC end parallel
       print *, array(1,1:10)
    contains
       subroutine fill(array, s, i)
          !$ACC routine seq
          integer, intent(out) :: array(:,:)
          integer, intent(in)  :: s, i
          integer :: j
          do j=1,s
             array(i,j) = 2
          enddo
       end subroutine fill
    end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

- Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
- Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.
- Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
- Data region associated to programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: host; D: device; variable: scalar or array.

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region.
- copyout: the variable is copied D→H; the memory is allocated when entering the region.
- copy: copyin + copyout.

No data movement:
- create: the memory is allocated when entering the region.
- present: the variable is already on the device.
- delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You might see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
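A minimal sketch of how these clauses combine, in C syntax (illustrative names): the input is copied in, a scratch array exists only on the device, and the result is only copied out. With the pragmas ignored, the code runs sequentially with the same result.

```c
/* b is copied H->D (copyin), tmp is allocated on the device without
 * any transfer (create), and c is copied D->H at region exit
 * (copyout) -- no H->D transfer is ever made for c or tmp. */
void pipeline(int n, const double *b, double *c, double *tmp)
{
    #pragma acc data copyin(b[0:n]) create(tmp[0:n]) copyout(c[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0 * b[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            c[i] = tmp[i] + 1.0;
    }
}
```

Choosing create for tmp rather than copy avoids two useless transfers, which is exactly the kind of saving discussed in the transfer-optimization slides below.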
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must specify the first and last index.

    ! We copy a 2D array to the GPU for matrix a.
    ! In Fortran we can omit the shape: copy(a)
    !$ACC parallel loop copy(a(1:1000,1:1000))
    do i=1,1000
       do j=1,1000
          a(i,j) = 0
       enddo
    enddo
    ! We copy columns 100 to 199 included back to the host
    !$ACC parallel loop copy(a(:,100:199))
    do i=1,1000
       do j=100,199
          a(i,j) = 42
       enddo
    enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must specify the first index and the number of elements.

    // We copy the array a by giving the first element and the size of the array
    #pragma acc parallel loop copy(a[0:1000][0:1000])
    for (int i=0; i<1000; ++i)
        for (int j=0; j<1000; ++j)
            a[i][j] = 0;
    // Copy columns 99 to 198
    #pragma acc parallel loop copy(a[0:1000][99:100])
    for (int i=0; i<1000; ++i)
        for (int j=99; j<199; ++j)
            a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
- The shape must be specified when using a slice.
- If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).
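The C restriction can be illustrated with a short sketch (illustrative function name): for a dynamically allocated array the compiler cannot deduce the size, so the [start:count] shape is mandatory in the data clause.

```c
#include <stdlib.h>

/* Returns a freshly allocated array of n zeros.
 * a[0:n] means: start index 0, n elements (a count, not a last index,
 * unlike the Fortran a(1:n) syntax). */
double *make_zeros(int n)
{
    double *a = malloc((size_t)n * sizeof *a);
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 0.0;
    return a;
}
```

Writing copyout(a) without a shape here would be rejected by an OpenACC compiler, since a is just a pointer.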
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

    program para
       integer :: a(1000)
       integer :: i, l

       !$ACC parallel copyout(a(1:1000))   ! data region + parallel region
       !$ACC loop
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel
    end program para

exemples/parallel_data.f90
Data on device - Parallel regions

    program para
       integer :: a(1000)
       integer :: i, l

       !$ACC parallel copyout(a(1:1000))
       !$ACC loop
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel

       do l=1,100
          !$ACC parallel copy(a(1:1000))
          !$ACC loop
          do i=1,1000
             a(i) = a(i) + 1
          enddo
          !$ACC end parallel
       enddo
    end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times.
- Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
       a(i) = i
    enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
    do l=1,100
       !$ACC parallel
       !$ACC loop
       do i=1,1000
          a(i) = a(i) + 1
       enddo
       !$ACC end parallel
    enddo
    !$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times.
- Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal?
Optimal? No: to be optimal, the beginning of the data region has to be moved before the first parallel region, so that its transfer is covered as well.
Data on device - Naive version

[Diagram: two consecutive parallel regions, each with its own implicit data region, use four arrays A, B, C, D with the clauses copyin(A), copyout(B), copy(C), create(D) repeated on each region.]

- Transfers: 8
- Allocations: 2
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1,1000
       a(i) = 0
    enddo
    do j=1,42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1,1000
          a(i) = a(i) + rng
       enddo
    enddo
    ! print *, "before update self", a(42)
    ! !$ACC update self(a(42:42))
    ! print *, "after update self", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data", a(42)
    !$ACC end data
    print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1,1000
       a(i) = 0
    enddo
    do j=1,42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1,1000
          a(i) = a(i) + rng
       enddo
    enddo
    print *, "before update self", a(42)
    !$ACC update self(a(42:42))
    print *, "after update self", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data", a(42)
    !$ACC end data
    print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is updated on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data is used. This is especially useful with C++ constructors and destructors.

    program enterdata
       integer :: s = 10000
       real*8, allocatable, dimension(:) :: vec
       allocate(vec(s))
       !$ACC enter data create(vec(1:s))
       call compute
       !$ACC exit data delete(vec(1:s))
    contains
       subroutine compute
          integer :: i
          !$ACC parallel loop
          do i=1,s
             vec(i) = 12
          enddo
       end subroutine
    end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

    Zone       | Scope
    module     | the program
    function   | the function
    subroutine | the subroutine

    module declare
       integer :: rows = 1000
       integer :: cols = 1000
       integer :: generations = 100
       !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90
Data on device - Timeline

[Figure: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred to the device and b is allocated there; during the region, update self(b) copies b D→H; at exit of the data region, d is copied back D→H.]
Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data (or with enter data) encloses the parallel regions, you should use update to avoid unexpected behaviors.

Directive update: the update directive is used to update data either on the host or on the device.

    ! Update the value of a located on host with the value on device
    !$ACC update self(a)
    ! Update the value of b located on device with the value on host
    !$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
[Timeline figure: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async, Kernel 1 and the transfer overlap; with a distinct async queue for each operation, everything overlaps.]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (directive or clause) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Timeline figure: Kernel 1 runs on queue 1 while the transfer runs on queue 2; Kernel 2 waits for the transfer to complete before starting.]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
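The race-condition listing for this slide did not survive extraction. A sketch in the spirit of the gang-synchronization example shown later in the deck (the loop_pb_sync case) is the kind of defect autocompare can catch; nx, A and somme are assumed to be declared elsewhere:

```fortran
! Two gang-parallel loops with a dependency inside one parallel
! region: gangs are not synchronized between the loops, so the
! second loop may read elements of A before they are written.
! The GPU result then differs from the CPU one, and autocompare
! stops the execution at the first mismatch.
!$acc parallel
!$acc loop gang
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel
```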
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90
Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s
The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, " : ", cells
enddo

exemples/opti/gol_opti2.f90
Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s
Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, " : ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90
Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s
Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
  PARALLEL (GR/WS/VS mode: each gang executes instructions redundantly)  <->  TEAMS (create teams of threads that execute redundantly)
  SERIAL (only 1 thread executes the code)                               <->  TARGET (GPU offloading, initial thread only)
  KERNELS (offloading + automatic parallelization)                       <->  does not exist in OpenMP

Implicit transfer H→D or D→H
  scalar variables, non-scalar variables                                 <->  scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
  clauses on PARALLEL/SERIAL/KERNELS                                     <->  clauses on TARGET (data movement defined from the host perspective)
  copyin()  (copy H→D when entering the construct)                       <->  map(to:)
  copyout() (copy D→H when leaving the construct)                        <->  map(from:)
  copy()    (copy H→D when entering and D→H when leaving)                <->  map(tofrom:)
  create()  (created on device, no transfers)                            <->  map(alloc:)
  present(), delete()
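To make the correspondence concrete, here is a sketch (not from the original slides) of the same explicit transfers written in both models:

```fortran
! OpenACC: a is copied H->D on entry, d is copied D->H on exit
!$ACC parallel copyin(a) copyout(d)
! ... offloaded loops ...
!$ACC end parallel

! OpenMP 5.0 equivalent with map clauses on TARGET
!$OMP target map(to:a) map(from:d)
! ... offloaded loops ...
!$OMP end target
```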
Features: OpenACC 2.7 vs OpenMP 5.0

Data region
  declare() (add data to the global data region)
  data / end data (inside the same programming unit)                     <->  TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  enter data / exit data (opening and closing anywhere in the code)      <->  TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  update self()   (update D→H)                                           <->  UPDATE FROM (copy D→H inside a data region)
  update device() (update H→D)                                           <->  UPDATE TO (copy H→D inside a data region)

Loop parallelization
  kernels (offloading + automatic parallelization of suitable loops,     <->  does not exist in OpenMP
  sync at the end of loops)
  loop gang/worker/vector (share iterations among gangs, workers         <->  distribute (share iterations among TEAMS)
  and vectors)
Features: OpenACC 2.7 vs OpenMP 5.0

Routines
  routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
  async()/wait (execute kernels/transfers asynchronously, with           <->  nowait/depend (manages asynchronism and task dependencies)
  explicit synchronization)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
PGI Compiler - -ta options
Important options for -ta
Compute capability
Each GPU generation has increased capabilities. For NVIDIA hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified
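Combining these sub-options, a possible compilation line for a V100 with managed memory (the source file name is illustrative, not from the slides) is:

```shell
pgfortran -acc -ta=tesla:cc70,managed -Minfo=accel gol.f90 -o gol
```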
Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90
program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum = 0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90
To avoid CPU-GPU communication, a data region is added:
!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
   − time spent in kernels
   − time spent in data transfers
   − how many times a kernel is executed
   − the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length × nb workers]).
Analysis - Code profiling: nvprof
Nvidia provides a command line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvprof <options> executable_file <arguments of executable file>
Type             Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  6.8519ms  [CUDA memcpy HtoD]
                  36.72%  1.85742s    300  6.1914ms     384ns  9.2907ms  [CUDA memcpy DtoH]
                   6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                   2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                   0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                   0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof
Important notes
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)
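For instance, a profiling run combining these options (executable and output file names are illustrative) could be:

```shell
nvprof --cpu-profiling on -o gol.prof -f ./gol.exe
```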
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvvp <options> executable <--args arguments of executable>
or if you want to open a profile generated with nvprof
$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
Add directives
To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.
Execution offloading
To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code is run by one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, " : ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000×1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, " : ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000×1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs and workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, " : ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000×1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)
The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).
Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
   − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
   − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
   − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
   − seq: the iterations are executed sequentially on the accelerator
   − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

               Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                              X
loop vector                 X                              X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might also see the do directive, kept for compatibility
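A sketch (not from the original slides) combining several of these clauses on a loop nest; the array and scalar names are illustrative:

```fortran
! collapse(2) merges the two tightly nested loops into a single
! iteration space; tmp is privatized per thread; total is reduced
! across all threads at the end of the loop.
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do j = 1, n
   do i = 1, n
      tmp = a(i,j) * 2.0_8
      total = total + tmp
   enddo
enddo
```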
Work sharing - Loops
!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, " : ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code located after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both merged constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, " : ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of the memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

[Figures: an example with coalescing and an example without coalescing]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: access diagrams for the three mappings. Without coalescing — gang(i), vector(j): threads T0–T7 of a worker touch memory locations that are far apart. With coalescing — gang(j), vector(i) or gang-vector(i), seq(j): consecutive threads T0–T7 touch consecutive memory locations, so their accesses are merged.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device; "variable" means a scalar or an array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.
Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its implicit data region]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing so, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
Naive version
[Diagram: two successive parallel regions, each with its own implicit data region; the arrays A (copyin), B (copyout), C (copy) and D (create) are handled at each region.]
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers
[Diagram: the same two parallel regions, each still with copyin(A), copyout(B), copy(C) and create(D).]
Let's add a data region.
Data on device
Optimize data transfers
[Diagram: a data region now encloses the two parallel regions; copyin(A), copyout(B), copy(C) and create(D) are moved to the data region.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers
[Diagram: inside the data region, the parallel regions use present for A, B, C and D; update device is used for A and update self for C where the host and device copies must be synchronized.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone — Scope
module — the program
function — the function
subroutine — the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Timeline diagram: a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device. Host and device copies then evolve independently; update self(b) copies the device value of b back to the host; when the region ends, only d is copied out D→H.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add wait(execution thread number) to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features | OpenACC 2.7 (directives/clauses — notes) | OpenMP 5.0 (directives/clauses — notes)

GPU offloading
• PARALLEL — GR/WS/VS mode: each gang executes the instructions redundantly | TEAMS — creates teams of threads that execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | (does not exist in OpenMP)

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()
Features | OpenACC 2.7 (directives/clauses — notes) | OpenMP 5.0 (directives/clauses — notes)

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(to:/from:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | (does not exist in OpenMP)
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS
Features | OpenACC 2.7 (directives/clauses — notes) | OpenMP 5.0 (directives/clauses — notes)

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
  !$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
  !$acc loop worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
  !$OMP PARALLEL DO
  do j=1, nx
    !$OMP SIMD
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
PGI Compiler - Compiler information
Information available with -Minfo=accel
program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1, 10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
  !$ACC parallel loop
  do i=1, 10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1, 10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs and workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length x number of workers]).
Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp you have to specify "-o filename" (add -f if you want to overwrite a file).
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler: nvvp
[Screenshot of the nvvp timeline view]
1 Introduction
2 ParallelismOffloaded Execution
Add directivesCompute constructsSerialKernelsTeamsParallel
Work sharingParallel loopReductionsCoalescing
RoutinesRoutines
Data ManagementAsynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading
To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1), num_workers(1), vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1, n
  do j=1, n
    ...
  enddo

  do j=1, n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1, generations
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2            ! Gang-redundant
!$ACC loop       ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (with and without automatic parallelization):
• Size: 1000×1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the behavior expected by the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do instead of loop; it is kept for compatibility.
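As an illustrative sketch of these clauses (the array a, the scalars tmp and s, and the bound n are invented for the example): collapse merges the two tightly nested loops into one iteration space, tmp is privatized per thread, and s is accumulated through a reduction.

```
! Sketch: collapse(2) merges both loop levels,
! tmp is private to each thread, s is combined with a + reduction
!$ACC parallel
!$ACC loop collapse(2) private(tmp) reduction(+:s)
do i=1,n
  do j=1,n
    tmp = 2*a(i,j)
    s = s + tmp
  enddo
enddo
!$ACC end parallel
```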
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial
!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: an example with coalescing and an example without coalescing.]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Diagrams, spread over several animated slides: thread-to-memory mapping for the three variants. Without coalescing — gang(i), vector(j): the threads T0..T7 of a worker access memory locations that are far apart, so their accesses cannot be merged. With coalescing — gang(j), vector(i): threads T0..T7 access consecutive memory locations and the accesses are merged. gang-vector(i), seq(j): each thread walks its row sequentially, so for each step the threads access contiguous locations.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) have an associated local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: host; D: device; "variable" refers to a scalar or an array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
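A minimal sketch combining these clauses (the arrays a, b, c and the bound n are illustrative): a is only read by the device, b is a device-only temporary, and c is only written back to the host.

```
! Sketch: a is read on the device (copyin), b is a device-only
! temporary (create), c is written back at region exit (copyout)
!$ACC data copyin(a) create(b) copyout(c)
!$ACC parallel loop
do i=1,n
  b(i) = 2*a(i)    ! b lives on the device only
enddo
!$ACC parallel loop
do i=1,n
  c(i) = b(i) + 1  ! c is copied D->H when the region ends
enddo
!$ACC end data
```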
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it takes the language default (C/C++: 0, Fortran: 1)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its implicit data region.]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
[Diagram — naive version: two successive parallel regions, each with its own implicit data region, both using the arrays A (copyin), B (copyout), C (copy) and D (create).]
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers

[Diagram: the same clauses (copyin, copyout, copy, create for A, B, C, D) appear on both parallel regions.]

Let's add a data region.
Data on device
Optimize data transfers

[Diagram: the clauses copyin, copyout, copy and create for A, B, C and D are moved to an enclosing data region.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers

[Diagram: the data region keeps the copyin, copyout, copy and create clauses; the two parallel regions use present for A, B, C and D, with update device for A before the second region and update self for C after it.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Same example with the update directive active:

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
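The slides give no code for this case; here is a minimal sketch mirroring the update self example (the array b is illustrative):

```
! Sketch: b is copied to the device at region entry
!$ACC data copyin(b)
! ... kernels using b ...
! b is then modified on the host
b(:) = b(:) + 1
! make the device copy consistent again: H -> D
!$ACC update device(b)
! ... subsequent kernels see the updated b ...
!$ACC end data
```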
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Timeline diagram, host vs. device: at entry of the data region — copyin(a,c) create(b) copyout(d) — a and c are copied H→D while b and d are only allocated on the device; the device then computes new values; during the region, update self(b) copies b back D→H; at exit of the data region, copyout(d) copies d back D→H, while the host copies of the other variables are left unchanged.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with async on kernel 1 and the transfer, they overlap; with async on all three, everything overlaps.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.
Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Figure: execution timeline. Kernel 1, the data transfer and Kernel 2 are launched; Kernel 2 waits for the transfer to complete.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo
exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the target options at compilation (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
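The slide's original listing is not reproduced in this transcript. The sketch below is illustrative only (array name, sizes and values are assumptions); it shows the kind of gang-level race that autocompare can catch, mirroring the loop_pb_sync example later in this course: the second loop reads values written by the first, but gangs are not synchronized between two loop constructs of the same parallel region.

```fortran
program race
  implicit none
  integer, parameter :: nx = 1000000
  real(8) :: a(nx), total
  integer :: i
  total = 0.0_8
  !$ACC parallel
  !$ACC loop gang
  do i = 1, nx
     a(i) = 1.0_8
  end do
  ! No synchronization between gangs here: some gangs may start
  ! the reduction before others have finished writing a(:)
  !$ACC loop gang reduction(+:total)
  do i = 1, nx
     total = total + a(i)
  end do
  !$ACC end parallel
  print *, total
end program race
```

With autocompare enabled, the GPU result is checked against the CPU result at the synchronization point, and the mismatch caused by the race is reported.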
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading:
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (create teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (allocated on the device, no transfers) ↔ map(alloc:)
• present(), delete()
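As a concrete reading of the explicit-transfer rows above (the loop body and array names are illustrative, not from the original slides), the same movement can be written in both models:

```fortran
! OpenACC: copy b to the device, bring a back to the host
!$ACC parallel loop copyin(b) copyout(a)
do i = 1, n
   a(i) = 2*b(i)
enddo

! OpenMP equivalent: map(to:)/map(from:) on the target construct
!$OMP target teams distribute parallel do map(to:b) map(from:a)
do i = 1, n
   a(i) = 2*b(i)
enddo
```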
Features: OpenACC 2.7 vs OpenMP 5.0

Data region:
• declare (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)
Features: OpenACC 2.7 vs OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)
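A minimal side-by-side sketch of the asynchronism row (the loops and array names are illustrative; the OpenMP version treats each target region as a deferred task, synchronized with taskwait):

```fortran
! OpenACC: two independent kernels on separate execution threads
!$ACC parallel loop async(1)
do i = 1, n
   a(i) = 1
enddo
!$ACC parallel loop async(2)
do i = 1, n
   b(i) = 2
enddo
!$ACC wait

! OpenMP: deferred target regions, synchronized with taskwait
!$OMP target teams distribute parallel do nowait map(from:a)
do i = 1, n
   a(i) = 1
enddo
!$OMP target teams distribute parallel do nowait map(from:b)
do i = 1, n
   b(i) = 2
enddo
!$OMP taskwait
```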
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 11.4_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 11.4_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 11.4_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 11.4_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 11.4_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 11.4_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
PGI Compiler - Compiler information
Information available with -Minfo=accel
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90
To avoid CPU-GPU communication, a data region is added:
!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
Analysis - Code profiling
Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling PGI_ACC_TIME
For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector length x number of workers]).
Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvprof <options> executable_file <arguments of executable file>
Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof
Important notes:
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f if you want to overwrite an existing file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof, nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
[Figure: screenshot of the NVVP graphical interface]
1 Introduction
2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels/Teams, Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.
Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.
Execution offloading
To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region, not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                  X                 |
loop worker                X                 |            X
loop vector                X                 |            X
Work sharing - Loops
Clauses:
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
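A small sketch combining collapse and private (the array names, bounds and the temporary tmp are illustrative, not from the original slides): collapse(2) merges both loops into a single iteration space to expose more parallelism, and each thread gets its own copy of tmp.

```fortran
!$ACC parallel loop collapse(2) private(tmp)
do j = 1, ny
   do i = 1, nx
      tmp = a(i,j) + b(i,j)   ! tmp is private to each thread
      c(i,j) = tmp*tmp
   enddo
enddo
```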
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles:
• Contiguous access to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: one access pattern with «coalescing», one with «no coalescing»]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 11.4_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 11.4_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 11.4_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)          439                 16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure, shown over several animation steps: memory-access diagrams for the three mappings of the loop nest above. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker access non-contiguous memory locations. With coalescing, gang(j)/vector(i): at each step, threads T0..T7 access contiguous locations. With gang-vector(i)/seq(j): each thread iterates over j, and at each step consecutive threads still access contiguous elements A(i,j), so the accesses coalesce.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is open during the lifetime of the program. This region is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of the call. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

! We copy a 2D array to the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! opens a data region around the parallel region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device

Naive version: two parallel regions (each with its own implicit data region) use four arrays, with A in copyin, B in copyout, C in copy and D in create on each region.

[Figure: each of the two parallel regions performs its own copyin/copyout/copy/create]

• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers: the two parallel regions (each with its own implicit data region) carry the same copyin/copyout/copy/create clauses for A, B, C and D. Let's add a data region around them.
Data on device

Optimize data transfers: with an enclosing data region carrying the copyin/copyout/copy/create clauses, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.

• Transfers: 4
• Allocation: 1
Data on device

Optimize data transfers: the data clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are actually done, use the update directive (update device for A, update self for C).

• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: with the declare directive the lifetime of the data equals the scope of the code region where the directive appears. For example:

Zone: Scope
Module: the program
Function: the function
Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d)]
• Entering the region: a and c are copied H→D; b and d are allocated on the device.
• Kernels modify the device copies while the host values stay unchanged, until update self(b) copies b D→H.
• Exiting the region: d is copied D→H (copyout); the device copies are released.
Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause creates several execution threads.

[Figure: time lines showing kernel 1, a data transfer and kernel 2 executed one after the other, then overlapping when async is used: "Kernel 1 and transfer async", then "Everything is async"]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer to complete before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• [Screenshot: example of a code that has a race condition]
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• PARALLEL (OpenACC): GR-WS-VS mode, each gang executes the instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• Both models: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on the device, no transfers ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare(): add data to the global data region
• data / end data: inside the same programming unit ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of the loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels and transfers asynchronously, with explicit synchronization ↔ nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length x number of workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, specify "-o filename" (add -f if you want to overwrite an existing file)
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

[Screenshot: NVVP time line view]
1 Introduction
2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels, Teams, Parallel
    Work sharing: Parallel loop, Reductions, Coalescing
    Routines
    Data Management
    Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran, OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region onthe device and generates one or more gangs Allgangs execute redundantly the instructions metin the region (gang-redundant mode) Onlyparallel loop nests with a loop directive mighthave their iterations spread among gangs
Data default behaviorbull Data arrays present inside the region and
not specified in a data clause (presentcopyin copyout etc) or a declare directiveare assigned to a copy clause They areSHARED
bull Scalar variables are implicitely assigned to afirstprivate clause They are PRIVATE
$ACC p a r a l l e l2a = 2 Gangminusredundant
$ACC loop Work shar ingdo i = 1 n enddo
7 $ACC end p a r a l l e l
Important notesbull The number of gangs workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +             oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size
program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90
Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses (level of parallelism):
• gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
• worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
• vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
• seq: the iterations are executed sequentially on the accelerator
• auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X
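To recap the table above, the three levels can be combined on one loop nest. Below is a minimal C sketch (the function name and sizes are hypothetical, not taken from the course examples); a compiler without OpenACC support simply ignores the pragmas and runs the nest sequentially with the same result.

```c
#define N 64

/* Hypothetical sketch: gang and worker parallelism on the outer loop,
   vector parallelism on the inner one. With a non-OpenACC compiler the
   pragmas are ignored and the nest runs sequentially. */
void init2d(float a[N][N])
{
    #pragma acc parallel num_workers(4) vector_length(32)
    {
        #pragma acc loop gang worker
        for (int i = 0; i < N; ++i) {
            #pragma acc loop vector
            for (int j = 0; j < N; ++j)
                a[i][j] = (float)(i + j);
        }
    }
}
```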
Work sharing - Loops
Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility.
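The collapse and reduction clauses can be sketched together in a few lines of C (a hypothetical example, not from the course material): the two tightly nested loops are merged into one iteration space and the partial sums are combined with a + reduction. Without OpenACC support the pragma is ignored and the sequential result is identical.

```c
/* Sum of an n-by-m array stored row-major: collapse(2) merges the two
   loops, reduction(+:s) combines the per-thread partial sums. */
double sum2d(int n, int m, const double *a)
{
    double s = 0.0;
    #pragma acc parallel loop collapse(2) reduction(+:s)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            s += a[i * m + j];
    return s;
}
```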
Work sharing - Loops
!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +             oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +             oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing / example without coalescing]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                 1.6

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: step-by-step diagrams of which array elements threads T0 to T7 access. Without coalescing (gang(i), vector(j)), the threads of a vector access elements that are far apart in memory. With coalescing (gang(j), vector(i)), the threads of a vector access contiguous elements. The gang-vector(i), seq(j) variant also yields contiguous accesses.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading:
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime:
A data region is created when a procedure (function or subroutine) is called, and it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
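The movement clauses can be sketched in a short C example (hypothetical, not from the course material): b is only read on the device (copyin), a is only written back (copyout), and tmp is device-only scratch (create, never transferred). Without OpenACC support the pragmas are ignored and the result is the same.

```c
#define N 1000

/* Hypothetical sketch of copyin / copyout / create on a data region. */
void scale_shift(double *a, const double *b)
{
    static double tmp[N];   /* scratch: lives on the device via create */
    #pragma acc data copyin(b[0:N]) copyout(a[0:N]) create(tmp[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            tmp[i] = 2.0 * b[i];
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = tmp[i] + 1.0;
    }   /* a is copied D->H here; tmp is just freed */
}
```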
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran:
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++:
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))  ! opens both a data region and a parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device

Naive version: two parallel regions, each with its own implicit data region, use the variables A (copyin), B (copyout), C (copy) and D (create). Each region performs its own transfers and allocations:
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions and move the copyin, copyout, copy and create clauses onto it. A, B and C are now transferred only at entry and exit of the data region:
• Transfers: 4
• Allocation: 1

The clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H):
• Transfers: 6
• Allocation: 1
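The pattern described above can be sketched in C (a hypothetical example, not one of the course files): an enclosing data region owns the transfers, the kernels only assert presence with present, and update self refreshes the host copy in the middle of the region.

```c
#define N 8

/* Hypothetical sketch: data region owns the transfers, kernels use
   present, update self synchronizes the host mid-region. */
void present_update(double *a)
{
    #pragma acc data copy(a[0:N])
    {
        #pragma acc parallel loop present(a[0:N])
        for (int i = 0; i < N; ++i)
            a[i] = (double)i;

        /* D -> H: make the device values visible on the host here */
        #pragma acc update self(a[0:N])

        #pragma acc parallel loop present(a[0:N])
        for (int i = 0; i < N; ++i)
            a[i] += 1.0;
    }   /* copy(a) performs the final D -> H transfer here */
}
```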
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update:
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device:
The variables included inside a device clause are updated H→D.
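Since this slide has no listing, here is a minimal C sketch of update device (a hypothetical example, not from the course files): the host modifies b inside the data region, so the device copy must be refreshed before the next kernel reads it.

```c
#define N 4

/* Hypothetical sketch: refresh the device copy of b (H -> D) after a
   host-side modification, then use it in a kernel that writes c. */
void refresh_device(double *b, double *c)
{
    #pragma acc data copyin(b[0:N]) copyout(c[0:N])
    {
        for (int i = 0; i < N; ++i)       /* host-side modification */
            b[i] = 10.0 * i;

        #pragma acc update device(b[0:N]) /* H -> D */

        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            c[i] = b[i] + 1.0;
    }
}
```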
Data on device - Global data regions: enter data, exit data
Important notes:
One benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare
Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of the host and device copies of variables a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are transferred H→D and b is allocated on the device; the device then computes new values; update self(b) copies b D→H in the middle of the region; d is copied D→H only when the region ends. Host values are otherwise unaffected by device-side modifications.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with async on kernel 1 and the transfer, they overlap; with everything asynchronous, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.
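The slide's example listing is not reproduced in this transcript, so here is a hypothetical C sketch of the kind of race autocompare catches: a loop-carried dependence where iteration i reads a value that iteration i-1 writes. Executed in parallel on the GPU, the read may observe either the old or the new value, so GPU and CPU results can diverge.

```c
/* Hypothetical sketch: iterations are NOT independent, so the parallel
   loop below is a race. The CPU (sequential) result differs from what a
   GPU run may produce -- exactly the situation autocompare flags. */
void difference_bug(int n, double *a)
{
    #pragma acc parallel loop   /* unsafe: a[i] depends on a[i-1] */
    for (int i = 1; i < n; ++i)
        a[i] = a[i] - a[i - 1];
}
```

Compiled without OpenACC, the pragma is ignored and the loop runs sequentially, producing the reference CPU result that autocompare would compare against.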
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +             oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ! ... (listing truncated on the slide)

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +             oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of Life

Let's optimize data transfers.

   cells = 0
   !$ACC data copy(oworld, world)
   do g=1, generations
      cells = 0
      !$ACC parallel loop
      do c=1, cols
         do r=1, rows
            oworld(r,c) = world(r,c)
         enddo
      enddo
      !$ACC parallel loop
      do c=1, cols
         do r=1, rows
            neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                    oworld(r-1,c)                +oworld(r+1,c)+&
                    oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            end if
         enddo
      enddo
      !$ACC parallel loop reduction(+:cells)
      do c=1, cols
         do r=1, rows
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation ", g, cells
   enddo
   !$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 | OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC PARALLEL — GR/WS/VS mode; each gang executes the instructions redundantly | OpenMP TEAMS — creates teams of threads that execute redundantly
• OpenACC SERIAL — only 1 thread executes the code | OpenMP TARGET — GPU offloading, initial thread only
• OpenACC KERNELS — offloading + automatic parallelization | OpenMP — does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC — scalar variables, non-scalar variables | OpenMP — scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC — clauses on PARALLEL/SERIAL/KERNELS | OpenMP — clauses on TARGET; data movement defined from the Host perspective
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()
Data region
• declare() — add data to the global data region
• data/end data — inside the same programming unit | TARGET DATA MAP(tofrom:)/END TARGET DATA — inside the same programming unit
• enter data/exit data — opening and closing anywhere in the code | TARGET ENTER DATA/TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS
Routines
• routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait — execute kernels/transfers asynchronously, with explicit synchronization | nowait/depend — manage asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

   !$acc parallel
   !$acc loop gang worker vector
   do i=1,nx
      A(i)=1.14_8*i
   end do
   !$acc end parallel

exemples/loop1D.f90

   !$acc parallel
   !$acc loop gang worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop2D.f90

OpenMP

   !$OMP TARGET
   !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
   do i=1,nx
      A(i)=1.14_8*i
   end do
   !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
   !$OMP END TARGET

exemples/loop1DOMP.f90

   !$OMP TARGET TEAMS DISTRIBUTE
   do j=1,nx
      !$OMP PARALLEL DO SIMD
      do i=1,nx
         A(i,j)=1.14_8
      end do
      !$OMP END PARALLEL DO SIMD
   end do
   !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

   !$acc parallel
   !$acc loop gang
   do k=1,nx
      !$acc loop worker
      do j=1,nx
         !$acc loop vector
         do i=1,nx
            A(i,j,k)=1.14_8
         end do
      end do
   end do
   !$acc end parallel

exemples/loop3D.f90

OpenMP

   !$OMP TARGET TEAMS DISTRIBUTE
   do k=1,nx
      !$OMP PARALLEL DO
      do j=1,nx
         !$OMP SIMD
         do i=1,nx
            A(i,j,k)=1.14_8
         end do
         !$OMP END SIMD
      end do
      !$OMP END PARALLEL DO
   end do
   !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector length × nb workers]).
Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

   Type             Time(%)  Time      Calls  Avg       Min       Max       Name
   GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                    36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                    6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%    4.0968ms  100    40.968us  40.122us  41.725us  gol_52_gpu
                    0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)
Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler: NVVP
1 Introduction
2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels, Teams, Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:

   !$ACC directive <clauses>
   ...
   !$ACC end directive

C/C++, OpenACC:

   #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

   !$ACC serial
   do i=1,n
      do j=1,n
         ...
      enddo

      do j=1,n
         ...
      enddo
   enddo
   !$ACC end serial
Offloaded execution - Directive serial

   !$ACC serial
   do g=1, generations
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                    oworld(r-1,c)                +oworld(r+1,c)+&
                    oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            end if
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation ", g, cells
   enddo
   !$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

   !$ACC kernels
   do i=1,n
      do j=1,n
         ...
      enddo

      do j=1,n
         ...
      enddo
   enddo
   !$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

   !$ACC kernels
   do g=1, generations
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                    oworld(r-1,c)                +oworld(r+1,c)+&
                    oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            end if
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation ", g, cells
   enddo
   !$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

   !$ACC parallel
   a = 2            ! Gang-redundant
   !$ACC loop       ! Work sharing
   do i = 1, n
      ...
   enddo
   !$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region
Execution offloading - parallel directive

   !$ACC parallel
   do g=1, generations
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                    oworld(r-1,c)                +oworld(r+1,c)+&
                    oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            end if
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation ", g, cells
   enddo
   !$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

   program param
      integer :: a(1000)
      integer :: i

      !$ACC parallel num_gangs(10) num_workers(1) &
      !$ACC& vector_length(128)
      print *, "Hello! I am a gang"
      do i=1,1000
         a(i) = i
      enddo
      !$ACC end parallel
   end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
   - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
   - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
   - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
   - seq: the iterations are executed sequentially on the accelerator
   - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

                  Work sharing (loop iterations)   Activation of new threads
   loop gang      X
   loop worker    X                                X
   loop vector    X                                X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
Work sharing - Loops

   !$ACC parallel
   do g=1, generations
      !$ACC loop
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                    oworld(r-1,c)                +oworld(r+1,c)+&
                    oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            end if
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation ", g, cells
   enddo
   !$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

   !$acc serial
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

   !$acc parallel num_gangs(10)
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

   !$acc parallel num_gangs(10)
   !$acc loop gang
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

   !$acc parallel num_gangs(10)
   !$acc loop gang worker
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

   !$acc parallel num_gangs(10)
   !$acc loop gang vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

   !$acc parallel num_gangs(10)
   !$acc loop gang worker vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

   !$acc parallel num_gangs(10)
   !$acc loop worker
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

   !$acc parallel num_gangs(10)
   !$acc loop vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

   !$acc parallel num_gangs(10)
   !$acc loop worker vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

   !$acc parallel
   !$acc loop gang
   do i=1,nx
      A(i)=1.0_8
   end do
   !$acc loop gang reduction(+:somme)
   do i=nx,1,-1
      somme=somme+A(i)
   end do
   !$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

   !$acc parallel
   !$acc loop worker vector
   do i=1,nx
      A(i)=1.0_8
   end do
   !$acc loop worker vector reduction(+:somme)
   do i=nx,1,-1
      somme=somme+A(i)
   end do
   !$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

   program fused
      integer :: a(1000), b(1000), i
      !$ACC kernels <clauses>
      !$ACC loop <clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
      !$ACC end kernels

      ! The construct above is equivalent to
      !$ACC kernels loop <kernels and loop clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
   end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described below to get the final value of the variable.

Reduction operators:

   Operation   Effect        Language(s)
   +           Sum           Fortran, C(++)
   *           Product       Fortran, C(++)
   max         Maximum       Fortran, C(++)
   min         Minimum       Fortran, C(++)
   &           Bitwise and   C(++)
   |           Bitwise or    C(++)
   &&          Logical and   C(++)
   ||          Logical or    C(++)
   iand        Bitwise and   Fortran
   ior         Bitwise or    Fortran
   ieor        Bitwise xor   Fortran
   .and.       Logical and   Fortran
   .or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

   !$ACC parallel
   do g=1, generations
      !$ACC loop
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                    oworld(r-1,c)                +oworld(r+1,c)+&
                    oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            end if
         enddo
      enddo
      cells = 0
      !$ACC loop reduction(+:cells)
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation ", g, cells
   enddo
   !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing / example without coalescing]
Parallelism - Memory accesses and coalescing

   !$acc parallel
   !$acc loop gang
   do i=1,nx
      !$acc loop vector
      do j=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop_nocoalescing.f90

   !$acc parallel
   !$acc loop gang
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop_coalescing.f90

   !$acc parallel
   !$acc loop gang vector
   do i=1,nx
      !$acc loop seq
      do j=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop_coalescing.f90

               Without Coalescing   With Coalescing
   Time (ms)   439                  16

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping diagrams. Without coalescing (gang(i), vector(j)), the threads T0..T7 of a worker access non-contiguous memory locations. With coalescing (gang(j), vector(i), or gang-vector(i) with seq(j)), the threads T0..T7 access contiguous memory locations.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

   program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
      !$ACC parallel
      !$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
      !$ACC end parallel
      print *, array(1,10)
   contains
      subroutine fill(array, s, i)
         integer, intent(out) :: array(:,:)
         integer, intent(in)  :: s, i
         integer :: j
         do j=1,s
            array(i,j) = 2
         enddo
      end subroutine fill
   end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

   program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
      !$ACC parallel copyout(array)
      !$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
      !$ACC end parallel
      print *, array(1,10)
   contains
      subroutine fill(array, s, i)
         !$ACC routine seq
         integer, intent(out) :: array(:,:)
         integer, intent(in)  :: s, i
         integer :: j
         do j=1,s
            array(i,j) = 2
         enddo
      end subroutine fill
   end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading:
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime:
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

   ! We copy a 2d array on the GPU for matrix a
   ! In Fortran we can omit the shape: copy(a)
   !$ACC parallel loop copy(a(1:1000,1:1000))
   do i=1,1000
      do j=1,1000
         a(i,j) = 0
      enddo
   enddo
   ! We copyout columns 100 to 199 included to the host
   !$ACC parallel loop copy(a(:,100:199))
   do i=1,1000
      do j=100,199
         a(i,j) = 42
      enddo
   enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

   // We copy the array a by giving the first element and the size of the array
   #pragma acc parallel loop copy(a[0:1000][0:1000])
   for (int i=0; i<1000; ++i)
      for (int j=0; j<1000; ++j)
         a[i][j] = 0;
   // Copy: copy columns 99 to 198
   #pragma acc parallel loop copy(a[:1000][99:100])
   for (int i=0; i<1000; ++i)
      for (int j=99; j<199; ++j)
         a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region implicitly opens a data region around it.]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entered and exited each time, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entered and exited once, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region to before the first loop.
Data on device

[Diagram] Naive version: two successive parallel regions (each with its implicit data region) use four arrays: A (copyin), B (copyout), C (copy) and D (create). Each region performs its own transfers and allocations.
• Transfers: 8
• Allocations: 2
Data on device

[Diagram] Optimizing data transfers: the same two parallel regions keep their clauses (copyin, copyout, copy, create) for A, B, C and D. Let's add a data region.
Data on device

[Diagram] With an enclosing data region carrying the clauses, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1
Data on device

[Diagram] The clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done (update device for H→D, update self for D→H), use the update directive.
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The host copy of the a array is not updated before the end of the data region.
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The host copy of the a array is up to date after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
Variables listed in the device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is that data transfers to and from the device can be managed at a different place from where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the program unit where the directive appears. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Timeline

[Diagram: a data region is opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device. Kernels then modify the device copies while the host values stay unchanged. An update self(b) copies the device value of b back to the host during the region. At exit, only d is copied D→H; the host values of a, b and c keep their last host-side values.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since data clauses may also appear on kernels and parallel constructs, if an enclosing data region is opened with data, you should use update to avoid unexpected behavior. The same applies to data regions opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel when they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: without async, kernel 1, the data transfer and kernel 2 execute one after the other; with async on kernel 1, it overlaps the transfer; with async everywhere, kernel 1, the transfer and kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution-thread-number) clause to the directive that has to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: kernel 1 overlaps the data transfer; kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 → OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) → TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) → TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) → does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables → scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS → clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) → map(to:)
• copyout() (copy D→H when leaving the construct) → map(from:)
• copy() (copy H→D when entering and D→H when leaving) → map(tofrom:)
• create() (allocated on the device, no transfers) → map(alloc:)
• present(), delete()
Features: OpenACC 2.7 → OpenMP 5.0 (directives/clauses, with notes)

Data regions
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) → TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) → TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) → UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) → UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) → does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vector lanes) → distribute (share iterations among TEAMS)
Features: OpenACC 2.7 → OpenMP 5.0 (directives/clauses, with notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels and transfers asynchronously, explicit synchronization) → nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.
$ nvprof ltoptionsgt executable_file ltarguments of executable filegt
Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp, specify "-o filename" (add -f if you want to overwrite an existing file)
Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
Offloaded Execution
Add directives
Compute constructs: Serial, Kernels/Teams, Parallel
Work sharing: Parallel loop, Reductions, Coalescing
Routines
Data Management
Asynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Add directives
To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading
To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang containing one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region that are not listed in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels In the compiler we trust
The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not listed in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the generated kernels (gangs, workers, vector size) are independent of each other
• Loop nests are executed consecutively
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions of the region redundantly (gang-redundant mode). Only loop nests annotated with a loop directive have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not listed in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! gang-redundant
!$ACC loop     ! work sharing
do i = 1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, the number of workers and the vector size are constant inside the region
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and what their vector size is. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure that is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.
Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(n)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do used for compatibility
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both original constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation chosen among those described in Table 1 is executed to get the final value of the variable.

Operations:

   Operation   Effect        Language(s)        Operation   Effect        Language(s)
   +           Sum           Fortran/C(++)      iand        Bitwise and   Fortran
   *           Product       Fortran/C(++)      ior         Bitwise or    Fortran
   max         Maximum       Fortran/C(++)      ieor        Bitwise xor   Fortran
   min         Minimum       Fortran/C(++)      .and.       Logical and   Fortran
   &           Bitwise and   C(++)              .or.        Logical or    Fortran
   |           Bitwise or    C(++)
   &&          Logical and   C(++)
   ||          Logical or    C(++)

Table 1: reduction operators

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

Example with "coalescing" / example with "no coalescing":
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

             Without coalescing   With coalescing
  Time (ms)  439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0..T7 on array A. Without coalescing: gang(i)/vector(j) — consecutive threads access non-contiguous memory locations. With coalescing: gang(j)/vector(i), or gang-vector(i)/seq(j) — consecutive threads access contiguous memory locations.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading:
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime:
A data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device; if so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran:
The array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++:
The array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No! You have to move the beginning of the data region before the first parallel loop.
Data on device

Naive version:
[Figure: two parallel regions, each with its own implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create); each region performs its own transfers and allocations.]
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers:
[Figure: the same two parallel regions, now enclosed in an explicit data region carrying the copyin/copyout/copy/create clauses.]
Let's add a data region: A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device

Optimize data transfers:
[Figure: inside the data region, the parallel regions now see A, B, C and D as present; updates are forced with update device and update self where needed.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update:
The a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device:
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes:
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at another place in the code than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

   Zone         Scope
   module       the program
   function     the function
   subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of a data region between host and device. The host starts with a=0, b=0, c=0, d=0 and later holds a=10, b=-4, c=42, d=0. Entering the region with copyin(a,c), create(b), copyout(d): on the device a=10 and c=42, while b and d are allocated but not transferred. The device computes b=26 and d=158; update self(b) copies b back D→H while d on the host is still 0. Exiting the region, copyout(d) copies d=158 back to the host.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution serializes kernel 1, the data transfer and kernel 2; with async on kernel 1 and the transfer, they overlap; with async on everything, both kernels and the transfer overlap.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer to complete.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 → OpenMP 5.0

GPU offloading:
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) → TEAMS (creates teams of threads which execute redundantly)
• SERIAL (only 1 thread executes the code) → TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) → X (does not exist in OpenMP)

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables (in both models)

Explicit transfers H→D or D→H inside explicit parallel regions (data movement defined from the host perspective):
• clauses on PARALLEL/SERIAL/KERNELS → clauses on TARGET
• copyin() (copy H→D when entering the construct) → map(to:)
• copyout() (copy D→H when leaving the construct) → map(from:)
• copy() (copy H→D when entering and D→H when leaving) → map(tofrom:)
• create() (created on device, no transfers) → map(alloc:)
• present(), delete()
Features: OpenACC 2.7 → OpenMP 5.0 (continued)

Data regions:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) → TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) → TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) → UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) → UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) → X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) → distribute (share iterations among TEAMS)
Features: OpenACC 2.7 → OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) → nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Analysis - Graphical profiler nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof
Analysis - Graphical profiler NVVP
1 Introduction
2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels/Teams, Parallel
      Work sharing: Parallel loop, Reductions, Coalescing
      Routines
      Data Management
      Asynchronism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between Fortran and C/C++ code.

Fortran/OpenACC

!$ACC directive <clauses>
...
!$ACC end directive

C/C++/OpenACC

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, the number of workers and the vector size are constant inside the region
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification; with autoparallelization left on (acc=autopar), the compiler parallelizes the inner loops itself.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do instead of loop, kept for compatibility
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)
Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the active worker of each gang
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)
• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)
• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

(exemples/fused.f90)
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran/C(++)
*          Product      Fortran/C(++)
max        Maximum      Fortran/C(++)
min        Minimum      Fortran/C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)
Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing / example without coalescing]
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: animated thread-mapping diagrams for the three versions above. Without coalescing, gang(i)/vector(j): threads T0-T7 access memory with a stride of nx. With coalescing, gang(j)/vector(i): threads T0-T7 access consecutive memory locations. With gang-vector(i)/seq(j): each thread walks its own row sequentially.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

(exemples/routine_wrong.f90)
Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

(exemples/routine.f90)
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

(H: Host, D: Device; variable: scalar or array)

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you give the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

(exemples/arrayshape.c)

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

(exemples/parallel_data.f90)
(The parallel region implicitly opens a data region.)

With a parallel region inside a loop:

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

(exemples/parallel_data_multi.f90)

Notes
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)
Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

(exemples/parallel_data_single.f90)

Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device - Optimizing transfers

[Figure: two successive parallel regions (each with its own implicit data region) using arrays A, B, C and D with copyin, copyout, copy and create clauses]

Naive version
• Each parallel region transfers its own data: 8 transfers, 2 allocations

Let's add a data region enclosing both parallel regions
• A, B and C are now transferred at entry and exit of the data region: 4 transfers, 1 allocation

• The data clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device, update self): 6 transfers, 1 allocation
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

Without update:

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

(exemples/update_dir_comment.f90)

The a array is not updated on the host before the end of the data region.
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0.
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables listed in the device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the program unit where it appears. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b is allocated on the device; the device then computes with its own copies while the host copies keep their old values; update self(b) copies b back D→H during the region; at the end of the data region, copyout(d) transfers d D→H.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Figure: three time lines. Synchronous: kernel 1, the data transfer and kernel 2 run one after the other. Kernel 1 and transfer async: kernel 1 overlaps with the data transfer. Everything async: kernel 1, the data transfer and kernel 2 all overlap.]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.
[Figure: time line where kernel 1 runs on one queue while the data transfer runs on another; kernel 2 waits for the transfer to complete before starting.]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• PARALLEL (OpenACC): GR/WS/VS mode, each gang executes the instructions redundantly — TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code — TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization — does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (both standards)

Explicit transfer H→D or D→H inside explicit parallel regions
Clauses on PARALLEL/SERIAL/KERNELS (OpenACC); clauses on TARGET (OpenMP, data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct — map(to:)
• copyout(): copy D→H when leaving the construct — map(from:)
• copy(): copy H→D when entering and D→H when leaving — map(tofrom:)
• create(): created on device, no transfers — map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
• declare(): adds data to the global data region
• data / end data: inside the same programming unit — TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code — TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H — UPDATE FROM: copy D→H inside a data region
• update device(): update H→D — UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of loops — does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors — distribute: share iterations among TEAMS
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit synchronization — nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
• Thibaut Véry — thibaut.very@idris.fr
• Rémi Lacroix — remi.lacroix@idris.fr
Analysis - Graphical profiler NVVP
1 Introduction
2 ParallelismOffloaded Execution
Add directivesCompute constructsSerialKernelsTeamsParallel
Work sharingParallel loopReductionsCoalescing
RoutinesRoutines
Data ManagementAsynchronism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
$ACC s e r i a ldo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end s e r i a l
exemplesgol_serialf90
Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels
$ACC kerne lsdo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end kerne ls
exemplesgol_kernelsf90
Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause       | Work sharing (loop iterations) | Activation of new threads
loop gang    | X                              |
loop worker  | X                              | X
loop vector  | X                              | X
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial
!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS
!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS
!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS
!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP
!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP
!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS
!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP
!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP
!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations, and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 978452.13 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 1000000.00 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to get the final value of the variable.

Reduction operators
Operation | Effect       | Language(s)
+         | Sum          | Fortran, C(++)
*         | Product      | Fortran, C(++)
max       | Maximum      | Fortran, C(++)
min       | Minimum      | Fortran, C(++)
&         | Bitwise and  | C(++)
|         | Bitwise or   | C(++)
&&        | Logical and  | C(++)
||        | Logical or   | C(++)
iand      | Bitwise and  | Fortran
ior       | Bitwise or   | Fortran
ieor      | Bitwise xor  | Fortran
.and.     | Logical and  | Fortran
.or.      | Logical or   | Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with "coalescing" / Example with "no coalescing"
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)  439                | 16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figures: thread-to-element mapping for the three versions. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker access elements separated by a stride, so their accesses cannot be merged. With coalescing, gang(j)/vector(i): threads T0..T7 access consecutive elements and the accesses are merged. gang-vector(i)/seq(j): each thread walks sequentially through its own row.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it has to be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array, s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it has to be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array, s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses

H: Host, D: Device; "variable" means a scalar or an array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p, for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! parallel region + associated data region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No! You have to move the beginning of the data region before the first parallel loop.
Data on device

Naive version
[Figure: two parallel regions, each with its own implicit data region; variables A, B, C and D appear in copyin, copyout, copy and create clauses on both regions.]
• Transfers: 8
• Allocations: 2

Optimize data transfers
Let's add a data region enclosing the two parallel regions.
[Figure: the copyin, copyout, copy and create clauses are moved from the parallel regions to the enclosing data region.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

[Figure: the parallel regions now use present clauses; an update device directive refreshes the device copy and an update self directive refreshes the host copy between the regions.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of host and device memory for variables a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred to the device while b and d are only allocated there. Kernels then modify the values on the device. An update self(b) copies b D→H during the region. At the exit of the data region, copyout(d) transfers d back to the host.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, Kernel 1, the transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on one execution thread, the data transfer on another; Kernel 2 waits for the transfer.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target: -ta=tesla:cc60,autocompare
• Example of code that has a race condition.
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)           +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)           +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)           +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)
GPU offloading | PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS: creates teams of threads that execute redundantly
 | SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
 | KERNELS: offloading + automatic parallelization | Does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables | scalar variables, non-scalar variables
Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET (data movement defined from the host perspective)
 | copyin(): copy H→D when entering the construct | map(to:)
 | copyout(): copy D→H when leaving the construct | map(from:)
 | copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
 | create(): allocated on the device, no transfers | map(alloc:)
 | present(), delete() |
Features | OpenACC 2.7 | OpenMP 5.0
Data region | declare(): add data to the global data region |
 | data / end data: inside the same programming unit | TARGET DATA MAP(to:/from:) / END TARGET DATA: inside the same programming unit
 | enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
 | update self(): update D→H | UPDATE FROM(): copy D→H inside a data region
 | update device(): update H→D | UPDATE TO(): copy H→D inside a data region
Loop parallelization | kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | Does not exist in OpenMP
 | loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Features | OpenACC 2.7 | OpenMP 5.0
Routines | routine [seq/vector/worker/gang]: compile a device routine with the specified level of parallelism |
Asynchronism | async()/wait: execute kernels and transfers asynchronously, with explicit synchronization | nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
1 Introduction

2 Parallelism
  Offloaded Execution: Add directives; Compute constructs (Serial, Kernels/Teams, Parallel)
  Work sharing: Parallel loop; Reductions; Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts
Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1), num_workers(1), vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
$ACC s e r i a ldo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end s e r i a l
exemplesgol_serialf90
Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region, not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels
$ACC kerne lsdo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end kerne ls
exemplesgol_kernelsf90
Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2            ! gang-redundant
!$ACC loop       ! work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive
$ACC p a r a l l e ldo g=1 generat ions
do r =1 rowsdo c=1 co ls
35 oworld ( r c ) = wor ld ( r c )enddo
enddodo r =1 rows
do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
45 else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
enddo50 c e l l s = 0
do r =1 rowsdo c=1 co ls
c e l l s = c e l l s + wor ld ( r c )enddo
55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end p a r a l l e l
exemplesgol_parallelf90
acc=noautoparTestbull Size 1000x1000bull Generations 100bull Execution time 120467 s (noautopar)bull Execution time 5067 s (autopar)
The sequential code is executed redundantly by allgangsThe compiler option acc=noautopar is activated toreproduce the expected behavior of the OpenACCspecification
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77
Execution offloading - parallel directive
acc=autopar activated acc=noautoparTestbull Size 1000x1000bull Generations 100bull Execution time 120467 s (noautopar)bull Execution time 5067 s (autopar)
The sequential code is executed redundantly by allgangsThe compiler option acc=noautopar is activated toreproduce the expected behavior of the OpenACCspecification
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size
program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90
Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.
Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies
Clause      | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X
Work sharing - Loops
Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do instead of loop, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+   &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
- Active threads: 1
- Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
- Iterations are shared among the active workers of each gang
- Active threads: 10 x workers
- Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
- Iterations are shared among the threads of the worker of all gangs
- Active threads: 10 x vector length
- Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x workers x vector length
- Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
- Each gang is assigned all the iterations, which are shared among the workers
- Active threads: 10 x workers
- Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vector length
- Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
- Each gang is assigned all the iterations; the grid of threads distributes the work
- Active threads: 10 x workers x vector length
- Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
- Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
- Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:
program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.
Reduction operators

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran
Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+   &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles:
- Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with "coalescing" / example with "no coalescing":
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) | 439                | 16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Diagrams: memory-access patterns of threads T0 to T7 on array A.
- Without coalescing, gang(i)/vector(j): consecutive threads of a worker access non-contiguous memory locations.
- With coalescing, gang(j)/vector(i): consecutive threads access contiguous memory locations (Fortran is column-major, so the first index is the contiguous one).
- gang-vector(i)/seq(j): each thread walks its own row sequentially while neighboring threads cover contiguous values of i.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions:

- Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
- Global region: an implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.
- Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
- Data region associated to programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
- The shape must be specified when using a slice.
- If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region is enclosed in an implicit data region.)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?
Optimal? No! You have to move the beginning of the data region before the first parallel region, so that the initialization of a is also covered by it.
Data on device
Optimizing data transfers: two successive parallel regions, each with its own implicit data region, using arrays A, B, C and D.

Naive version: each parallel region transfers its own data. A (copyin), B (copyout), C (copy) and D (create) are handled by both regions.
- Transfers: 8
- Allocations: 2

Let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region, and D is allocated once.
- Transfers: 4
- Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause in the parallel regions. To make sure the updates are done, use the update directive (update device H→D, update self D→H).
- Transfers: 6
- Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is updated on the host at the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes:
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Time line diagram: host and device copies of the variables a, b, c, d.
- Entering the data region with copyin(a,c) create(b) copyout(d): a and c are transferred H→D, b and d are allocated on the device.
- Kernels then modify the device copies; changes on one side are not visible on the other until an update or the end of the region.
- update self(b) refreshes the host copy of b D→H.
- At exit of the data region, copyout(d) transfers d D→H; host-side changes made in the meantime to a, b, c remain on the host.]
Data on device - Important notes
- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created, and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: synchronous execution serializes kernel 1, the data transfer and kernel 2; with async on kernel 1 and the transfer, the two overlap; with everything async, all three run concurrently.]
! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread-number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
- The argument of the wait clause or directive is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Diagram: kernel 1 and the data transfer run concurrently on queues 1 and 2; kernel 2 waits for the transfer (queue 2) before starting.]
! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
- To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+   &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/optigol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+   &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+   &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | OpenMP TEAMS: creates teams of threads which execute redundantly
• OpenACC SERIAL: only 1 thread executes the code | OpenMP TARGET: GPU offloading, initial thread only
• OpenACC KERNELS: offloading + automatic parallelization | does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)

Data movement clauses
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on the device, no transfers | map(alloc:)
• present(), delete()
Data regions
• declare(): adds data to the global data region
• data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of the loops | does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async() / wait: execute kernels/transfers asynchronously, with explicit synchronization | nowait / depend: manage asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between a Fortran and a C/C++ code.

Fortran/OpenACC:
   !$ACC directive <clauses>
   ...
   !$ACC end directive

C/C++/OpenACC:
   #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.
Execution offloading

To run on the device, you need to use one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers, and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, the number of workers, and the vector size are constant inside the region.
Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism; this point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility.
Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial
   !$acc serial
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end serial
   exemples/loop_serial.f90
   • Active threads: 1
   • Number of operations: nx

Execution GRWSVS
   !$acc parallel num_gangs(10)
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GRWSVS.f90
   • Redundant execution by gang leaders
   • Active threads: 10
   • Number of operations: 10 × nx

Execution GPWSVS
   !$acc parallel num_gangs(10)
   !$acc loop gang
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GPWSVS.f90
   • Each gang executes a different block of iterations
   • Active threads: 10
   • Number of operations: nx
Execution GPWPVS
   !$acc parallel num_gangs(10)
   !$acc loop gang worker
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GPWPVS.f90
   • Iterations are shared among the active workers of each gang
   • Active threads: 10 × workers
   • Number of operations: nx

Execution GPWSVP
   !$acc parallel num_gangs(10)
   !$acc loop gang vector
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GPWSVP.f90
   • Iterations are shared among the threads of the single worker of each gang
   • Active threads: 10 × vector length
   • Number of operations: nx

Execution GPWPVP
   !$acc parallel num_gangs(10)
   !$acc loop gang worker vector
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GPWPVP.f90
   • Iterations are shared on the grid of threads of all gangs
   • Active threads: 10 × workers × vector length
   • Number of operations: nx

Execution GRWPVS
   !$acc parallel num_gangs(10)
   !$acc loop worker
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GRWPVS.f90
   • Each gang is assigned all the iterations, which are shared among its workers
   • Active threads: 10 × workers
   • Number of operations: 10 × nx

Execution GRWSVP
   !$acc parallel num_gangs(10)
   !$acc loop vector
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GRWSVP.f90
   • Each gang is assigned all the iterations, which are shared among the threads of the active worker
   • Active threads: 10 × vector length
   • Number of operations: 10 × nx

Execution GRWPVP
   !$acc parallel num_gangs(10)
   !$acc loop worker vector
   do i=1,nx
      A(i) = B(i)*i
   end do
   !$acc end parallel
   exemples/loop_GRWPVP.f90
   • Each gang is assigned all the iterations, and its grid of threads distributes the work
   • Active threads: 10 × workers × vector length
   • Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. The risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present with dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators:
Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions: the variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but this is lacking in the PGI implementation (for the moment).
Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should perform contiguous memory accesses.

[Figures: access pattern with coalescing / without coalescing]
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)          439                 16

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0..T7 on the array A for the three versions. Without coalescing, gang(i)/vector(j): consecutive threads access elements one row apart. With coalescing, gang(j)/vector(i): consecutive threads access contiguous elements. gang-vector(i)/seq(j): each thread walks one row sequentially.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened for the lifetime of the program. It is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible there, use the declare directive.

Notes: the action taken for the data inside these regions depends on the clause in which it appears.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You might see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element
// and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
   for (int j = 0; j < 1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
   for (int j = 99; j < 199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to that of the language (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is nested inside its implicit data region]
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal.
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region to before the first parallel region.
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device

Optimize data transfers: let's add a data region enclosing both parallel regions and move the copyin (A), copyout (B), copy (C) and create (D) clauses onto it.

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device

Optimize data transfers: inside the data region, the parallel regions now mark A, B, C and D as present, with an update device before the first region and an update self after the second one where fresh values are needed.

The data clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are actually done, use the update directive.
• Transfers: 6
• Allocations: 1
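A minimal sketch of this pattern (the array names A-D come from the diagram; the loop bodies and bound n are placeholders, not the slide's actual code):

```fortran
!$ACC data copyin(A), copyout(B), copy(C), create(D)
!$ACC parallel loop present(A, B, C, D)
do i = 1, n
   B(i) = A(i) + C(i)      ! first compute region
enddo
! refresh A on the device if the host modified it in between
!$ACC update device(A)
!$ACC parallel loop present(A, B, C, D)
do i = 1, n
   C(i) = A(i) * B(i)      ! second compute region
enddo
! bring C back to the host before the end of the region
!$ACC update self(C)
!$ACC end data
```

Only A, B and C ever move between host and device; D lives solely on the device.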
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update:
The a array is initialized on the host after the update directive.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.
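As a sketch (hypothetical array b of size s, not the slide's actual example), the device clause mirrors the self example in the opposite direction:

```fortran
!$ACC data create(b)
! ... kernels working on the device copy of b ...
b(:) = 0                      ! host-side modification of b
! push the new host values to the device copy: H -> D
!$ACC update device(b)
!$ACC parallel loop
do i = 1, s
   b(i) = b(i) + 1            ! now operates on the refreshed values
enddo
!$ACC end data
```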
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D; b and d are allocated on the device without transfer. Kernels then modify the device copies while the host copies keep their old values. An update self(b) refreshes the host copy of b in the middle of the region. At the exit of the data region, only d is copied back D→H; the device copies are released.]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, Kernel 1, the transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU executions.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:
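The slide's actual code is not in the transcript; as an illustration (hypothetical), a loop like the following one contains a race condition that autocompare can expose, since each gang reads neighbours that other gangs may already have overwritten:

```fortran
!$ACC parallel loop
do i = 2, n-1
   ! in-place stencil: a(i-1) may already hold its new value,
   ! so the result depends on the execution order of the threads
   a(i) = a(i-1) + a(i+1)
enddo
```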
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• OpenACC declare() (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features: OpenACC 2.7 vs OpenMP 5.0

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-single
• Vector-single
The programmer has to share the work manually.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77
Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo

   do j=1,n
      ...
   enddo
enddo
!$ACC end serial
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77
Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo

   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77
Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2            ! Gang-redundant
!$ACC loop       ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77
Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10), num_workers(1), &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.
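For instance, the clauses can be combined on one directive (a hypothetical nest, not one of the course examples: tmp is made private to each thread, both loops are merged, and the partial sums of s are combined at the end):

```fortran
!$ACC parallel loop collapse(2) private(tmp) reduction(+:s)
do j = 1, n
   do i = 1, n
      tmp = a(i,j) * b(i,j)   ! private scratch value per thread
      s = s + tmp             ! reduction variable
   enddo
enddo
```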
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77
Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77
Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 9784521.3 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 10000000.0 (right value)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77
Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing; example without coalescing.]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: memory-access pattern of threads T0–T7 for each loop mapping]
Without coalescing: gang(i), vector(j)
With coalescing: gang(j), vector(i) — or gang-vector(i), seq(j)
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it must be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it must be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) have an associated local data region.

Global region
An implicit data region is open during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses
H: Host, D: Device; each clause takes a variable (scalar or array).

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You give the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0; Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in its own (implicit) data region.
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• The compute region is reached 100 times.
• The data region is reached 100 times (entered and exited each time, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing so, you make the variables listed in its clauses visible on the device. Use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• The compute region is reached 100 times.
• The data region is reached 1 time (entered and exited once, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
Naive version
[Figure: two successive parallel regions, each with its own implicit data region; in each region, A is copied in (copyin), B is copied out (copyout), C is copied both ways (copy) and D is created (create)]
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers
[Figure: the same two parallel regions, each still performing its own copyin (A), copyout (B), copy (C) and create (D)]
Let's add a data region.
Data on device
Optimize data transfers
[Figure: the two parallel regions are now enclosed in a single data region carrying the copyin (A), copyout (B), copy (C) and create (D) clauses]
A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers
[Figure: inside the data region, both parallel regions use present for A, B, C and D; between the two regions an update device refreshes the device copy (H→D) and an update self refreshes the host copy (D→H)]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
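The diagram's pattern can be sketched in Fortran as follows (a minimal outline, not code from the course; the array names a–d and the size n are placeholders matching the diagram):

```fortran
!$ACC data copyin(a) copyout(b) copy(c) create(d)
!$ACC parallel loop present(a, c, d)   ! no transfer here: the data region owns the device copies
do i = 1, n
   d(i) = a(i) + c(i)
end do
a = a + 1                              ! a is modified on the host...
!$ACC update device(a)                 ! ...so push the new values H->D before the next kernel
!$ACC parallel loop present(a, b, d)
do i = 1, n
   b(i) = a(i) * d(i)
end do
!$ACC update self(c)                   ! pull c D->H if the host needs it before the region ends
!$ACC end data                         ! b and c are copied back D->H here
```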
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)

   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)

   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables listed in a device clause are updated H→D.
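This slide gives no code for the device clause; a minimal sketch (hypothetical array b, mirroring the update self example) could be:

```fortran
!$ACC data create(b)
do i = 1, n
   b(i) = i             ! b is (re)initialized on the host...
enddo
!$ACC update device(b)  ! ...so push the new values H->D before a kernel uses them
!$ACC parallel loop
do i = 1, n
   b(i) = b(i) + 11
enddo
!$ACC end data
```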
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Figure: time line of a data region between host and device]
At entry (copyin(a,c) create(b) copyout(d)): a and c are copied H→D, b and d are allocated on the device. Kernels then update the device copies while the host values stay unchanged. An update self(b) copies the current device value of b back D→H while the region is still open. When the region ends, d is copied back D→H (copyout); host variables without a copyout or an update keep their old values.
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution (kernel 1, then data transfer, then kernel 2) vs kernel 1 overlapped with the transfer (async) vs everything asynchronous]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution-thread number) clause to the directive that has to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the data transfer before starting]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, add autocompare to the target options at compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
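The example itself is missing from the transcript; a minimal sketch of such a race (a hypothetical reconstruction, not the original exemple file) is a parallelized loop with a loop-carried dependency:

```fortran
program race
   implicit none
   integer, parameter :: n = 1000
   integer :: i
   real(8) :: a(n)
   a(:) = 0.0_8
   !$ACC parallel loop   ! parallelizing despite the dependency on a(i-1) creates a race
   do i = 2, n
      a(i) = a(i-1) + 1.0_8
   end do
   print *, a(n)         ! sequential CPU execution gives 999.0; the GPU value depends on thread ordering
end program race
```

Compiled with the autocompare sub-option, the runtime executes the kernel on both CPU and GPU and reports the first mismatching values.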
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features          OpenACC 2.7                                 OpenMP 5.0
GPU offloading    PARALLEL: GR/WS/VS mode - each gang         TEAMS: create teams of threads,
                  executes instructions redundantly           execute redundantly
                  SERIAL: only 1 thread executes the code     TARGET: GPU offloading, initial
                                                              thread only
                  KERNELS: offloading + automatic             (does not exist in OpenMP)
                  parallelization

Implicit transfer H→D or D→H: scalar variables, non-scalar variables (in both models).

Explicit transfers H→D or D→H inside explicit parallel regions: clauses on PARALLEL/SERIAL/KERNELS (OpenACC), clauses on TARGET (OpenMP; data movement defined from the host perspective):

OpenACC              Effect                                   OpenMP 5.0
copyin()             copy H→D when entering the construct     map(to:)
copyout()            copy D→H when leaving the construct      map(from:)
copy()               copy H→D when entering and D→H           map(tofrom:)
                     when leaving
create()             created on device, no transfers          map(alloc:)
present(), delete()
Features          OpenACC 2.7                                 OpenMP 5.0
Data region       declare(): add data to the global
                  data region
                  data / end data: inside the same            TARGET DATA MAP(to/from:) /
                  programming unit                            END TARGET DATA: inside the
                                                              same programming unit
                  enter data / exit data: opening and         TARGET ENTER DATA / TARGET EXIT
                  closing anywhere in the code                DATA: opening and closing
                                                              anywhere in the code
                  update self(): update D→H                   UPDATE FROM(): copy D→H inside
                                                              a data region
                  update device(): update H→D                 UPDATE TO(): copy H→D inside
                                                              a data region
Loop              kernels: offloading + automatic             (does not exist in OpenMP)
parallelization   parallelization of suitable loops,
                  sync at the end of loops
                  loop gang/worker/vector: share              distribute: share iterations
                  iterations among gangs, workers             among TEAMS
                  and vectors
Features          OpenACC 2.7                                 OpenMP 5.0
routines          routine [seq|vector|worker|gang]:
                  compile a device routine with the
                  specified level of parallelism
asynchronism      async()/wait: execute kernels and           nowait/depend: manages
                  transfers asynchronously, with              asynchronism and task
                  explicit synchronization                    dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Offloaded execution - Directive serial
The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of vector length one is generated. This is equivalent to a parallel region with the parameters num_gangs(1), num_workers(1), vector_length(1).

Data: default behavior
• Arrays present in the serial region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo

   do j=1, n
      ...
   enddo
enddo
!$ACC end serial
Offloaded execution - Directive serial
!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo

   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the generated parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels
!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! gang-redundant
!$ACC loop     ! work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive
!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and what their vector size is. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).
Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size
program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program
exemples/parametres_paral.f90
Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77
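The same construct can be sketched in C. This is a minimal illustration, not from the course: the function name `fill_sum` is invented, and without an OpenACC compiler the pragma is simply ignored, so the loop runs sequentially and produces the same values.

```c
#include <assert.h>

/* Sketch: fix the launch configuration of a parallel region with
   num_gangs/num_workers/vector_length, as in the parametres_paral.f90
   example. With the pragma ignored, the loop runs sequentially. */
int fill_sum(int n)
{
    int a[1000];
    long sum = 0;
    #pragma acc parallel num_gangs(10) num_workers(1) vector_length(128)
    {
        #pragma acc loop
        for (int i = 0; i < n; ++i)
            a[i] = i + 1;          /* mirrors a(i) = i in the Fortran listing */
    }
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return (int)sum;
}
```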
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.
Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies
Clause        | Work sharing (loop iterations) | Activation of new threads
loop gang     | X                              |
loop worker   | X                              | X
loop vector   | X                              | X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)
Notes
• You might see the directive do instead of loop, kept for compatibility
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77
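The collapse clause above can be sketched in C. This example is illustrative (the function name and sizes are invented); the two tightly nested loops are merged into one iteration space that can be shared among gangs, and the sequential fallback gives the same result.

```c
#include <assert.h>

/* Sketch of collapse(2): the 100x100 nest is treated as a single
   10000-iteration loop for work sharing. */
int init_matrix(void)
{
    static int a[100][100];
    int total = 0;
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < 100; ++i)
        for (int j = 0; j < 100; ++j)
            a[i][j] = 1;
    for (int i = 0; i < 100; ++i)
        for (int j = 0; j < 100; ++j)
            total += a[i][j];
    return total;
}
```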
Work sharing - Loops
!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel
exemples/gol_parallel_loop.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77
Work sharing - Loops
serial

!$acc serial
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end serial
exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx
Execution GRWSVS
10 $acc p a r a l l e l num_gangs (10)do i =1 nx
A( i )=B( i )lowast iend do $acc end p a r a l l e l
exemplesloop_GRWSVSf90
bull Redundant execution bygang leaders
bull Active threads 10bull Number of operations 10
nx
Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx
Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWSVP.f90

• Iterations are shared among the vector threads of one worker in each gang
• Active threads: 10 x vector length
• Number of operations: nx
Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx
Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx
Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx
Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel
exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel
exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)
!$acc parallel
!$acc loop worker vector
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel
exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.
For example, with the kernels directive:
program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
end program fused
exemples/fused.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
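The merged form can be sketched in C as well. The function below is illustrative (its name and sizes are invented): region opening and work sharing are expressed in a single parallel loop line, and the sequential fallback computes the same values.

```c
#include <assert.h>

/* Sketch of the merged "parallel loop" construct with data clauses
   attached directly to it. */
int doubled_last(int n)
{
    int a[100], b[100];
    for (int i = 0; i < n; ++i) b[i] = i + 1;
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 2;           /* mirrors a(i) = b(i)*2 */
    return a[n - 1];
}
```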
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in Table 1 is applied to get the final value of the variable.
Operations

Operation | Effect       | Language(s)
+         | Sum          | Fortran, C(++)
*         | Product      | Fortran, C(++)
max       | Maximum      | Fortran, C(++)
min       | Minimum      | Fortran, C(++)
&         | Bitwise and  | C(++)
|         | Bitwise or   | C(++)
&&        | Logical and  | C(++)
||        | Logical or   | C(++)
iand      | Bitwise and  | Fortran
ior       | Bitwise or   | Fortran
ieor      | Bitwise xor  | Fortran
.and.     | Logical and  | Fortran
.or.      | Logical or   | Fortran

Table 1: Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
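A minimal C sketch of a sum reduction, matching the + entry of Table 1 (the function name is invented). Each thread would get a private copy of `sum`, combined at the end of the loop; the sequential fallback gives the same result.

```c
#include <assert.h>

/* Sketch of reduction(+:sum) on a scalar, as required by the
   restrictions above. */
long sum_1_to_n(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```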
Parallelism - Reductions example
!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel
exemples/gol_parallel_loop_reduction.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged (coalesced). This optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

Example with coalescing / Example without coalescing
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop_coalescing.f90
Time (ms): without coalescing 43.9 | with coalescing 1.6
The memory-coalescing version is 27 times faster.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
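The rule transposes to C with one twist worth showing in code: C arrays are row-major, so the vector loop should run over the last index, the opposite of the column-major Fortran example. This sketch is illustrative (names and sizes invented); with the pragmas ignored, it simply checks that the fill is correct.

```c
#include <assert.h>

/* Sketch of a coalesced fill in C: consecutive vector threads (loop over j,
   the LAST index) touch consecutive addresses of a row-major array. */
double fill_coalesced(int n)
{
    static double a[64][64];
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)    /* consecutive j -> consecutive addresses */
            a[i][j] = 1.14;
    }
    return a[n - 1][n - 1];
}
```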
Parallelism - Memory accesses and coalescing
Without coalescing: gang(i), vector(j) — consecutive threads of a vector access elements one row apart (strided access).
With coalescing: gang(j), vector(i) — consecutive threads access consecutive elements of a column (contiguous in Fortran); or gang-vector(i), seq(j) — each thread walks one row sequentially, so at each step consecutive threads access consecutive elements.
[Diagram: mapping of threads T0-T7 onto array elements for the three strategies.]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector). Without !$ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine
exemples/routine_wrong.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector). Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine
exemples/routine.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
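The same pattern can be sketched in C (names and sizes here are invented for illustration): the called function is marked with routine seq so a sequential device version would be generated; with the pragmas ignored, the code runs as plain C.

```c
#include <assert.h>

/* Sketch of "acc routine seq": fill_row would be compiled for the device
   with sequential parallelism inside it. */
#pragma acc routine seq
static void fill_row(int *row, int len, int value)
{
    for (int j = 0; j < len; ++j)
        row[j] = value;
}

int fill_all(int n)
{
    static int array[32][32];
    #pragma acc parallel loop copyout(array[0:32][0:32])
    for (int i = 0; i < n; ++i)
        fill_row(array[i], n, 2);   /* mirrors call fill(array, s, i) */
    return array[0][n - 1];
}
```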
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H: host, D: device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
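The movement clauses combine naturally in one region; the sketch below is illustrative (the function name is invented): `b` is only copied in, `a` only copied out, and `tmp` exists only on the device via create. The sequential fallback, with the pragmas ignored, produces the same values.

```c
#include <assert.h>

/* Sketch combining copyin/copyout/create and present on a data region. */
int saxpy_like(int n)
{
    int a[100], b[100], tmp[100];
    for (int i = 0; i < n; ++i) b[i] = i;
    #pragma acc data copyin(b[0:n]) copyout(a[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop present(a, b, tmp)
        for (int i = 0; i < n; ++i) {
            tmp[i] = 2 * b[i];     /* device-only scratch */
            a[i]   = tmp[i] + 1;
        }
    }
    return a[n - 1];
}
```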
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo
exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;
exemples/arrayshape.c
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para
exemples/parallel_data.f90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para
exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data
exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Optimal? No! You have to move the beginning of the data region to before the first loop.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version: two parallel regions, each with its own implicit data region. A (copyin), B (copyout), C (copy) and D (create) are handled again at each region.
• Transfers: 8
• Allocations: 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers: let's add a data region enclosing both parallel regions.
Data on device
Optimize data transfers: with a data region enclosing both parallel regions, A (copyin), B (copyout) and C (copy) are now transferred at entry and exit of the data region, and D is created there.
• Transfers: 4
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfers: the data clauses check for presence, so inside the data region the parallel regions see the variables as present. To make the code clearer, you can use the present clause. To make sure the updates are actually done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)
exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)
exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
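A C sketch of update device (illustrative names; the function is invented): inside a data region, a host-side change to `b` must be pushed to the device before the next kernel reads it. With the pragmas ignored, the code is still valid sequential C and yields the same result.

```c
#include <assert.h>

/* Sketch of "update device": push the host-side change to b (H->D)
   before the kernel that consumes it. */
int update_demo(int n)
{
    int a[100], b[100];
    for (int i = 0; i < n; ++i) b[i] = 1;
    #pragma acc data copyout(a[0:n]) copyin(b[0:n])
    {
        for (int i = 0; i < n; ++i) b[i] = 2;   /* modified on the host */
        #pragma acc update device(b[0:n])       /* make the device copy coherent */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) a[i] = b[i];
    }
    return a[0];
}
```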
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at a different place from where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata
exemples/enter_data.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
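The unstructured lifetime can be sketched in C (names invented for illustration): enter data creates the device copy before the compute routine, and exit data closes it afterwards. Unlike the Fortran slide, this sketch uses copyout on exit instead of delete, so the result is also visible on the host and the sequential fallback behaves identically.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of enter data / exit data: device allocation and release happen
   far from the routine that uses the data. */
static double *vec;
static int s = 10000;

static void compute(void)
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 1.2;
}

double enterdata_demo(void)
{
    double last;
    vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])
    compute();
    #pragma acc exit data copyout(vec[0:s])  /* copyout so the host sees it */
    last = vec[s - 1];
    free(vec);
    return last;
}
```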
Data on device - Global data regions declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module
exemples/declare.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host / Device
[Time line diagram: the host opens a data region with copyin(a,c) create(b) copyout(d); kernels on the device modify a, b, c and d there; update self(b) copies b D→H during the region; when the data region ends, only d is copied back (copyout), the other host values are left unchanged.]
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)
exemples/update.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
Synchronous: Kernel 1, then data transfer, then Kernel 2.
Kernel 1 and transfer async: Kernel 1 overlaps the data transfer, then Kernel 2.
Everything is async: Kernel 1, the data transfer and Kernel 2 all overlap.
! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
  c(i) = 11
enddo
exemples/async_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
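The same three independent operations can be sketched in C (illustrative names; in real OpenACC, `b` would have to be present on the device, e.g. via an enclosing data region, for the update to be legal — this is only a sketch, and the sequential fallback ignores the pragmas entirely).

```c
#include <assert.h>

/* Sketch of async queues: each independent operation goes on its own
   execution thread (queue); wait joins them all. */
int async_demo(int n)
{
    static int a[100], b[100], c[100];
    for (int i = 0; i < n; ++i) b[i] = i;    /* b initialized on host */

    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11;

    #pragma acc wait                          /* join queues 1, 2 and 3 */
    return a[0] + c[0];
}
```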
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution-thread-number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
Kernel 1 overlaps the data transfer; Kernel 2 waits for the transfer.
! b is initialized on host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 1
enddo
exemples/async_wait_slides.f90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
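A C sketch of the wait clause (illustrative names; as above, `b` would need to be present on the device for the updates to be legal in real OpenACC): the kernel consuming `b` waits for queue 2, where its transfer was posted.

```c
#include <assert.h>

/* Sketch of the wait clause: the kernel depends on the async transfer
   posted on queue 2, so it carries wait(2). */
int wait_demo(int n)
{
    static int b[100];
    for (int i = 0; i < n; ++i) b[i] = i;    /* b initialized on host */

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2)         /* consume b only after queue 2 */
    for (int i = 0; i < n; ++i) b[i] = b[i] + 1;

    #pragma acc update self(b[0:n])
    return b[n - 1];
}
```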
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition:
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data
exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
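The optimized structure (data region outside the generation loop, reduction for the cell count) can be condensed into a small, testable C sketch. Everything here is illustrative: the grid is tiny, borders are kept dead instead of reproducing the Fortran halo handling, and the initial pattern is a blinker, whose population stays at 3 cells at every generation.

```c
#include <assert.h>

#define NR 8
#define NC 8

/* Sketch of one optimized Game of Life loop: data region outside the
   generation loop, one parallel loop per phase, reduction for the count. */
int life_cells(int generations)
{
    static int world[NR + 2][NC + 2], oworld[NR + 2][NC + 2];
    int cells = 0;
    world[4][4] = world[4][5] = world[4][6] = 1;   /* a blinker */

    #pragma acc data copy(world, oworld)
    for (int g = 0; g < generations; ++g) {
        cells = 0;
        #pragma acc parallel loop
        for (int r = 1; r <= NR; ++r)
            for (int c = 1; c <= NC; ++c)
                oworld[r][c] = world[r][c];
        #pragma acc parallel loop
        for (int r = 1; r <= NR; ++r)
            for (int c = 1; c <= NC; ++c) {
                int neigh = oworld[r-1][c-1] + oworld[r][c-1] + oworld[r+1][c-1]
                          + oworld[r-1][c]                    + oworld[r+1][c]
                          + oworld[r-1][c+1] + oworld[r][c+1] + oworld[r+1][c+1];
                if (oworld[r][c] == 1 && (neigh < 2 || neigh > 3))
                    world[r][c] = 0;
                else if (neigh == 3)
                    world[r][c] = 1;
            }
        #pragma acc parallel loop reduction(+:cells)
        for (int r = 1; r <= NR; ++r)
            for (int c = 1; c <= NC; ++c)
                cells += world[r][c];
    }
    return cells;
}
```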
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) | TEAMS (creates teams of threads which execute redundantly)
• SERIAL (only 1 thread executes the code) | TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) | X (does not exist in OpenMP)

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) | map(to:)
• copyout() (copy D→H when leaving the construct) | map(from:)
• copy() (copy H→D when entering and D→H when leaving) | map(tofrom:)
• create() (created on device, no transfers) | map(alloc:)
• present(), delete() | —
Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() (add data to the global data region) | —
• data / end data (inside the same programming unit) | TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) | TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) | UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) | UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) | X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) | distribute (share iterations among TEAMS)
Features | OpenACC 2.7 | OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism) | —

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, explicit synchronization) | nowait / depend (manage asynchronism and task dependencies)
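As a small illustration of the rows of this table (a sketch added for this document, not part of the original slides; the array a, its size n and the scalar s are assumed to be declared), the same reduction loop can be written in both models:

```fortran
! OpenACC 2.7: combined compute construct + data and reduction clauses
!$acc parallel loop reduction(+:s) copyin(a(1:n))
do i = 1, n
   s = s + a(i)
enddo

! OpenMP 5.0: closest translation following the table
!$omp target teams distribute parallel do reduction(+:s) &
!$omp&   map(to: a(1:n)) map(tofrom: s)
do i = 1, n
   s = s + a(i)
enddo
```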
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

   !$acc parallel
   !$acc loop gang worker vector
   do i=1,nx
      A(i)=1.14_8*i
   end do
   !$acc end parallel

exemples/loop1D.f90

   !$acc parallel
   !$acc loop gang worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop2D.f90

OpenMP

   !$OMP TARGET
   !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
   do i=1,nx
      A(i)=1.14_8*i
   end do
   !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
   !$OMP END TARGET

exemples/loop1DOMP.f90

   !$OMP TARGET TEAMS DISTRIBUTE
   do j=1,nx
      !$OMP PARALLEL DO SIMD
      do i=1,nx
         A(i,j)=1.14_8
      end do
      !$OMP END PARALLEL DO SIMD
   end do
   !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

   !$acc parallel
   !$acc loop gang
   do k=1,nx
      !$acc loop worker
      do j=1,nx
         !$acc loop vector
         do i=1,nx
            A(i,j,k)=1.14_8
         end do
      end do
   end do
   !$acc end parallel

exemples/loop3D.f90

OpenMP

   !$OMP TARGET TEAMS DISTRIBUTE
   do k=1,nx
      !$OMP PARALLEL DO
      do j=1,nx
         !$OMP SIMD
         do i=1,nx
            A(i,j,k)=1.14_8
         end do
         !$OMP END SIMD
      end do
      !$OMP END PARALLEL DO
   end do
   !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
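The same triple nest can also be flattened with a collapse clause instead of mapping one loop per level of parallelism (a sketch added for illustration, not from the original slides; nx and A are assumed to be declared as above):

```fortran
!$acc parallel
!$acc loop gang worker vector collapse(3)
do k=1,nx
   do j=1,nx
      do i=1,nx
         ! the nx**3 iterations form a single iteration space
         ! shared over the whole grid of threads
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel
```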
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Offloaded execution - Directive serial

   !$ACC serial
   do g=1, generations
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            endif
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation", g, ":", cells
   enddo
   !$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s
Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded onto the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers, and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

   !$ACC kernels
   do i=1,n
      do j=1,n
         ...
      enddo
      do j=1,n
         ...
      enddo
   enddo
   !$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels

   !$ACC kernels
   do g=1, generations
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            endif
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation", g, ":", cells
   enddo
   !$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

   !$ACC parallel
   a = 2          ! Gang-redundant
   !$ACC loop     ! Work sharing
   do i = 1, n
      ...
   enddo
   !$ACC end parallel

Important notes
• The number of gangs, the number of workers and the vector size are constant inside the region.
Execution offloading - parallel directive

   !$ACC parallel
   do g=1, generations
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            endif
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation", g, ":", cells
   enddo
   !$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and what their vector size is. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

   program param
      integer :: a(1000)
      integer :: i
      !$ACC parallel num_gangs(10) num_workers(1) &
      !$ACC&   vector_length(128)
      print *, "Hello! I am a gang"
      do i=1,1000
         a(i) = i
      enddo
      !$ACC end parallel
   end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause      | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.
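A small sketch combining these clauses (added for illustration, not from the original slides; nx, A and the scalar tmp are assumed to be declared):

```fortran
!$acc parallel
!$acc loop gang worker vector collapse(2) private(tmp)
do j=1,nx
   do i=1,nx
      tmp = 1.14_8*i     ! tmp is private to each thread: no race
      A(i,j) = tmp
   enddo
enddo
!$acc end parallel
```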
Work sharing - Loops

   !$ACC parallel
   do g=1, generations
      !$ACC loop
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            endif
         enddo
      enddo
      cells = 0
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation", g, ":", cells
   enddo
   !$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

   !$acc serial
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

   !$acc parallel num_gangs(10)
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

   !$acc parallel num_gangs(10)
   !$acc loop gang
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

   !$acc parallel num_gangs(10)
   !$acc loop gang worker
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x #workers
• Number of operations: nx

Execution GP/WS/VP

   !$acc parallel num_gangs(10)
   !$acc loop gang vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

   !$acc parallel num_gangs(10)
   !$acc loop gang worker vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x #workers x vector length
• Number of operations: nx

Execution GR/WP/VS

   !$acc parallel num_gangs(10)
   !$acc loop worker
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x #workers
• Number of operations: 10 x nx

Execution GR/WS/VP

   !$acc parallel num_gangs(10)
   !$acc loop vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

   !$acc parallel num_gangs(10)
   !$acc loop worker vector
   do i=1,nx
      A(i)=B(i)*i
   end do
   !$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x #workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

   !$acc parallel
   !$acc loop gang
   do i=1,nx
      A(i)=1.0_8
   end do
   !$acc loop gang reduction(+:somme)
   do i=nx,1,-1
      somme=somme+A(i)
   end do
   !$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

   !$acc parallel
   !$acc loop worker vector
   do i=1,nx
      A(i)=1.0_8
   end do
   !$acc loop worker vector reduction(+:somme)
   do i=nx,1,-1
      somme=somme+A(i)
   end do
   !$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

   program fused
      integer :: a(1000), b(1000), i
      !$ACC kernels <clauses>
      !$ACC loop <clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
      !$ACC end kernels
      ! The construct above is equivalent to
      !$ACC kernels loop <kernels and loop clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
   end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations of the table below is executed to get the final value of the variable.

Reduction operators

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C/C++
*         | Product     | Fortran, C/C++
max       | Maximum     | Fortran, C/C++
min       | Minimum     | Fortran, C/C++
&         | Bitwise and | C/C++
|         | Bitwise or  | C/C++
&&        | Logical and | C/C++
||        | Logical or  | C/C++
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

   !$ACC parallel
   do g=1, generations
      !$ACC loop
      do r=1, rows
         do c=1, cols
            oworld(r,c) = world(r,c)
         enddo
      enddo
      do r=1, rows
         do c=1, cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
               world(r,c) = 0
            else if (neigh == 3) then
               world(r,c) = 1
            endif
         enddo
      enddo
      cells = 0
      !$ACC loop reduction(+:cells)
      do r=1, rows
         do c=1, cols
            cells = cells + world(r,c)
         enddo
      enddo
      print *, "Cells alive at generation", g, ":", cells
   enddo
   !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should access memory contiguously.

[Figures: an access pattern with coalescing and one without coalescing]
Parallelism - Memory accesses and coalescing

   !$acc parallel
   !$acc loop gang
   do i=1,nx
      !$acc loop vector
      do j=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop_nocoalescing.f90

   !$acc parallel
   !$acc loop gang
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop_coalescing.f90

   !$acc parallel
   !$acc loop gang vector
   do i=1,nx
      !$acc loop seq
      do j=1,nx
         A(i,j)=1.14_8
      end do
   end do
   !$acc end parallel

exemples/loop_coalescing.f90

           | Without coalescing | With coalescing
Time (ms)  | 439                | 16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 onto a 2D array. Without coalescing — gang(i), vector(j): each thread sweeps a row, so consecutive threads access strided memory locations. With coalescing — gang(j), vector(i): consecutive threads access consecutive elements of a column; gang-vector(i), seq(j) also yields contiguous accesses.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Without !$ACC routine:

   program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
      !$ACC parallel
      !$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
      !$ACC end parallel
      print *, array(1,1:10)
   contains
      subroutine fill(array, s, i)
         integer, intent(out) :: array(:,:)
         integer, intent(in)  :: s, i
         integer :: j
         do j=1,s
            array(i,j) = 2
         enddo
      end subroutine fill
   end program routine

exemples/routine_wrong.f90
Parallelism - Routines

Correct version:

   program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
      !$ACC parallel copyout(array)
      !$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
      !$ACC end parallel
      print *, array(1,1:10)
   contains
      subroutine fill(array, s, i)
         !$ACC routine seq
         integer, intent(out) :: array(:,:)
         integer, intent(in)  :: s, i
         integer :: j
         do j=1,s
            array(i,j) = 2
         enddo
      end subroutine fill
   end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: host; D: device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
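A small sketch combining these clauses (added for illustration, not from the original slides; n, b, c and tmp are assumed to be declared):

```fortran
!$ACC data copyin(b(1:n)) copyout(c(1:n)) create(tmp(1:n))
!$ACC parallel loop present(b, c, tmp)
do i=1,n
   tmp(i) = 2*b(i)       ! tmp exists only on the device, never transferred
   c(i)   = tmp(i) + 1   ! c is copied back D->H at "end data"
enddo
!$ACC end data
```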
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

   ! We copy a 2D array on the GPU for matrix a
   ! In Fortran we can omit the shape: copy(a)
   !$ACC parallel loop copy(a(1:1000,1:1000))
   do i=1,1000
      do j=1,1000
         a(i,j) = 0
      enddo
   enddo
   ! We copy out columns 100 to 199 included to the host
   !$ACC parallel loop copy(a(:,100:199))
   do i=1,1000
      do j=100,199
         a(i,j) = 42
      enddo
   enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

   // We copy the array a by giving the first element and the size of the array
   #pragma acc parallel loop copy(a[0:1000][0:1000])
   for (int i=0; i<1000; ++i)
      for (int j=0; j<1000; ++j)
         a[i][j] = 0;
   // Copy columns 99 to 198
   #pragma acc parallel loop copy(a[0:1000][99:100])
   for (int i=0; i<1000; ++i)
      for (int j=99; j<199; ++j)
         a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

   program para
      integer :: a(1000)
      integer :: i, l

      !$ACC parallel copyout(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = i
      enddo
      !$ACC end parallel
   end program para

exemples/parallel_data.f90

(The parallel region carries its own implicit data region.)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

   program para
      integer :: a(1000)
      integer :: i, l

      !$ACC parallel copyout(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = i
      enddo
      !$ACC end parallel

      do l=1,100
         !$ACC parallel copy(a(1:1000))
         !$ACC loop
         do i=1,1000
            a(i) = a(i) + 1
         enddo
         !$ACC end parallel
      enddo
   end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   !$ACC data copy(a(1:1000))
   do l=1,100
      !$ACC parallel
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
   !$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization loop is covered as well.
Data on device - Optimizing transfers

[Figure: two successive parallel regions (each with its implicit data region) use the variables A (copyin), B (copyout), C (copy) and D (create).]

Naive version: each parallel region performs its own transfers.
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device and update self).
• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

   !$ACC data copyout(a)
   !$ACC parallel loop
   do i=1,1000
      a(i) = 0
   enddo

   do j=1,42
      call random_number(test)
      rng = floor(test*100)
      !$ACC parallel loop copyin(rng) &
      !$ACC&   copyout(a)
      do i=1,1000
         a(i) = a(i) + rng
      enddo
   enddo
   ! print *, "before update self:", a(42)
   ! !$ACC update self(a(42:42))
   ! print *, "after update self:", a(42)
   !$ACC serial
   a(42) = 42
   !$ACC end serial
   print *, "before end data:", a(42)
   !$ACC end data
   print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

   !$ACC data copyout(a)
   !$ACC parallel loop
   do i=1,1000
      a(i) = 0
   enddo

   do j=1,42
      call random_number(test)
      rng = floor(test*100)
      !$ACC parallel loop copyin(rng) &
      !$ACC&   copyout(a)
      do i=1,1000
         a(i) = a(i) + rng
      enddo
   enddo
   print *, "before update self:", a(42)
   !$ACC update self(a(42:42))
   print *, "after update self:", a(42)
   !$ACC serial
   a(42) = 42
   !$ACC end serial
   print *, "before end data:", a(42)
   !$ACC end data
   print *, "after end data:", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included in a device clause are updated H→D.
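The slides give no listing for update device; a minimal sketch of its typical use (the array name b and size s are illustrative, not from the original example files):

```fortran
program update_device_demo
   implicit none
   integer, parameter :: s = 1000
   real*8  :: b(s)
   integer :: i
   b(:) = 0.
   !$ACC data copyin(b)
   ! b is modified on the host inside the data region...
   do i = 1, s
      b(i) = i
   enddo
   ! ...so the device copy is stale until we push it H->D
   !$ACC update device(b)
   !$ACC parallel loop
   do i = 1, s
      b(i) = b(i) + 1.
   enddo
   !$ACC end data
end program update_device_demo
```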
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at a different place in the code than where the data are used. This is especially useful with C++ constructors and destructors.

  program enterdata
     integer :: s = 10000
     real*8, allocatable, dimension(:) :: vec
     allocate(vec(s))
     !$ACC enter data create(vec(1:s))
     call compute
     !$ACC exit data delete(vec(1:s))
  contains
     subroutine compute
        integer :: i
        !$ACC parallel loop
        do i=1,s
           vec(i) = 1.2
        enddo
     end subroutine
  end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

  Zone         Scope
  Module       the program
  Function     the function
  Subroutine   the subroutine

  module declare
     integer :: rows = 1000
     integer :: cols = 1000
     integer :: generations = 100
     !$ACC declare copyin(rows, cols, generations)
  end module

exemples/declare.f90
Data on device - Time line

[Timeline diagram, host vs device: at entry of a region with copyin(a,c) create(b) copyout(d), a and c are copied H→D while b and d are only allocated on the device. The device then computes with its own copies while the host values stay unchanged. An update self(b) copies b D→H during the region. At exit of the data region, only d is copied back to the host (copyout(d)).]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses with kernels and parallel: if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

  ! Update the value of a located on host with the value on device
  !$ACC update self(a)
  ! Update the value of b located on device with the value on host
  !$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async" they overlap; with everything async, all three run concurrently.]

  ! b is initialized on host
  do i=1,s
     b(i) = i
  enddo
  !$ACC parallel loop async(1)
  do i=1,s
     a(i) = 42
  enddo
  !$ACC update device(b) async(2)
  !$ACC parallel loop async(3)
  do i=1,s
     c(i) = 11
  enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause (or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer before it starts.]

  ! b is initialized on host
  do i=1,s
     b(i) = i
  enddo
  !$ACC parallel loop async(1)
  do i=1,s
     a(i) = 42
  enddo
  !$ACC update device(b) async(2)
  !$ACC parallel loop wait(2)
  do i=1,s
     b(i) = b(i) + 11
  enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:
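The slide's listing is not recoverable from the transcript; a minimal sketch of a loop with a race condition that autocompare would flag (names a, somme and the size n are illustrative):

```fortran
program race
   implicit none
   integer, parameter :: n = 1000
   real*8  :: a(n)
   real*8  :: somme
   integer :: i
   somme = 0.
   !$ACC parallel loop
   do i = 1, n
      a(i) = 1.
   enddo
   ! Race condition: somme is updated concurrently by many threads
   ! because the reduction(+:somme) clause is missing, so the GPU
   ! result differs from the CPU result and autocompare stops here.
   !$ACC parallel loop
   do i = 1, n
      somme = somme + a(i)
   enddo
   print *, somme
end program race
```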
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

  do g=1,generations
     cells = 0
     !$ACC parallel
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     !$ACC end parallel
     !$ACC parallel
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     !$ACC end parallel
     !$ACC parallel
     do r=1,rows
        do c=1,cols
     ! ... (listing truncated on the slide)

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

  do g=1,generations
     cells = 0
     !$ACC parallel loop
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     !$ACC parallel loop
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     !$ACC parallel loop reduction(+:cells)
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

  cells = 0
  !$ACC data copy(oworld, world)
  do g=1,generations
     cells = 0
     !$ACC parallel loop
     do c=1,cols
        do r=1,rows
           oworld(r,c) = world(r,c)
        enddo
     enddo
     !$ACC parallel loop
     do c=1,cols
        do r=1,rows
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     !$ACC parallel loop reduction(+:cells)
     do c=1,cols
        do r=1,rows
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) | OpenMP: TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) | OpenMP: TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) | OpenMP: does not exist

Implicit transfer H→D or D→H
• scalar and non-scalar variables | OpenMP: scalar and non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct | OpenMP: map(to:)
• copyout(): copy D→H when leaving the construct | OpenMP: map(from:)
• copy(): copy H→D when entering and D→H when leaving | OpenMP: map(tofrom:)
• create(): created on device, no transfers | OpenMP: map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region
• declare(): add data to the global data region | OpenMP: no equivalent listed
• data / end data: opened and closed inside the same programming unit | OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA, inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | OpenMP: TARGET ENTER DATA / TARGET EXIT DATA, opening and closing anywhere in the code
• update self(): update D→H | OpenMP: UPDATE FROM, copy D→H inside a data region
• update device(): update H→D | OpenMP: UPDATE TO, copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of loops | OpenMP: does not exist
• loop gang/worker/vector: share iterations among gangs, workers and vectors | OpenMP: DISTRIBUTE, share iterations among TEAMS
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async() / wait: execute kernels and transfers asynchronously, with explicit synchronization | OpenMP: nowait / depend, manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

  !$acc parallel
  !$acc loop gang worker vector
  do i=1,nx
     A(i) = 1.14_8*i
  end do
  !$acc end parallel

exemples/loop1D.f90

  !$acc parallel
  !$acc loop gang worker
  do j=1,nx
     !$acc loop vector
     do i=1,nx
        A(i,j) = 1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop2D.f90

OpenMP

  !$OMP TARGET
  !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
  do i=1,nx
     A(i) = 1.14_8*i
  end do
  !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
  !$OMP END TARGET

exemples/loop1DOMP.f90

  !$OMP TARGET TEAMS DISTRIBUTE
  do j=1,nx
     !$OMP PARALLEL DO SIMD
     do i=1,nx
        A(i,j) = 1.14_8
     end do
     !$OMP END PARALLEL DO SIMD
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

  !$acc parallel
  !$acc loop gang
  do k=1,nx
     !$acc loop worker
     do j=1,nx
        !$acc loop vector
        do i=1,nx
           A(i,j,k) = 1.14_8
        end do
     end do
  end do
  !$acc end parallel

exemples/loop3D.f90

OpenMP

  !$OMP TARGET TEAMS DISTRIBUTE
  do k=1,nx
     !$OMP PARALLEL DO
     do j=1,nx
        !$OMP SIMD
        do i=1,nx
           A(i,j,k) = 1.14_8
        end do
        !$OMP END SIMD
     end do
     !$OMP END PARALLEL DO
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Execution offloading - Kernels: in the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers, and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

  !$ACC kernels
  do i=1,n
     do j=1,n
        ...
     enddo
     do j=1,n
        ...
     enddo
  enddo
  !$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.
Execution offloading - Kernels

  !$ACC kernels
  do g=1,generations
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     cells = 0
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s
Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions in the region (gang-redundant mode). Only parallel loop nests annotated with a loop directive have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

  !$ACC parallel
  a = 2            ! Gang-redundant
  !$ACC loop       ! Work sharing
  do i = 1, n
     ...
  enddo
  !$ACC end parallel

Important notes
• The number of gangs, the number of workers, and the vector size are constant inside the region.
Execution offloading - parallel directive

  !$ACC parallel
  do g=1,generations
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     cells = 0
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is used to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

  program param
     integer :: a(1000)
     integer :: i
     !$ACC parallel num_gangs(10) num_workers(1) &
     !$ACC&         vector_length(128)
     print *, "Hello! I am a gang"
     do i=1,1000
        a(i) = i
     enddo
     !$ACC end parallel
  end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

                Work sharing (loop iterations)   Activation of new threads
  loop gang     X
  loop worker   X                                X
  loop vector   X                                X
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the do directive, kept for compatibility.
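A small sketch combining these clauses (the array a, the scalar tmp, and the bound n are illustrative, not from the course's example files):

```fortran
program loop_clauses
   implicit none
   integer, parameter :: n = 100
   real*8  :: a(n,n), tmp
   real*8  :: total
   integer :: i, j
   total = 0.
   ! collapse(2) merges the two tightly nested loops into a single
   ! iteration space; tmp is privatized so each thread has its own
   ! copy, and reduction(+:total) produces one correct sum.
   !$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
   do j = 1, n
      do i = 1, n
         tmp = 1.14_8 * i
         a(i,j) = tmp
         total = total + tmp
      enddo
   enddo
   print *, total
end program loop_clauses
```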
Work sharing - Loops

  !$ACC parallel
  do g=1,generations
     !$ACC loop
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     cells = 0
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

  !$acc serial
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

  !$acc parallel num_gangs(10)
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

  !$acc parallel num_gangs(10)
  !$acc loop gang
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

  !$acc parallel num_gangs(10)
  !$acc loop gang worker
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GP/WS/VP

  !$acc parallel num_gangs(10)
  !$acc loop gang vector
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the single worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

  !$acc parallel num_gangs(10)
  !$acc loop gang worker vector
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the whole grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

  !$acc parallel num_gangs(10)
  !$acc loop worker
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

  !$acc parallel num_gangs(10)
  !$acc loop vector
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

  !$acc parallel num_gangs(10)
  !$acc loop worker vector
  do i=1,nx
     A(i) = B(i)*i
  end do
  !$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

  !$acc parallel
  !$acc loop gang
  do i=1,nx
     A(i) = 1.0_8
  end do
  !$acc loop gang reduction(+:somme)
  do i=nx,1,-1
     somme = somme + A(i)
  end do
  !$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

  !$acc parallel
  !$acc loop worker vector
  do i=1,nx
     A(i) = 1.0_8
  end do
  !$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
     somme = somme + A(i)
  end do
  !$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

  program fused
     integer :: a(1000), b(1000), i
     !$ACC kernels <clauses>
     !$ACC loop <clauses>
     do i=1,1000
        a(i) = b(i)*2
     enddo
     !$ACC end kernels
     ! The construct above is equivalent to
     !$ACC kernels loop <kernels and loop clauses>
     do i=1,1000
        a(i) = b(i)*2
     enddo
  end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is executed to get the final value of the variable.

Reduction operators

  Operation   Effect        Language(s)
  +           Sum           Fortran, C(++)
  *           Product       Fortran, C(++)
  max         Maximum       Fortran, C(++)
  min         Minimum       Fortran, C(++)
  &           Bitwise and   C(++)
  |           Bitwise or    C(++)
  &&          Logical and   C(++)
  ||          Logical or    C(++)

  Operation   Effect        Language(s)
  iand        Bitwise and   Fortran
  ior         Bitwise or    Fortran
  ieor        Bitwise xor   Fortran
  .and.       Logical and   Fortran
  .or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions: example

  !$ACC parallel
  do g=1,generations
     !$ACC loop
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                   oworld(r-1,c)                   + oworld(r+1,c)   + &
                   oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           endif
        enddo
     enddo
     cells = 0
     !$ACC loop reduction(+:cells)
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i accesses memory location n, thread i+1 accesses memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: memory access patterns with and without coalescing.]
Parallelism - Memory accesses and coalescing

  ! Without coalescing
  !$acc parallel
  !$acc loop gang
  do i=1,nx
     !$acc loop vector
     do j=1,nx
        A(i,j) = 1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop_nocoalescing.f90

  ! With coalescing
  !$acc parallel
  !$acc loop gang
  do j=1,nx
     !$acc loop vector
     do i=1,nx
        A(i,j) = 1.14_8
     end do
  end do
  !$acc end parallel

  !$acc parallel
  !$acc loop gang vector
  do i=1,nx
     !$acc loop seq
     do j=1,nx
        A(i,j) = 1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop_coalescing.f90

              Without coalescing   With coalescing
  Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure, repeated over several slides: step-by-step thread-to-element mapping for the three variants. Without coalescing, gang(i)/vector(j): threads T0..T7 of a worker access elements spread across a row of A, i.e. non-contiguous memory locations. With coalescing, gang(j)/vector(i): threads T0..T7 access consecutive elements of a column, so the accesses can be merged. With gang-vector(i)/seq(j): each thread walks its row sequentially while consecutive threads access contiguous elements.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without !$ACC routine:

  program routine
     integer :: s = 10000
     integer, allocatable :: array(:,:)
     allocate(array(s,s))
     !$ACC parallel
     !$ACC loop
     do i=1,s
        call fill(array(:,:), s, i)
     enddo
     !$ACC end parallel
     print *, array(1,10)
  contains
     subroutine fill(array, s, i)
        integer, intent(out) :: array(:,:)
        integer, intent(in)  :: s, i
        integer :: j
        do j=1,s
           array(i,j) = 2
        enddo
     end subroutine fill
  end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

  program routine
     integer :: s = 10000
     integer, allocatable :: array(:,:)
     allocate(array(s,s))
     !$ACC parallel copyout(array)
     !$ACC loop
     do i=1,s
        call fill(array(:,:), s, i)
     enddo
     !$ACC end parallel
     print *, array(1,10)
  contains
     subroutine fill(array, s, i)
        !$ACC routine seq
        integer, intent(out) :: array(:,:)
        integer, intent(in)  :: s, i
        integer :: j
        do j=1,s
           array(i,j) = 2
        enddo
     end subroutine fill
  end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
- serial
- parallel
- kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array

Data movement
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses
- no_create
- deviceptr
- attach
- detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.
! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.
// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;

// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
- In Fortran, the last index of an assumed-size dummy array must be specified
- In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
- The shape must be specified when using a slice
- If the first index is omitted, it defaults to the language default (C/C++: 0, Fortran: 1)
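As a small sketch of that last note (array a is hypothetical), a Fortran slice may omit its first index, which then defaults to the lower bound 1:

```fortran
! The omitted first index defaults to 1 in Fortran (0 in C/C++),
! so a(:1000, 100:199) describes the same sub-array as a(1:1000, 100:199)
!$ACC parallel loop copy(a(:1000, 100:199))
do i = 1, 1000
  do j = 100, 199
    a(i, j) = 0
  enddo
enddo
```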
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes
- Compute region reached 100 times
- Data region entered and exited 100 times (so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
- Compute region reached 100 times
- Data region entered once (so potentially only 2 data transfers)

Optimal!
Data on device - Local data regions

Optimal? Not quite: in the code above, the data region starts after the first parallel loop, so a is still transferred separately by that region's copyout clause. You have to move the beginning of the data region before the first loop.
Data on device
Naive version: two parallel regions, each with its own implicit data region.

Variable   Parallel region 1   Parallel region 2
A          copyin              copyin
B          copyout             copyout
C          copy                copy
D          create              create

- Transfers: 8
- Allocations: 2
Data on device
Optimize data transfers: same two parallel regions as before.

Variable   Parallel region 1   Parallel region 2
A          copyin              copyin
B          copyout             copyout
C          copy                copy
D          create              create

Let's add a data region enclosing both parallel regions.
Data on device
Optimize data transfers: with an enclosing data region.

Variable   Data region   Parallel region 1   Parallel region 2
A          copyin        copyin              copyin
B          copyout       copyout             copyout
C          copy          copy                copy
D          create        create              create

A, B and C are now transferred only at entry and exit of the data region:
- Transfers: 4
- Allocations: 1
Data on device
Optimize data transfers: using present and update.

Variable   Data region   Parallel region 1   Parallel region 2
A          copyin        present             present + update device
B          copyout       present             present
C          copy          present             present + update self
D          create        present             present

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
- Transfers: 6
- Allocations: 1
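A minimal Fortran sketch of this last scheme, with hypothetical arrays a, b, c and d:

```fortran
!$ACC data copyin(a) copyout(b) copy(c) create(d)
  !$ACC parallel present(a, b, c, d)
  ! ... first compute region ...
  !$ACC end parallel
  !$ACC update device(a)  ! refresh a on the device (H->D)
  !$ACC update self(c)    ! refresh c on the host (D->H)
  !$ACC parallel present(a, b, c, d)
  ! ... second compute region ...
  !$ACC end parallel
!$ACC end data
```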
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.
!$ACC data copyout(a)
  !$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  ! print *, "before update self ", a(42)
  ! !$ACC update self(a(42:42))
  ! print *, "after update self ", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90
Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.
!$ACC data copyout(a)
  !$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  print *, "before update self ", a(42)
  !$ACC update self(a(42:42))
  print *, "after update self ", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90
With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
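The slide gives no listing for this case; here is a minimal sketch (hypothetical array b, mirroring the update self example above):

```fortran
!$ACC data copyin(b)
  ! b is modified on the host
  do i = 1, 1000
    b(i) = i
  enddo
  !$ACC update device(b)  ! push the new host values to the device (H->D)
  !$ACC parallel loop
  do i = 1, 1000
    b(i) = b(i) + 1
  enddo
!$ACC end data
```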
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Timeline

(Figure: timeline of the host and device copies of variables a, b, c and d for a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D, and b and d are allocated on the device. An update self(b) during the region copies b D→H. At exit, only d is copied back D→H through copyout; the host copies of a, b and c keep their last host-side values.)
Data on device - Important notes
- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
(Figure: execution timelines. Synchronous: kernel 1, the data transfer and kernel 2 run one after the other. With async on kernel 1 and the transfer, they overlap. With async everywhere, kernel 1, the transfer and kernel 2 may all run concurrently.)
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
- The argument of wait or wait() is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
(Figure: kernel 1 and the data transfer run asynchronously; kernel 2 waits for the transfer on thread 2 to complete.)
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
- To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare).
- Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90
Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90
Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90
Info
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
Features / OpenACC 2.7 / OpenMP 5.0

GPU offloading
- PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
- SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
- KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
- scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
- clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
- copyin(): copy H→D when entering the construct ↔ map(to:)
- copyout(): copy D→H when leaving the construct ↔ map(from:)
- copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
- create(): created on device, no transfers ↔ map(alloc:)
- present(), delete()
Features / OpenACC 2.7 / OpenMP 5.0

Data region
- declare(): add data to the global data region
- data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
- enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
- update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
- update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
- kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
- loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS
Features / OpenACC 2.7 / OpenMP 5.0

- routines: routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism
- asynchronism: async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts
- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Execution offloading - Kernels
!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90
Test
- Size: 1000x1000
- Generations: 100
- Execution time: 19.51 s
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive have their iterations spread among gangs.

Data default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.
!$ACC parallel
a = 2        ! gang-redundant
!$ACC loop   ! work sharing
do i = 1, n
enddo
!$ACC end parallel
Important notes
- The number of gangs, workers and the vector size are constant inside the region.
Execution offloading - parallel directive
!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90
Test
- Size: 1000x1000
- Generations: 100
- Execution time: 1204.67 s (noautopar)
- Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and what their vector size is. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size
program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90
Important notes
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops
Clauses
- Merge tightly nested loops: collapse(number of loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable list)
- Reduction: reduction(operation:variable list)

Notes
- You might see the directive do, kept for compatibility.
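As an illustrative sketch (arrays a and scalars tmp, total are hypothetical), these clauses can be combined on a single directive:

```fortran
! collapse(2) merges the two nested loops into one iteration space;
! tmp is privatized per thread, total is reduced across all threads
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do j = 1, m
  do i = 1, n
    tmp = a(i,j) * 2
    total = total + tmp
  enddo
enddo
```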
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x workers
- Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the active worker of all gangs
- Active threads: 10 x vector length
- Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x workers x vector length
- Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 x workers
- Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vector length
- Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations and its grid of threads distributes the work
- Active threads: 10 x workers x vector length
- Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

- Result: somme = 97845213 (wrong value)
!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

- Result: somme = 100000000 (right value)
Pierre-François Lavallée, Thibaut Véry — 41/77
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

  program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
     a(i) = b(i)*2
  enddo
  !$ACC end kernels

  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
     a(i) = b(i)*2
  enddo
  end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is applied to obtain the final value of the variable.

Operations

  Operation   Effect        Language(s)
  +           Sum           Fortran, C(++)
  *           Product       Fortran, C(++)
  max         Maximum       Fortran, C(++)
  min         Minimum       Fortran, C(++)
  &           Bitwise and   C(++)
  |           Bitwise or    C(++)
  &&          Logical and   C(++)
  ||          Logical or    C(++)
  iand        Bitwise and   Fortran
  ior         Bitwise or    Fortran
  ieor        Bitwise xor   Fortran
  .and.       Logical and   Fortran
  .or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
  !$ACC parallel
  do g=1,generations
  !$ACC loop
  do r=1,rows
     do c=1,cols
        oworld(r,c) = world(r,c)
     enddo
  enddo
  do r=1,rows
     do c=1,cols
        neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                oworld(r-1,c)              +oworld(r+1,c)+ &
                oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
           world(r,c) = 0
        else if (neigh == 3) then
           world(r,c) = 1
        end if
     enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
     do c=1,cols
        cells = cells + world(r,c)
     enddo
  enddo
  print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with coalescing / example without coalescing:
Parallelism - Memory accesses and coalescing
  !$acc parallel
  !$acc loop gang
  do i=1,nx
  !$acc loop vector
     do j=1,nx
        A(i,j)=1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop_nocoalescing.f90

  !$acc parallel
  !$acc loop gang
  do j=1,nx
  !$acc loop vector
     do i=1,nx
        A(i,j)=1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop_coalescing.f90

  !$acc parallel
  !$acc loop gang vector
  do i=1,nx
  !$acc loop seq
     do j=1,nx
        A(i,j)=1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop_coalescing.f90

              Without coalescing   With coalescing
  Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 onto array elements for the three loop configurations — without coalescing: gang(i), vector(j); with coalescing: gang(j), vector(i), or gang-vector(i), seq(j)]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it must be declared with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Without !$ACC routine:

  program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
     call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
  contains
  subroutine fill(array, s, i)
     integer, intent(out) :: array(:,:)
     integer, intent(in)  :: s, i
     integer :: j
     do j=1,s
        array(i,j) = 2
     enddo
  end subroutine fill
  end program routine

exemples/routine_wrong.f90
Parallelism - Routines
Correct version:

  program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
     call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
  contains
  subroutine fill(array, s, i)
  !$ACC routine seq
     integer, intent(out) :: array(:,:)
     integer, intent(in)  :: s, i
     integer :: j
     do j=1,s
        array(i,j) = 2
     enddo
  end subroutine fill
  end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called and lasts for the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

  ! We copy a 2D array on the GPU for matrix a.
  ! In Fortran we can omit the shape: copy(a)
  !$ACC parallel loop copy(a(1:1000,1:1000))
  do i=1,1000
     do j=1,1000
        a(i,j) = 0
     enddo
  enddo
  ! We copy out columns 100 to 199 included to the host
  !$ACC parallel loop copy(a(:,100:199))
  do i=1,1000
     do j=100,199
        a(i,j) = 42
     enddo
  enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

  // We copy the array a by giving the first element and the size of the array
  #pragma acc parallel loop copy(a[0:1000][0:1000])
  for (int i=0; i<1000; ++i)
     for (int j=0; j<1000; ++j)
        a[i][j] = 0;
  // Copy columns 99 to 198
  #pragma acc parallel loop copy(a[0:1000][99:100])
  for (int i=0; i<1000; ++i)
     for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language default (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

  program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! opens the data region
  !$ACC loop                          ! parallel region
  do i=1,1000
     a(i) = i
  enddo
  !$ACC end parallel
  end program para

exemples/parallel_data.f90
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

  program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
     a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
  !$ACC parallel copy(a(1:1000))
  !$ACC loop
  do i=1,1000
     a(i) = a(i) + 1
  enddo
  !$ACC end parallel
  enddo
  end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables in the clauses visible on the device. Use the data directive.

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
     a(i) = i
  enddo
  !$ACC end parallel

  !$ACC data copy(a(1:1000))
  do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
     a(i) = a(i) + 1
  enddo
  !$ACC end parallel
  enddo
  !$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal? No: the beginning of the data region should be moved before the first loop, which still performs its own transfers.
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region handling A (copyin), B (copyout), C (copy) and D (create) at its boundaries]
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers
[Figure: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled twice, once per region]

Let's add a data region.
Data on device

Optimize data transfers
[Figure: an enclosing data region now carries the clauses A (copyin), B (copyout), C (copy) and D (create); the parallel regions inherit them]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device

Optimize data transfers
[Figure: the data region keeps A (copyin), B (copyout), C (copy) and D (create); the parallel regions declare them present, with update device on A and update self on C]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

  !$ACC data copyout(a)
  !$ACC parallel loop
  do i=1,1000
     a(i) = 0
  enddo
  do j=1,42
     call random_number(test)
     rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
     do i=1,1000
        a(i) = a(i) + rng
     enddo
  enddo
  ! print *, "before update self", a(42)
  ! $ACC update self(a(42:42))
  ! print *, "after update self", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data", a(42)
  !$ACC end data
  print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.
Data on device - update self

With update: the same example with the update directive active.

  !$ACC data copyout(a)
  !$ACC parallel loop
  do i=1,1000
     a(i) = 0
  enddo
  do j=1,42
     call random_number(test)
     rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
     do i=1,1000
        a(i) = a(i) + rng
     enddo
  enddo
  print *, "before update self", a(42)
  !$ACC update self(a(42:42))
  print *, "after update self", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data", a(42)
  !$ACC end data
  print *, "after end data", a(42)

exemples/update_dir.f90

The a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that data transfers can be managed at a different place in the code than where the data are used. This is especially useful with C++ constructors and destructors.

  program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
  contains
  subroutine compute
     integer :: i
  !$ACC parallel loop
     do i=1,s
        vec(i) = 1.2
     enddo
  end subroutine

  end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the programming unit where it appears. For example:

  Zone         Scope
  Module       the program
  Function     the function
  Subroutine   the subroutine

  module declare
     integer :: rows = 1000
     integer :: cols = 1000
     integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
  end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of a data region, host above, device below. Entering the region, copyin(a,c) create(b) copyout(d) transfers a and c H→D and allocates b and d on the device; the device then computes new values for b and d; update self(b) copies b D→H during the region; leaving the region, copyout(d) transfers d D→H]
Data on device - Important notes

• Data transfers between the host and the device are costly; minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

  ! Update the value of a located on host with the value on device
  !$ACC update self(a)
  ! Update the value of b located on device with the value on host
  !$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines showing kernel 1, the data transfer and kernel 2 executed synchronously; then kernel 1 and the transfer asynchronous; then everything asynchronous]

  ! b is initialized on host
  do i=1,s
     b(i) = i
  enddo
  !$ACC parallel loop async(1)
  do i=1,s
     a(i) = 42
  enddo
  !$ACC update device(b) async(2)
  !$ACC parallel loop async(3)
  do i=1,s
     c(i) = 11
  enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels will likely need the completion of others. In that case, add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution-thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Figure: time line where kernel 2 waits for the transfer on execution thread 2]

  ! b is initialized on host
  do i=1,s
     b(i) = i
  enddo
  !$ACC parallel loop async(1)
  do i=1,s
     a(i) = 42
  enddo
  !$ACC update device(b) async(2)
  !$ACC parallel loop wait(2)
  do i=1,s
     b(i) = b(i) + 11
  enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, add autocompare to the compilation target (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

  do g=1,generations
     cells = 0
  !$ACC parallel
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
  !$ACC end parallel
  !$ACC parallel
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                   oworld(r-1,c)              +oworld(r+1,c)+ &
                   oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           end if
        enddo
     enddo
  !$ACC end parallel
  !$ACC parallel
     do r=1,rows
        do c=1,cols
  ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

  do g=1,generations
     cells = 0
  !$ACC parallel loop
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
  !$ACC parallel loop
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                   oworld(r-1,c)              +oworld(r+1,c)+ &
                   oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           end if
        enddo
     enddo
  !$ACC parallel loop reduction(+:cells)
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

  cells = 0
  !$ACC data copy(oworld, world)
  do g=1,generations
     cells = 0
  !$ACC parallel loop
     do c=1,cols
        do r=1,rows
           oworld(r,c) = world(r,c)
        enddo
     enddo
  !$ACC parallel loop
     do c=1,cols
        do r=1,rows
           neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                   oworld(r-1,c)              +oworld(r+1,c)+ &
                   oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           end if
        enddo
     enddo
  !$ACC parallel loop reduction(+:cells)
     do c=1,cols
        do r=1,rows
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads, executes redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• Both models: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (allocated on device, no transfers) ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
• OpenACC declare() (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM() (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ OpenMP distribute (shares iterations among TEAMS)
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
• OpenACC routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism

Asynchronism
• OpenACC async()/wait: executes kernels/transfers asynchronously, with explicit synchronization ↔ OpenMP nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

  !$acc parallel
  !$acc loop gang worker vector
  do i=1,nx
     A(i)=1.14_8*i
  end do
  !$acc end parallel

exemples/loop1D.f90

  !$acc parallel
  !$acc loop gang worker
  do j=1,nx
  !$acc loop vector
     do i=1,nx
        A(i,j)=1.14_8
     end do
  end do
  !$acc end parallel

exemples/loop2D.f90

OpenMP

  !$OMP TARGET
  !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
  do i=1,nx
     A(i)=1.14_8*i
  end do
  !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
  !$OMP END TARGET

exemples/loop1DOMP.f90

  !$OMP TARGET TEAMS DISTRIBUTE
  do j=1,nx
  !$OMP PARALLEL DO SIMD
     do i=1,nx
        A(i,j)=1.14_8
     end do
  !$OMP END PARALLEL DO SIMD
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

  !$acc parallel
  !$acc loop gang
  do k=1,nx
  !$acc loop worker
     do j=1,nx
  !$acc loop vector
        do i=1,nx
           A(i,j,k)=1.14_8
        end do
     end do
  end do
  !$acc end parallel

exemples/loop3D.f90

OpenMP

  !$OMP TARGET TEAMS DISTRIBUTE
  do k=1,nx
  !$OMP PARALLEL DO
     do j=1,nx
  !$OMP SIMD
        do i=1,nx
           A(i,j,k)=1.14_8
        end do
  !$OMP END SIMD
     end do
  !$OMP END PARALLEL DO
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr
Execution offloading - parallel directive
The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive have their iterations spread among gangs.

Data default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause: they are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause: they are PRIVATE.

  !$ACC parallel
  a = 2          ! Gang-redundant
  !$ACC loop     ! Work sharing
  do i = 1, n
     ...
  enddo
  !$ACC end parallel

Important notes
• The number of gangs, workers, and the vector size are constant inside the region.
Execution offloading - parallel directive
  !$ACC parallel
  do g=1,generations
     do r=1,rows
        do c=1,cols
           oworld(r,c) = world(r,c)
        enddo
     enddo
     do r=1,rows
        do c=1,cols
           neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                   oworld(r-1,c)              +oworld(r+1,c)+ &
                   oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
           if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
              world(r,c) = 0
           else if (neigh == 3) then
              world(r,c) = 1
           end if
        enddo
     enddo
     cells = 0
     do r=1,rows
        do c=1,cols
           cells = cells + world(r,c)
        enddo
     enddo
     print *, "Cells alive at generation ", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
  print *, "Hello, I am a gang"
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
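The same clauses exist in C. A minimal sketch (the function name `fill_iota` is mine, for illustration); with a compiler that does not support OpenACC, the pragma is ignored and the loop simply runs sequentially, which makes the behavior easy to check on the host:

```c
#include <assert.h>

/* Fill a[0..n-1] with 1..n while requesting a specific launch configuration.
   num_gangs/num_workers/vector_length are fixed for the whole region; a
   compiler without OpenACC support ignores the pragma entirely. */
void fill_iota(int *a, int n) {
    #pragma acc parallel loop num_gangs(10) num_workers(1) vector_length(128)
    for (int i = 0; i < n; ++i)
        a[i] = i + 1;
}
```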
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive written as do instead of loop, for compatibility.
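The collapse and private clauses above can be sketched in C (the function `add_outer_product` and its arguments are mine, for illustration). collapse(2) merges the two tightly nested loops into one iteration space, and tmp is privatized so each thread works on its own copy; without OpenACC the pragmas are ignored and the nest runs sequentially:

```c
#include <assert.h>

/* Add the outer product x*x^T to the n-by-n matrix a.
   collapse(2) fuses both loops; private(tmp) gives each thread its own tmp. */
void add_outer_product(int n, float a[n][n], const float *x) {
    float tmp;
    #pragma acc parallel loop collapse(2) private(tmp)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            tmp = x[i] * x[j];
            a[i][j] += tmp;
        }
    }
}
```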
Work sharing - Loops

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

!$acc serial
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the whole grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i = 1, 1000
     a(i) = b(i) * 2
  enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i = 1, 1000
     a(i) = b(i) * 2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction clause is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations of the table below is applied to obtain the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
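A minimal C sketch of a sum reduction (the function name `sum_array` is mine). Each thread accumulates into a private copy of total and the partial results are combined with + at the end of the loop; compiled without OpenACC, the loop runs sequentially and gives the same result:

```c
#include <assert.h>

/* Sum of the n first elements of a, expressed as an OpenACC reduction. */
long sum_array(const int *a, int n) {
    long total = 0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```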
Parallelism - Reductions example

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figure: example with coalescing vs. example without coalescing.)
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i = 1, nx
!$acc loop vector
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i = 1, nx
!$acc loop seq
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants. Without coalescing: gang(i), vector(j) — consecutive threads access elements one row apart. With coalescing: gang(j), vector(i) — consecutive threads access consecutive elements of a column; gang-vector(i), seq(j) — each thread sweeps a row sequentially.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
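The same pattern can be sketched in C (the names `fill_row` and `fill_matrix` are mine, for illustration). The routine directive asks the compiler to also generate a device version of the helper, callable from the offloaded loop; seq means the routine body contains no further parallelism. Without OpenACC, the pragmas are ignored and the code runs serially:

```c
#include <assert.h>

/* Device-callable helper: routine seq declares a sequential device version. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value) {
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

/* Offloaded loop calling the helper once per row. */
void fill_matrix(int n, int a[n][n]) {
    #pragma acc parallel loop copyout(a[0:n][0:n])
    for (int i = 0; i < n; ++i)
        fill_row(a[i], n, 2);
}
```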
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of the procedure. To make data visible there, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device; "variable" means a scalar or an array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
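A small C sketch of the movement clauses (the function name `double_into` is mine): b is only read on the device (copyin) and c is only produced there (copyout). Serially the pragma is ignored, so the result is directly checkable on the host:

```c
#include <assert.h>

/* c[i] = 2*b[i]: b travels H->D only, c travels D->H only. */
void double_into(const float *b, float *c, int n) {
    #pragma acc parallel loop copyin(b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = 2.0f * b[i];
}
```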
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i = 1, 1000
   do j = 1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i = 1, 1000
   do j = 100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
   for (int j = 0; j < 1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
   for (int j = 99; j < 199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
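The C/C++ restriction above can be illustrated with a malloc'd array (the function name `make_squares` is mine): the compiler cannot deduce the size of `a`, so the shape `[first:count]` must be written explicitly in the data clause. Without OpenACC, the pragma is ignored and the loop runs serially:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate and fill a[i] = i*i; the shape a[0:n] is mandatory here because
   a is dynamically allocated and its size is unknown to the compiler. */
int *make_squares(int n) {
    int *a = malloc(n * sizeof *a);
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = i * i;
    return a;
}
```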
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both the parallel region and its data region
!$ACC loop
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel

  do l = 1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
     do i = 1, 1000
        a(i) = a(i) + 1
     enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in its clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i = 1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l = 1, 100
!$ACC parallel
!$ACC loop
   do i = 1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.
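The same idea in C (the function name `increment_many` is mine): the data region keeps the array resident on the device across every iteration of the outer loop, so it is transferred once instead of once per kernel launch. Without OpenACC the pragmas are ignored, so the function is checkable serially:

```c
#include <assert.h>

/* Increment every element of a[0..n-1] reps times; the enclosing data
   region makes a transfer H->D once on entry and D->H once on exit. */
void increment_many(int *a, int n, int reps) {
    #pragma acc data copy(a[0:n])
    {
        for (int r = 0; r < reps; ++r) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] += 1;
        }
    }
}
```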
Data on device

Naive version: two successive parallel regions, each with its own implicit data region, and the data clauses (copyin, copyout, copy, create) for the arrays A, B, C and D repeated on each of them.
• Transfers: 8
• Allocations: 2
Data on device

Optimizing the data transfers: with the clauses still attached to each of the two parallel regions, every region pays its own transfers. Let's add a data region.
Data on device

Optimizing the data transfers: with an enclosing data region carrying the clauses, A, B and C are now transferred only at the entry and exit of the data region.
• Transfers: 4
• Allocation: 1
Data on device

Optimizing the data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device, update self).
• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
   a(i) = 0
enddo

do j = 1, 42
   call random_number(test)
   rng = floor(test * 100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i = 1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
   a(i) = 0
enddo

do j = 1, 42
   call random_number(test)
   rng = floor(test * 100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i = 1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included in a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i = 1, s
       vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
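A C sketch of the same decoupling (the names `setup`, `compute` and `teardown` are mine): device allocation happens in one function, the computation in another, and the release in a third. Serially the pragmas are ignored, so host and device memory coincide and the result is checkable:

```c
#include <stdlib.h>
#include <assert.h>

static float *vec;
static int vec_n;

/* Allocate on the host and create the device copy. */
void setup(int n) {
    vec_n = n;
    vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])
}

/* Fill the device copy, then bring the values back to the host. */
void compute(void) {
    #pragma acc parallel loop present(vec[0:vec_n])
    for (int i = 0; i < vec_n; ++i)
        vec[i] = 1.2f;
    #pragma acc update self(vec[0:vec_n])
}

/* Release the device copy, then the host copy. */
void teardown(void) {
    #pragma acc exit data delete(vec[0:vec_n])
    free(vec);
}
```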
Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are only allocated on the device. The host and device values then diverge as each side modifies its own copies; an update self(b) in the middle of the region copies the device value of b back to the host. On exit of the data region, d is copied D→H.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
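In C, a resynchronization point inside a data region looks like the following sketch (the function name `double_then_read` is mine). Serially, host and device copies are the same memory, so the behavior is checkable on the host:

```c
#include <assert.h>

/* Double every element on the device, then make the device results visible
   on the host with update self before the data region ends. */
void double_then_read(int *a, int n) {
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] *= 2;
        /* D->H synchronization point inside the data region */
        #pragma acc update self(a[0:n])
    }
}
```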
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers creates several execution threads.

! b is initialized on the host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
   c(i) = 11
enddo

exemples/async_slides.f90

(Kernel 1 and the transfer overlap; with async on every directive, all three operations can overlap.)
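A C sketch of two independent kernels on different asynchronous queues, joined by a wait directive before the results are used (the function name `fill_two` is mine). Without OpenACC, the queues collapse into ordinary sequential execution, which produces the same values:

```c
#include <assert.h>

/* Two independent fills placed on queues 1 and 2, then joined. */
void fill_two(float *a, float *b, int n) {
    #pragma acc parallel loop copyout(a[0:n]) async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42.0f;

    #pragma acc parallel loop copyout(b[0:n]) async(2)
    for (int i = 0; i < n; ++i)
        b[i] = 1.0f;

    /* block until both queues have completed */
    #pragma acc wait(1, 2)
}
```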
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

! b is initialized on the host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

(Kernel 2 waits for the transfer on queue 2.)
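The dependency pattern above can be sketched in C (the function name `offset_by_index` is mine): the kernel carries wait(2), so it will not start before the asynchronous transfer on queue 2 has completed. Compiled without OpenACC, everything runs in order and the result is checkable:

```c
#include <assert.h>

/* b[i] = i + 1: the kernel depends on the async update on queue 2. */
void offset_by_index(int *b, int n) {
    for (int i = 0; i < n; ++i)   /* b initialized on the host */
        b[i] = i;

    #pragma acc enter data create(b[0:n])
    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop present(b[0:n]) wait(2)
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 1;

    #pragma acc exit data copyout(b[0:n])
}
```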
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example use case: a code that has a race condition.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
   cells = 0
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g = 1, generations
   cells = 0
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are now distributed among the gangs.
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
   cells = 0
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c = 1, cols
      do r = 1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar and non-scalar variables, via clauses on PARALLEL/SERIAL/KERNELS ↔ scalar and non-scalar variables, via clauses on TARGET (data movement defined from the host perspective)

Explicit transfers H→D or D→H inside explicit parallel regions
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
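The transfer-clause correspondence can be shown on a small kernel. The function below is illustrative: the OpenACC pragma and its OpenMP 5.0 equivalent (shown as a comment) request the same data movement; without an offloading compiler both are ignored and the loop runs on the host, which is enough to check the result.

```c
/* y[i] = a*x[i] + y[i]; x is only read (copyin / map(to:)),
   y is read and written back (copy / map(tofrom:)). */
void saxpy(int n, float a, const float *x, float *y)
{
    /* OpenMP 5.0 equivalent:
       #pragma omp target teams distribute parallel for \
               map(to: x[0:n]) map(tofrom: y[0:n])          */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```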
Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare() (adds data to the global data region) ↔ (no direct equivalent listed)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)
Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism) ↔ DECLARE TARGET

Asynchronism
• async() / wait (execute kernels and transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param2
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC&         vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
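The same clauses can be attached to a combined construct in C. This sketch is illustrative (the function name and sizes are not from the course): the clause values only steer an OpenACC compiler; a compiler without OpenACC ignores the pragma and the loop still fills the array sequentially.

```c
/* Fill a[] with indices; the gang/worker/vector sizes are fixed by the
   clauses (values are illustrative, see the "use with care" note above). */
void fill(int *a, int n)
{
    #pragma acc parallel loop num_gangs(10) num_workers(1) vector_length(128)
    for (int i = 0; i < n; ++i)
        a[i] = i;
}
```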
Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism; this point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might encounter the directive do, kept for compatibility.
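The collapse clause can be illustrated with a small C sketch (function and sizes are hypothetical): the two tightly nested loops are merged into a single iteration space of ROWS*COLS, exposing more parallelism to the scheduler. Without an OpenACC compiler the pragma is ignored and the nest runs sequentially with the same result.

```c
#define ROWS 100
#define COLS 100

/* collapse(2) merges the two tightly nested loops into one
   ROWS*COLS iteration space shared among the threads. */
void init(double m[ROWS][COLS])
{
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            m[i][j] = i + j;
}
```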
Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present with dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 978452.13 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 1000000.00 (right value)
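The safe pattern can be sketched in C (the function name and size are illustrative). Here the dependent loops sit in two separate parallel constructs, so the implicit synchronization between constructs removes the race, and the reduction clause gives each thread a private copy of the accumulator. Without an OpenACC compiler the pragmas are ignored and the sum is computed sequentially.

```c
/* Initialize an array, then sum it. Splitting the two dependent loops
   into separate parallel constructs guarantees the first has finished
   before the reduction starts. */
double init_and_sum(double *a, int n)
{
    double sum = 0.0;

    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.0;

    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (int i = 0; i < n; ++i)
        sum += a[i];

    return sum;
}
```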
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels

   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
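A reduction is not limited to sums; a max reduction looks like this in C (illustrative function name). Each thread keeps a private running maximum, and the max operator combines them at the end of the loop; without an OpenACC compiler the pragma is ignored and the scan is sequential.

```c
/* Maximum of an array using the max reduction operator. */
int array_max(const int *a, int n)
{
    int m = a[0];
    #pragma acc parallel loop reduction(max:m) copyin(a[0:n])
    for (int i = 1; i < n; ++i)
        if (a[i] > m)
            m = a[i];
    return m;
}
```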
Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figures: an example with coalescing and an example without coalescing.)
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.
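In C the memory layout is row-major, the opposite of Fortran, so coalesced accesses require the vector loop to run over the last index. A minimal sketch (function name and size are illustrative; the pragmas are ignored by a non-OpenACC compiler, which is enough to check correctness, though of course not the bandwidth effect):

```c
#define NX 64

/* Row-major C array: consecutive j values are contiguous in memory,
   so the vector loop runs over j to obtain coalesced accesses
   (the mirror image of the Fortran examples above). */
void fill_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```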
Parallelism - Memory accesses and coalescing

(Diagrams: mapping of threads T0..T7 to array elements for the three variants. Without coalescing: gang(i)/vector(j). With coalescing: gang(j)/vector(i), and gang-vector(i)/seq(j).)
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
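The C equivalent places `#pragma acc routine` just before the callee. The functions below are illustrative sketches: `routine seq` asks the compiler to also generate a sequential device version of `square` so it can be called from the offloaded loop; without OpenACC support everything runs as plain C.

```c
/* `routine seq` requests a sequential device version of this function. */
#pragma acc routine seq
static double square(double x)
{
    return x * x;
}

/* Each iteration of the offloaded loop calls the device routine. */
void fill_squares(double *a, int n)
{
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = square(i);
}
```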
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: the offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and the last index.

! We copy a 2D array on the GPU for matrix a;
! in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region implicitly opens a data region around it.)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. Use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device

Naive version: each of the two parallel regions carries its own implicit data region. For the four arrays of the example (A: copyin, B: copyout, C: copy, D: create, in each region):
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred only at entry and exit of the data region:
• Transfers: 4
• Allocations: 1

The data clauses check for data presence, so to make the code clearer you can use the present clause on the inner regions. To make sure the updates are done, use the update directive (update device, update self):
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC&              copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC&              copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90

With update: the a array is updated on the host after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
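The same lifetime management can be sketched in C around a heap allocation (the function name is illustrative). `enter data create` allocates the device copy, `update self` refreshes the host copy before the device copy is released with `exit data delete`; without an OpenACC compiler the pragmas are ignored and the function still computes on the host.

```c
#include <stdlib.h>

/* Allocate on the host, create the device copy with `enter data`,
   fill it on the device, refresh the host copy, then release the
   device copy with `exit data`. */
double *make_vector(int n)
{
    double *vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])

    #pragma acc parallel loop present(vec)
    for (int i = 0; i < n; ++i)
        vec[i] = 1.2;

    #pragma acc update self(vec[0:n])      /* copy the result D->H */
    #pragma acc exit data delete(vec[0:n])
    return vec;
}
```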
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

(Diagram: a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device. During the region, an update self(b) copies b D→H; other host and device modifications are not seen by the other side. At exit, d is copied D→H.)
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Diagram: kernel 1, a data transfer and kernel 2 run one after the other when synchronous; with async, kernel 1 and the transfer overlap, and with everything async all three can overlap.)

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 114
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Features                 OpenACC 2.7                                OpenMP 5.0
                         Directives/Clauses and Notes               Directives/Clauses and Notes

GPU offloading           PARALLEL: GR/WS/VS mode, each gang         TEAMS: create teams of threads,
                         executes the instructions redundantly      which execute redundantly
                         SERIAL: only 1 thread executes the code    TARGET: GPU offloading, initial
                                                                    thread only
                         KERNELS: offloading + automatic            X: does not exist in OpenMP
                         parallelization

Implicit transfer        scalar variables, non-scalar variables     scalar variables, non-scalar
H→D or D→H                                                          variables

Explicit transfer        clauses on PARALLEL/SERIAL/KERNELS         clauses on TARGET: data movement
H→D or D→H inside                                                   defined from the host perspective
explicit parallel        copyin(): copy H→D when entering           map(to:)
regions                  the construct
                         copyout(): copy D→H when leaving           map(from:)
                         the construct
                         copy(): copy H→D when entering and         map(tofrom:)
                         D→H when leaving
                         create(): allocated on device,             map(alloc:)
                         no transfers
                         present(), delete()
Features                 OpenACC 2.7                                OpenMP 5.0
                         Directives/Clauses and Notes               Directives/Clauses and Notes

Data region              declare: add data to the global
                         data region
                         data / end data: inside the same           TARGET DATA MAP(tofrom:) /
                         programming unit                           END TARGET DATA: inside the same
                                                                    programming unit
                         enter data / exit data: opening and        TARGET ENTER DATA / TARGET EXIT
                         closing anywhere in the code               DATA: opening and closing anywhere
                                                                    in the code
                         update self(): update D→H                  UPDATE FROM: copy D→H inside
                                                                    a data region
                         update device(): update H→D                UPDATE TO: copy H→D inside
                                                                    a data region

Loop parallelization     kernels: offloading + automatic            X: does not exist in OpenMP
                         parallelization of suitable loops,
                         sync at the end of loops
                         loop gang/worker/vector: share             distribute: share iterations
                         iterations among gangs, workers            among TEAMS
                         and vectors
Features                 OpenACC 2.7                                OpenMP 5.0
                         Directives/Clauses and Notes               Directives/Clauses and Notes

Routines                 routine [seq|vector|worker|gang]:
                         compile a device routine with the
                         specified level of parallelism

Asynchronism             async()/wait: execute kernels and          nowait/depend: manage asynchronism
                         transfers asynchronously, with             and task dependencies
                         explicit synchronization
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Execution offloading - parallel directive
acc=autopar activated / acc=noautopar

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (ie the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might come across the directive do, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations of table 1 is applied to get the final value of the variable.

Operations

Operation  Effect       Language(s)         Operation  Effect       Language(s)
+          Sum          Fortran/C(++)       iand       Bitwise and  Fortran
*          Product      Fortran/C(++)       ior        Bitwise or   Fortran
max        Maximum      Fortran/C(++)       ieor       Bitwise xor  Fortran
min        Minimum      Fortran/C(++)       .and.      Logical and  Fortran
&          Bitwise and  C(++)               .or.       Logical or   Fortran
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)

Table 1: Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous accesses to memory by the threads of a worker can be merged: this optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing / example without coalescing]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Diagram: memory access patterns on a 2D array. Without coalescing, gang(i)/vector(j): each thread strides through memory, so neighbouring threads touch distant locations. With coalescing, gang(j)/vector(i): consecutive threads T0..T7 access consecutive locations and the accesses merge; gang-vector(i)/seq(j) also keeps each access pattern contiguous across threads.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region implicitly opens a data region around it.]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version: two parallel regions, each with its own implicit data region. A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated at each region.
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers: the same two parallel regions, each still carrying its own clauses for A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.
Data on device

With an enclosing data region, A, B and C are now transferred at the entry and exit of the data region only.
• Transfers: 4
• Allocations: 1
Data on device

The data clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device before the first region, update self after the second).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables listed in self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90
Without update

The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables listed in self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90
With update

The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables listed in a device clause are updated H→D.
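As a hedged C sketch of this H→D update (our own illustration, not code from the slides; the function name `sum_scaled` is hypothetical): a scalar is modified on the host in the middle of a data region, so its device copy must be refreshed with update device before the next kernel reads it. Without an OpenACC compiler the pragmas are simply ignored and the serial result is identical.

```c
#include <stddef.h>

#define N 1000

/* Hypothetical sketch: 'scale' changes on the host inside a data
 * region, so the device copy must be refreshed H->D with
 * 'update device' before the second kernel reads it. Compiled without
 * OpenACC, the pragmas are ignored and the result is the same. */
double sum_scaled(void)
{
    static double a[N];
    double scale = 1.0;
    double total = 0.0;

#pragma acc data copy(a[0:N]) copyin(scale)
    {
#pragma acc parallel loop present(a)
        for (int i = 0; i < N; ++i)
            a[i] = 1.0;

        scale = 2.0;                 /* host-side modification  */
#pragma acc update device(scale)     /* push the new value H->D */

#pragma acc parallel loop present(a, scale)
        for (int i = 0; i < N; ++i)
            a[i] *= scale;
    }

    /* checked on the host: the copy D->H happened at 'end data' */
    for (int i = 0; i < N; ++i)
        total += a[i];
    return total;                    /* N elements times 2.0 */
}
```

Without the update device line, an offloaded run would still multiply by the stale device value 1.0.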
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is that data transfers can be managed at a different place in the code than where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device; the kernels then modify the device copies; update self(b) copies b D→H during the region; at exit only d is copied back D→H, so host values of a, b and c keep their last host-side state.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since the data clauses of kernels and parallel only check for presence, if a data region opened with data encloses the parallel regions you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.
! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Figure: three time lines. Synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": the transfer overlaps Kernel 1. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.]
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.
[Figure: Kernel 1 runs on execution thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer before starting.]
! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
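The slide's example is a screenshot that did not survive extraction; a minimal sketch of the kind of defect autocompare flags could look like the following (our own illustration, not the code from the slide). The loop carries a dependency on the previous element, so a parallel GPU execution may read stale values while the redundant CPU run stays sequential, and autocompare then reports the mismatch. Compiled without OpenACC, the pragma is ignored and the sequential result is produced.

```c
#include <stddef.h>

/* Hypothetical illustration of a race: the iterations are NOT
 * independent (a[i] reads a[i-1]), so the 'parallel loop' below is
 * incorrect on the GPU. The CPU reference run of autocompare computes
 * the sequential result and the two executions can differ. */
void prefix_fill(double *a, size_t n)
{
    a[0] = 0.0;
#pragma acc parallel loop /* WRONG: loop-carried dependency */
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1] + 1.0;
}
```

Run serially, prefix_fill produces 0, 1, 2, ...; an offloaded run with this bad directive has no such guarantee.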
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/optigol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL (GR, WS, VS mode; each gang executes the instructions redundantly) → TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) → TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) → does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables
• OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS → OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct → map(to:)
• copyout() — copy D→H when leaving the construct → map(from:)
• copy() — copy H→D when entering and D→H when leaving → map(tofrom:)
• create() — created on the device, no transfers → map(alloc:)
• present(), delete()
Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — adds data to the global data region
• data / end data (inside the same programming unit) → TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) → TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() — update D→H → UPDATE FROM() — copy D→H inside a data region
• update device() — update H→D → UPDATE TO() — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops → does not exist in OpenMP
• loop gang/worker/vector — shares iterations among gangs, workers and vectors → distribute — shares iterations among TEAMS
Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compiles a device routine with the specified level of parallelism
• asynchronism: async()/wait — executes kernels/transfers asynchronously, with explicit synchronization → nowait/depend — manages asynchronism and task dependencies
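The nowait/depend pair in the last row can be sketched in C (a hypothetical example of ours, not from the slides; with a non-offloading compiler the pragmas are ignored and the sequential result is identical):

```c
#include <stddef.h>

#define N 1000

/* Two deferred target tasks ordered by 'depend' on array a: the first
 * kernel must complete before the second starts, and 'taskwait'
 * synchronizes with both before the host reads the result. */
double async_sum(void)
{
    static double a[N];
    double s = 0.0;

#pragma omp target map(tofrom: a[0:N]) nowait depend(out: a)
    for (int i = 0; i < N; ++i)
        a[i] = 42.0;

#pragma omp target map(tofrom: a[0:N]) nowait depend(inout: a)
    for (int i = 0; i < N; ++i)
        a[i] += 1.0;

#pragma omp taskwait

    for (int i = 0; i < N; ++i)
        s += a[i];
    return s;   /* N * 43.0 */
}
```

This plays the role of async(1) ... wait(1) in OpenACC: the dependency, not a blocking call, orders the two kernels.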
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr
Parallelization control
The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size
program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10), num_workers(1), &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90
Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies
             Work sharing (loop iterations)   Activation of new threads
loop gang                 X
loop worker               X                              X
loop vector               X                              X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable list)
• Reduction: reduction(operation:variable list)

Notes
• You might see the directive do, kept for compatibility.
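A small C sketch of collapse and private (our own illustration, not from the slides): the two nested loops are merged into a single 100x200 iteration space and tmp gets one private copy per thread. Without OpenACC the pragma is ignored and the serial result is the same, which makes the sketch easy to verify.

```c
#include <stddef.h>

#define ROWS 100
#define COLS 200

/* 'collapse(2)' merges the two loops into one iteration space shared
 * among the threads; 'private(tmp)' gives each thread its own scratch
 * copy of tmp, avoiding a race on it. */
double fill_half_sums(void)
{
    static double a[ROWS][COLS];
    double tmp;
    double s = 0.0;

#pragma acc parallel loop collapse(2) private(tmp)
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c) {
            tmp = (double)(r + c);
            a[r][c] = tmp * 0.5;
        }

    /* host-side check of the result */
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            s += a[r][c];
    return s;
}
```

Without private(tmp), an offloaded run would share one tmp among all threads and the stored values would be unpredictable.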
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx
Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx
Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx
Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx
Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx
Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx
Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx
Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.
!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213.0 (wrong value)
!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000.0 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:
program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels

  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.
Operations

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Table: reduction operators
Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
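As a C sketch of a + reduction (our own illustration): each thread accumulates into a private copy of s and the partial sums are combined when the loop ends. With the pragma ignored by a plain compiler, the same serial result is produced.

```c
/* Sum of squares with a '+' reduction: 's' is privatized per thread
 * and the partial results are combined at the end of the loop. */
double sum_squares(int n)
{
    double s = 0.0;

#pragma acc parallel loop reduction(+:s)
    for (int i = 1; i <= n; ++i)
        s += (double)i * (double)i;

    return s;   /* n(n+1)(2n+1)/6 */
}
```

Without reduction(+:s), the concurrent updates of s would race and the offloaded result would be wrong, exactly as in the gang-synchronization example above.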
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs example without coalescing]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 onto array elements for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads access elements that are nx apart in memory. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements of a column. gang-vector(i)/seq(j): each thread walks sequentially along its own row.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive on a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
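A C sketch combining several of these clauses (our own illustration; with the pragmas ignored by a plain compiler, the host memory is used directly): the array is allocated on the device with enter data create, the kernel only asserts presence, and exit data copyout brings the result back and releases the device copy.

```c
#include <stdlib.h>

/* The device copy of v is created (no transfer) by 'enter data',
 * the kernel uses 'present' because the data is already on the
 * device, and 'exit data copyout' copies D->H and frees the
 * device memory. */
double *iota_array(int n)
{
    double *v = malloc((size_t)n * sizeof *v);
    if (v == NULL)
        return NULL;

#pragma acc enter data create(v[0:n])

#pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = (double)i;

#pragma acc exit data copyout(v[0:n])
    return v;
}
```

This mirrors the Fortran enter data/exit data example above, with copyout instead of delete so the result survives on the host.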
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses: you give the first and last index.
! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is specified in square brackets: you give the first index and the number of elements.
// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to their execution.
program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to their execution.
program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure; by doing this, you make the variables listed in its clauses visible on the device. Use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region to before the first loop.
Data on device
Naive version: two successive parallel regions (each with its own implicit data region) operate on A, B, C and D through copyin, copyout, copy and create clauses.
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: the same two parallel regions (each with its own implicit data region) still carry their copyin, copyout, copy and create clauses for A, B, C and D. Let's add a data region.
Data on device
Optimize data transfers: with a data region enclosing both parallel regions, A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers: the data clauses check for presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device at entry, update self at exit).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host)
The variables listed in self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host)
The variables listed in self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device
The variables listed in a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is that data transfers to and from the device can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Figure: time line of the host and device copies of a, b, c and d. At entry of the region (copyin(a,c) create(b) copyout(d)), a and c are transferred H→D and memory is allocated for b and d. During the region, update self(b) copies b D→H. At exit, only d is copied back D→H (copyout).]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since data clauses may also appear on kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with async on Kernel 1 and the transfer, they overlap; with async everywhere, all three overlap.]
! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution-thread-number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.
[Figure: Kernel 2 waits for the data transfer to complete before starting; Kernel 1 runs concurrently.]
! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      ! (listing truncated on the slide)

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL — GR/WS/VS mode: each gang executes the instructions redundantly | OpenMP TEAMS — creates teams of threads that execute redundantly
• OpenACC SERIAL — only 1 thread executes the code | OpenMP TARGET — GPU offloading, initial thread only
• OpenACC KERNELS — offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar and non-scalar variables | OpenMP: scalar and non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data region
• declare() — adds data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(to/from:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector — shares iterations among gangs, workers and vectors | distribute — shares iterations among TEAMS
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

• routines — routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism
• asynchronism — async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.
Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into kernels loop or parallel loop. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Operations

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged (coalesced). This optimizes the use of memory bandwidth.
• Coalescing happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop with vector parallelism should access memory contiguously.

[Figure: access patterns with «coalescing» and with «no coalescing».]
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing
[Figure: thread-to-element mapping for the three variants: gang(i)/vector(j) (no coalescing), gang(j)/vector(i) (coalescing) and gang-vector(i)/seq(j). With coalescing, consecutive threads T0..T7 access consecutive memory locations.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened for the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses
H: Host; D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already present on the device. If so, no action is taken. You may also see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel construct implicitly opens a data region around the compute region.]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (entry and exit, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (entry and exit, so potentially 2 data transfers).

Optimal?
No! You have to move the beginning of the data region to before the first loop.
Data on device
Naive version: two successive parallel regions, each with its own implicit data region. The variables A (copyin), B (copyout), C (copy) and D (create) are therefore handled once per region.
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: the same two parallel regions still transfer A (copyin), B (copyout), C (copy) and D (create) separately. Let's add a data region.
Data on device
Optimize data transfers: with an enclosing data region, A, B and C are now transferred at the entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers: inside the data region, the parallel regions find the data already on the device. The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place in the code than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is declared. For example:

Zone: module, scope: the program
Zone: function, scope: the function
Zone: subroutine, scope: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Timeline figure: the host opens the region with copyin(a,c) create(b) copyout(d); a and c are copied H→D at entry while b is only allocated; the device then computes new values; update self(b) copies b D→H during execution; at the exit of the data region, d is copied back D→H (copyout) and the device memory is released.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Timeline figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with async, Kernel 1 and the transfer overlap; with everything asynchronous, all three overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Timeline figure: Kernel 2 waits for the completion of the transfer before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
• To activate this feature, add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
  ! ... (excerpt truncated)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)
Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Work sharing - Loops
Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the do directive, kept for compatibility.
Work sharing - Loops
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90
Work sharing - Loops
serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops
Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels

  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90
Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators
Operation | Effect | Language(s)
+ | Sum | Fortran, C(++)
* | Product | Fortran, C(++)
max | Maximum | Fortran, C(++)
min | Minimum | Fortran, C(++)
& | Bitwise and | C(++)
| | Bitwise or | C(++)
&& | Logical and | C(++)
|| | Logical or | C(++)
iand | Bitwise and | Fortran
ior | Bitwise or | Fortran
ieor | Bitwise xor | Fortran
.and. | Logical and | Fortran
.or. | Logical or | Fortran

Restrictions
The variable must be a scalar with a numerical value type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous accesses to memory by the threads of a worker can be merged, which optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should access memory contiguously.

[Figure: memory access patterns with «coalescing» and without «coalescing».]
Parallelism - Memory accesses and coalescing
! Without coalescing: the vector loop runs over the slowest-varying index
!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

! With coalescing: the vector loop runs over the contiguous index
!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time: 43.9 ms without coalescing, 1.6 ms with coalescing.
The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing
[Figure: mapping of threads T0..T7 onto the elements of A for the three variants; without coalescing: gang(i), vector(j); with coalescing: gang(j), vector(i), and gang vector(i), seq(j). In the coalesced variants, consecutive threads access consecutive memory locations.]
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of the procedure. To make data visible, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may still see clauses prefixed with present_or_ (or p_), kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to their execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region implicitly opens a data region around it.)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to their execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal? No: you still have to move the beginning of the data region before the first loop.
Data on device
Naive version: two successive parallel regions, each with its own implicit data region. The variables A (copyin), B (copyout), C (copy) and D (create) are handled again at each region.
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: with a data region enclosing both parallel regions, A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers: the data clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are actually done, use the update directive (update device before the first region, update self after the last).
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes: one benefit of the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: module     → scope: the program
Zone: function   → scope: the function
Zone: subroutine → scope: the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Figure: host/device time line for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device; the device then works on its own copies while the host values stay unchanged; update self(b) copies b D→H during the region; when leaving the data region, only d is copied back (copyout).]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause creates several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", Kernel 1 overlaps the transfer; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important note: the argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer on queue 2 before starting, while Kernel 1 may overlap it.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading:
• PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) | TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) | TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) | X (does not exist in OpenMP)

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) | map(to:)
• copyout() (copy D→H when leaving the construct) | map(from:)
• copy() (copy H→D when entering and D→H when leaving) | map(tofrom:)
• create() (created on device, no transfers) | map(alloc:)
• present(), delete() | —
Features | OpenACC 2.7 | OpenMP 5.0

Data region:
• declare() (add data to the global data region) | —
• data / end data (inside the same programming unit) | TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) | TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) | UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) | UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) | X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) | distribute (share iterations among TEAMS)
Features | OpenACC 2.7 | OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism) | —

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) | nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- In loop nests, the loop with vector parallelism should be the one with contiguous memory accesses.

Example with «coalescing» / Example without «coalescing»
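The same principle holds in C, with one twist: C arrays are row-major (the opposite of Fortran), so the vector loop should run over the last index. A small sketch under that assumption (function names hypothetical; without an OpenACC compiler the pragmas are ignored and the fill still runs correctly on the host):

```c
#define NX 64

/* Coalesced in C: the vector loop runs over j, the contiguous (last) index,
   so thread i and thread i+1 of a vector touch adjacent memory locations. */
void fill_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}

/* Helper for checking: returns the sum of the two corner elements,
   2 * 1.14 if every element was written. */
double fill_and_check(void)
{
    double a[NX][NX];
    fill_coalesced(a);
    return a[0][0] + a[NX-1][NX-1];
}
```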
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

[Figure: step-by-step traversal of the array A for the three variants, with threads T0..T7 marked on the elements they access at each step.
- Without coalescing, gang(i)/vector(j): the threads of a vector access elements of the same row, which are far apart in memory in Fortran.
- With coalescing, gang(j)/vector(i): the threads access consecutive elements of a column, contiguous in Fortran, so the accesses are merged.
- With coalescing, gang-vector(i)/seq(j): the threads are spread over i, so at each step of the sequential j loop they access contiguous elements.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
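The same pattern can be sketched in C (names hypothetical): `fill_row` is called from inside an offloaded loop, so `acc routine seq` asks the compiler to generate a sequential device version of it. Without OpenACC the pragmas are ignored and the code runs on the host.

```c
/* fill_row is called from an offloaded loop, so a device version must be
   generated: "routine seq" declares that it runs on a single thread. */
#pragma acc routine seq
static void fill_row(int *row, int cols, int value)
{
    for (int j = 0; j < cols; ++j)
        row[j] = value;
}

/* Small fixed size for the sketch; returns 2 + 2 = 4 when every row
   was filled. */
int fill_matrix(void)
{
    enum { N = 8 };
    static int a[N][N];
    #pragma acc parallel loop copyout(a)
    for (int i = 0; i < N; ++i)
        fill_row(a[i], N, 2);
    return a[0][0] + a[N-1][N-1];
}
```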
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

- Computation offloading: the offloading regions (serial, parallel, kernels) have a local data region associated with them.
- Global region: an implicit data region is open during the lifetime of the program. The management of this region is done with the enter data and exit data directives.
- Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
- Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach
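A short C sketch combining the clauses above (function name and sizes hypothetical): `b` is only read on the device (copyin), `a` is only written back (copyout), and `tmp` is device scratch that never travels (create). Without OpenACC the pragmas are ignored and the computation still runs on the host.

```c
/* copyin: b is copied H->D on entry; copyout: a is copied D->H on exit;
   create: tmp is allocated on the device only, never transferred. */
void scale_shift(int n, double *a, const double *b)
{
    double tmp[256];   /* scratch; valid for n <= 256 in this sketch */
    #pragma acc data copyin(b[0:n]) copyout(a[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0 * b[i];
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = tmp[i] + 1.0;
    }
}

/* Helper for checking: a[3] = 2*4 + 1 = 9. */
double scale_shift_check(void)
{
    double a[4], b[4] = {1.0, 2.0, 3.0, 4.0};
    scale_shift(4, a, b);
    return a[3];
}
```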
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

! We copy a 2D array to the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
- The shape must be specified when using a slice.
- If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
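The C restriction can be made concrete with a small sketch (function name hypothetical): for a dynamically allocated array the compiler cannot deduce the size, so the element count is mandatory, and `a[0:n]` means "n elements starting at index 0" (start index and count, not first and last index as in Fortran).

```c
#include <stdlib.h>

/* The shape a[0:n] is required here: a is a bare pointer, so the compiler
   has no way to know how many elements to transfer. */
double fill_dynamic(int n)
{
    double *a = malloc(n * sizeof *a);
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 0.5 * i;
    double last = a[n-1];   /* 0.5 * (n-1) */
    free(a);
    return last;
}
```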
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to their execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region is enclosed in an implicit data region.)
Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing so, you make the variables listed in its clauses visible on the device. You have to use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
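A sketch of the corrected placement in C (names hypothetical): the data region opens before the first kernel, so with a real OpenACC compiler the array would be allocated once, kept on the device across the 100 iterations, and copied back a single time. Compiled without OpenACC, the pragmas are ignored and the result is unchanged.

```c
/* Corrected placement: the data region encloses the first kernel too,
   so a is transferred once instead of once per parallel region. */
void iterate(int n, int *a)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop       /* first kernel: inside the region */
        for (int i = 0; i < n; ++i)
            a[i] = i + 1;
        for (int l = 0; l < 100; ++l) {
            #pragma acc parallel loop   /* a stays on the device: no copy */
            for (int i = 0; i < n; ++i)
                a[i] = a[i] + 1;
        }
    }                                   /* single D->H copy of a here */
}

/* Helper for checking: a[4] = 5 + 100 increments = 105. */
int iterate_check(void)
{
    int a[5];
    iterate(5, a);
    return a[4];
}
```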
Data on device - Optimizing transfers

[Figure: two parallel regions (each with its own implicit data region) using four variables A, B, C and D through the clauses copyin, copyout, copy and create.]

Naive version: each parallel region transfers its own data.
- Transfers: 8
- Allocations: 2

Let's add a data region enclosing both parallel regions and move the clauses onto it: A, B and C are now transferred only at entry and exit of the data region.
- Transfers: 4
- Allocations: 1

Since the clauses check for data presence, you can use the present clause on the parallel regions to make the code clearer. To make sure the updates are actually done where needed, use the update directive (update device before the first region, update self after the second).
- Transfers: 6
- Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.
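A C sketch of update device (names hypothetical): `b` changes on the host in the middle of a data region, so the new values must be pushed H→D before the next kernel reads them. Without OpenACC the pragmas are ignored and the host result is the same.

```c
/* b is modified on the host after the copyin; "update device" pushes the
   new host values H->D so the following kernel sees them. */
double scale_with_update(void)
{
    double a[4], b[4] = {1.0, 2.0, 3.0, 4.0};
    #pragma acc data copyin(b) copyout(a)
    {
        for (int i = 0; i < 4; ++i)   /* host-side modification of b */
            b[i] *= 10.0;
        #pragma acc update device(b)  /* refresh the device copy H->D */
        #pragma acc parallel loop
        for (int i = 0; i < 4; ++i)
            a[i] = b[i] + 1.0;
    }
    return a[0];                      /* 1*10 + 1 = 11 */
}
```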
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of the host and device copies of the variables a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d):
- entering the region transfers a and c H→D and allocates b and d on the device;
- kernels then modify the device copies while the host copies keep their old values;
- update self(b) copies the device value of b back to the host in the middle of the region;
- host-side assignments after that point do not affect the device copies;
- leaving the region copies d D→H (copyout); the other host variables keep their host values.]
Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since data clauses can also appear on kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

The update directive is used to update data either on the host or on the device:

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; by specifying numbers you can create several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run back to back; with async, kernel 1 and the transfer overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
- The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer on execution thread 2 before starting.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
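The async/wait pattern above can be sketched in C (names hypothetical): the first kernel is independent and runs on execution thread 1, the update of `b` runs on thread 2, and the kernel that reads `b` waits for thread 2; a final standalone wait synchronizes before leaving the data region. Compiled without OpenACC, everything runs in order with the same result.

```c
/* Mirror of the Fortran example: one independent kernel, one async update,
   and one kernel that must wait for the update before reading b. */
int async_wait_sketch(void)
{
    enum { N = 8 };
    int a[N], b[N];
    for (int i = 0; i < N; ++i)
        b[i] = i;                          /* b is initialized on the host */
    #pragma acc data copyout(a) copy(b)
    {
        #pragma acc parallel loop async(1) /* does not touch b */
        for (int i = 0; i < N; ++i)
            a[i] = 42;
        #pragma acc update device(b) async(2)
        #pragma acc parallel loop wait(2)  /* needs the updated b */
        for (int i = 0; i < N; ++i)
            b[i] = b[i] + 1;
        #pragma acc wait                   /* sync before leaving the region */
    }
    return a[0] + b[N - 1];                /* 42 + (7 + 1) = 50 */
}
```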
Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition: [figure omitted]
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/opti/gol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs:

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
- OpenACC PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
- OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
- OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
- OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET; data movement defined from the host perspective):
- copyin() (copy H→D when entering the construct) ↔ map(to:)
- copyout() (copy D→H when leaving the construct) ↔ map(from:)
- copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
- create() (created on device, no transfers) ↔ map(alloc:)
- present(), delete()
Data region:
- OpenACC declare() (adds data to the global data region)
- OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
- OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
- OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization:
- OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
- OpenACC loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ OpenMP distribute (shares iterations among TEAMS)
Routines:
- OpenACC routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism:
- OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Work sharing - Loops

serial:

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GRWSVS (gang-redundant, worker-single, vector-single):

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by the gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GPWSVS (gang-partitioned, worker-single, vector-single):

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx
Execution GPWPVS (gang-partitioned, worker-partitioned, vector-single):

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x number of workers
- Number of operations: nx

Execution GPWSVP (gang-partitioned, worker-single, vector-partitioned):

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the active worker of all gangs
- Active threads: 10 x vector length
- Number of operations: nx

Execution GPWPVP (gang-partitioned, worker-partitioned, vector-partitioned):

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x number of workers x vector length
- Number of operations: nx

Execution GRWPVS (gang-redundant, worker-partitioned, vector-single):

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 x number of workers
- Number of operations: 10 x nx

Execution GRWSVP (gang-redundant, worker-single, vector-partitioned):

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vector length
- Number of operations: 10 x nx

Execution GRWPVP (gang-redundant, worker-partitioned, vector-partitioned):

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations, and its grid of threads distributes the work
- Active threads: 10 x number of workers x vector length
- Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present with dependencies between them, then you must not use gang parallelism:

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

- Result: somme = 978452.13 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

- Result: somme = 1000000.00 (right value)
Parallelism - Merging kernels/parallel and loop directives

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for the combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - ReductionsA variable assigned to a reduction is privatised for each element of the parallelism level of the loopAt the end of the region an operation is executed among those described in table 1 to get the finalvalue of the variable
Operations
Operation Effect Language(s)+ Sum FortranC(++) Product FortranC(++)max Maximum FortranC(++)min Maximum FortranC(++)amp Bitwise and C(++)| Bitwise or C(++)ampamp Logical and C(++)|| Logical or C(++)
Operation Effect Language(s)iand Bitwise and Fortranior Bitwise ou Fortranieor Bitwise xor Fortranand Logical and Fortranor Logica or Fortran
Reduction operators
RestrictionsThe variable must be a scalar with numerical value C(char int float double _Complex) C++(charwchar_t int float double) Fortran(integer real double precision complex)In the OpenACC specification 27 reductions are possible on arrays but the implementation islacking in PGI (for the moment)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77
Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop with vector parallelism should access memory contiguously.

Example with coalescing / example without coalescing

Pierre-François Lavallée, Thibaut Véry 45/77
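The Fortran listings that follow are column-major, so coalescing wants vector parallelism on the first index i. In C (row-major) it is the opposite: the innermost/vector loop should run over the last index. A hedged sketch (array and function names are ours; a non-OpenACC compiler just ignores the pragmas and fills the array sequentially):

```c
/* Row-major C array: a[i][j] and a[i][j+1] are contiguous, so the
   vector loop runs over j to get coalesced accesses. */
#define N 64
static double a[N][N];

void fill_coalesced(void)
{
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        /* thread t touches a[i][j], thread t+1 touches a[i][j+1] */
        #pragma acc loop vector
        for (int j = 0; j < N; ++j) {
            a[i][j] = 1.14;
        }
    }
}
```

Swapping the gang and vector loops here would give the "no coalescing" pattern of the next slide.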
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Tps (ms)    439                  16

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77
Parallelism - Memory accesses and coalescing

[Diagrams: successive animation frames showing how threads T0–T7 of a worker access the elements of A for the three variants. Without coalescing (gang(i)/vector(j)), consecutive threads touch non-contiguous memory locations; with coalescing (gang(j)/vector(i)), consecutive threads touch contiguous locations; the gang-vector(i)/seq(j) variant also yields contiguous accesses.]

Pierre-François Lavallée, Thibaut Véry 47/77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77
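The same pattern can be sketched in C (names are ours, not from the slides). The routine seq directive asks for a sequential device version of the helper; a compiler without OpenACC support ignores both pragmas and the code runs sequentially:

```c
/* Sketch: a helper called from an offloaded loop must be declared
   as an OpenACC routine with its parallelism level (here seq). */
#pragma acc routine seq
static int doubled(int x)
{
    return 2 * x;
}

void apply_doubled(int *v, int n)
{
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = doubled(v[i]);
}
```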
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50/77
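A minimal sketch of these clauses on a compute construct (the function name saxpy is ours; without OpenACC support the pragma is ignored and the loop runs on the host):

```c
/* Sketch: x is only read on the device (copyin); y is read and
   written (copy = copyin + copyout). Nothing here needs create. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```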
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region and its associated data region.]

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you still have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77
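The same idea can be sketched in C (function name ours): one structured data region around the whole time loop, so the array is transferred once instead of once per kernel launch. Without OpenACC support the pragmas are ignored and the loops run sequentially:

```c
/* Sketch: `a` crosses the device boundary once (copy), and each
   kernel inside declares the data already present. */
void iterate(int *a, int n, int steps)
{
    #pragma acc data copy(a[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i)
                a[i] += 1;
        }
    }
}
```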
Data on device

[Diagram: two parallel regions (each with its own implicit data region) using variables A (copyin), B (copyout), C (copy) and D (create).]

Naive version: each parallel region performs its own transfers.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device and update self where needed).
• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77
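A hedged C counterpart of this unstructured lifetime (all function names are ours; without OpenACC support the pragmas are ignored, the code runs on the host and the update is a no-op):

```c
#include <stdlib.h>

/* Sketch: device allocation tied to host allocation, mirroring the
   Fortran enter data / exit data example. */
double *make_vec(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])  /* device allocation, no transfer */
    return v;
}

void fill_vec(double *v, int n)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 1.2;
    #pragma acc update self(v[0:n])        /* bring the results back D->H */
}

void free_vec(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])   /* free the device copy */
    free(v);
}
```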
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77
Data on device - Time line

[Diagram: timeline of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). Entering the region transfers a and c H→D and allocates b and d on the device; update self(b) copies b D→H during the region; at exit, d is copied D→H, while host-side changes that were never pushed to the device are lost.]

Pierre-François Lavallée, Thibaut Véry 59/77
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77
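Both update directions can be sketched together in C (function name ours; with the pragmas ignored the code runs sequentially and the updates are no-ops, giving the same result):

```c
/* Sketch: update device pushes a host-side change to the GPU copy,
   update self pulls device results back, inside one data region. */
void double_with_host_edit(int *a, int n)
{
    #pragma acc data copyin(a[0:n])
    {
        a[0] = 100;                        /* change made on the host... */
        #pragma acc update device(a[0:n])  /* ...pushed H->D */
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] *= 2;
        #pragma acc update self(a[0:n])    /* results pulled D->H */
    }
}
```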
Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with everything asynchronous, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77
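A hedged C sketch of the same pattern plus the wait clause introduced on the next slide (function name ours; with the pragmas ignored, the loops simply run in order, which is also the dependency-respecting order):

```c
/* Sketch: two independent kernels on async queues 1 and 2; the third
   kernel depends on both, so it carries wait(1,2). */
void async_demo(float *a, float *b, float *c, int n)
{
    #pragma acc data copyout(a[0:n], b[0:n], c[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i) a[i] = 1.0f;

        #pragma acc parallel loop async(2)
        for (int i = 0; i < n; ++i) b[i] = 2.0f;

        /* queues 1 and 2 must complete before c is computed */
        #pragma acc parallel loop wait(1, 2)
        for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
    }
}
```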
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77
OpenACC 2.7 / OpenMP 5.0 translation table

Feature: GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes instructions redundantly) ↔ OpenMP TEAMS (create teams of threads, execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Feature: implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Feature: explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present() / delete() ↔ no direct equivalent listed

Pierre-François Lavallée, Thibaut Véry 71/77
Feature: data region
• OpenACC declare() — add data to the global data region ↔ no direct equivalent listed
• OpenACC data / end data — inside the same programming unit ↔ OpenMP TARGET DATA MAP(to/from:) / END TARGET DATA — inside the same programming unit
• OpenACC enter data / exit data — opening and closing anywhere in the code ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• OpenACC update self() — update D→H ↔ OpenMP UPDATE FROM — copy D→H inside a data region
• OpenACC update device() — update H→D ↔ OpenMP UPDATE TO — copy H→D inside a data region

Feature: loop parallelization
• OpenACC kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ OpenMP distribute — share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77
Feature: routines
• OpenACC routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Feature: asynchronism
• OpenACC async()/wait — execute kernels/transfers asynchronously, with explicit sync ↔ OpenMP nowait/depend — manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77
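As a hedged sketch of one row of this table in C (function names ours), the same loop offloaded both ways; a compiler without OpenACC/OpenMP offload support ignores both pragmas and runs the loops sequentially:

```c
/* Sketch: OpenACC `parallel loop` and its OpenMP
   `target teams distribute parallel for simd` counterpart. */
void scale_acc(double *v, int n)
{
    #pragma acc parallel loop gang vector copy(v[0:n])
    for (int i = 0; i < n; ++i) v[i] *= 2.0;
}

void scale_omp(double *v, int n)
{
    #pragma omp target teams distribute parallel for simd map(tofrom: v[0:n])
    for (int i = 0; i < n; ++i) v[i] *= 2.0;
}
```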
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77
Contacts

Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallelregion just before a loop In this case itis possible to merge both directivesinto a kernels loop or parallel loopThe clauses available for this constructare those of both constructs
For example with kernels directive
program fusedi n t e g e r a (1000) b (1000) i$ACC kerne ls ltclauses gt$ACC loop ltclauses gt
5 do i =1 1000a ( i ) = b ( i )lowast2
enddo$ACC end kerne ls
The cons t ruc t above i s equ iva len t to10 $ACC kerne ls loop ltkerne ls and loop clauses gt
do i =1 1000a ( i ) = b ( i )lowast2
enddoend program fused
exemplesfusedf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
Parallelism - ReductionsA variable assigned to a reduction is privatised for each element of the parallelism level of the loopAt the end of the region an operation is executed among those described in table 1 to get the finalvalue of the variable
Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran
Restrictions
The reduction variable must be a scalar of numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
The OpenACC 2.7 specification allows reductions on arrays, but the implementation is missing in PGI (for the moment).
Parallelism - Reductions example
!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90
Parallelism - Memory accesses and coalescing
Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop carrying vector parallelism should be the one with contiguous memory accesses.

(Diagrams: example with «coalescing» and example with «no coalescing».)
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
Parallelism - Memory accesses and coalescing

Diagrams (thread-to-element mapping for an 8-thread vector, T0..T7):
• Without coalescing, gang(i)/vector(j): for a fixed row i, consecutive threads of the vector access elements of different columns, so their memory accesses are strided (one leading dimension apart).
• With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements A(i,j), A(i+1,j), ... of the same column.
• With coalescing, gang-vector(i)/seq(j): consecutive threads also access consecutive elements for each value of j.
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it must be declared with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. Setting the level of parallelism inside the routine (seq, gang, worker, vector) is mandatory.
Without !$ACC routine:
program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it must be declared with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. Setting the level of parallelism inside the routine (seq, gang, worker, vector) is mandatory.
Correct version:
program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) have an associated local data region.

Global region: an implicit data region is opened for the lifetime of the program; it is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lasts for the lifetime of that procedure; to make data visible there, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which the data appear.
Data on device - Data clauses
H: host, D: device; "variable": scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already present on the device; if it is, no action is taken. You may still see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you must give the first and the last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel construct opens both a data region and a parallel region.)
Data on device - Parallel regions
program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• The compute region is reached 100 times.
• The data region is reached 100 times (entry and exit, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure; the variables listed in its clauses become visible on the device. Use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• The compute region is reached 100 times.
• The data region is reached 1 time (entry and exit, so potentially 2 data transfers).

Optimal?
Data on device - Local data regions

Optimal? No: you have to move the beginning of the data region to before the first loop.
Data on device
Naive version: two successive parallel regions, each with its own implicit data region. Variable A is copyin, B is copyout, C is copy and D is create on both regions, so every transfer and allocation happens twice.
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers: the same two parallel regions, with A (copyin), B (copyout), C (copy) and D (create) on each. Let's add a data region around them.
Data on device

With an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device

Since the data clauses check for presence, you can use the present clause on the inner parallel regions to make the code clearer. To make sure the copies are refreshed when needed, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables listed in self are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self

With update:

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

The a array is up to date on the host right after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables listed in a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that data transfers can be managed at a different place in the code than where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where the directive appears. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
Timeline (host vs device):
• At entry of the data region, copyin(a,c) create(b) copyout(d): a and c are copied H→D; b and d are allocated on the device without any transfer.
• During the region, the device modifies its copies; an update self(b) copies b back D→H, so the host sees the device value of b while the other host copies stay unchanged.
• At exit of the region, only d is copied back D→H (copyout); host variables that were modified on the device but never updated keep stale values.
Data on device - Important notes
• Data transfers between the host and the device are costly; it is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel constructs, if a data region opened with data (or enter data) encloses the parallel regions, you should use update to avoid unexpected behaviors.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, maximize the overlap between:
• computation and data transfers;
• kernels, when they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number lets you create several execution threads.
Diagrams: in the synchronous case, kernel 1, the data transfer and kernel 2 execute one after the other; with kernel 1 and the transfer made async, they overlap; with everything async, all three overlap.
! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels likely need the completion of others. In that case, add a wait(execution thread number) clause to the directive that has to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers identifying execution threads. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

Diagram: kernel 2 waits for the completion of the data transfer.
! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
Debugging - PGI autocompare
• To activate this feature, add autocompare to the target accelerator flags at compilation (-ta=tesla:cc60,autocompare).
• Example: code that has a race condition.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution increases because the code inside the parallel regions is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

The loops are now distributed among the gangs.
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 → OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) → TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) → TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) → does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar and non-scalar variables → scalar and non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host's perspective)
• copyin() (copy H→D when entering the construct) → map(to:)
• copyout() (copy D→H when leaving the construct) → map(from:)
• copy() (copy H→D at entry and D→H at exit) → map(tofrom:)
• create() (created on the device, no transfers) → map(alloc:)
• present(), delete()
Features: OpenACC 2.7 → OpenMP 5.0

Data regions
• declare (adds data to the global data region)
• data / end data (inside the same programming unit) → TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) → TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) → UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) → UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) → does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) → distribute (shares iterations among TEAMS)
Features: OpenACC 2.7 → OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (executes kernels and transfers asynchronously, with explicit synchronization) → nowait / depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers.
• Active threads: 10 × number of workers.
• Number of operations: 10 × nx.

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker.
• Active threads: 10 × vector length.
• Number of operations: 10 × nx.

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and the grid of threads shares the work.
• Active threads: 10 × workers × vector length.
• Number of operations: 10 × nx.
Work sharing - Loops
Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism
10 $acc p a r a l l e l $acc loop gangdo i =1 nx
A( i )=10 _8end do
15 $acc loop gang reduc t ion ( + somme)do i =nx1minus1
somme=somme+A( i )end do $acc end p a r a l l e l
exemplesloop_pb_syncf90
bull Result sum = 97845213 (Wrong value)
10 $acc p a r a l l e l $acc loop worker vec to rdo i =1 nx
A( i )=10 _8end do
15 $acc loop worker vec to r reduc t ion ( + somme)do i =nx1minus1
somme=somme+A( i )end do $acc end p a r a l l e l
exemplesloop_pb_sync_corrf90
bull Result sum = 100000000 (Right value)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallelregion just before a loop In this case itis possible to merge both directivesinto a kernels loop or parallel loopThe clauses available for this constructare those of both constructs
For example with kernels directive
program fusedi n t e g e r a (1000) b (1000) i$ACC kerne ls ltclauses gt$ACC loop ltclauses gt
5 do i =1 1000a ( i ) = b ( i )lowast2
enddo$ACC end kerne ls
The cons t ruc t above i s equ iva len t to10 $ACC kerne ls loop ltkerne ls and loop clauses gt
do i =1 1000a ( i ) = b ( i )lowast2
enddoend program fused
exemplesfusedf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
Parallelism - ReductionsA variable assigned to a reduction is privatised for each element of the parallelism level of the loopAt the end of the region an operation is executed among those described in table 1 to get the finalvalue of the variable
Operations
Operation Effect Language(s)+ Sum FortranC(++) Product FortranC(++)max Maximum FortranC(++)min Maximum FortranC(++)amp Bitwise and C(++)| Bitwise or C(++)ampamp Logical and C(++)|| Logical or C(++)
Operation Effect Language(s)iand Bitwise and Fortranior Bitwise ou Fortranieor Bitwise xor Fortranand Logical and Fortranor Logica or Fortran
Reduction operators
RestrictionsThe variable must be a scalar with numerical value C(char int float double _Complex) C++(charwchar_t int float double) Fortran(integer real double precision complex)In the OpenACC specification 27 reductions are possible on arrays but the implementation islacking in PGI (for the moment)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée Thibaut Véry 44 / 77
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with «Coalescing», example with «no Coalescing»]

Pierre-François Lavallée Thibaut Véry 45 / 77
Parallelism - Memory accesses and coalescing
$acc p a r a l l e l10 $acc loop gang
do i =1 nx $acc loop vec to rdo j =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop_nocoalescingf90
$acc p a r a l l e l20 $acc loop gang
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_825 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
$acc p a r a l l e l30 $acc loop gang vec to r
do i =1 nx $acc loop seqdo j =1 nx
A( i j )=114_835 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
Without Coalescing With CoalescingTps (ms) 439 16
The memory coalescing version is 27 times faster
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
Parallelism - Memory accesses and coalescing

[Figure, built up over four animation steps: memory-access diagrams for the three variants. Without coalescing — gang(i), vector(j): consecutive threads T0…T7 of a vector access memory locations that are far apart. With coalescing — gang(j), vector(i): consecutive threads T0…T7 access consecutive memory locations; gang-vector(i), seq(j): each thread walks its own row sequentially.]

Pierre-François Lavallée Thibaut Véry 47 / 77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée Thibaut Véry 48 / 77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée Thibaut Véry 48 / 77
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée Thibaut Véry 49 / 77
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D when entering the region; the memory is allocated when entering the region.
• copyout: the variable is copied D→H when leaving the region; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée Thibaut Véry 50 / 77
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée Thibaut Véry 51 / 77
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is taken to be the default of the language (C/C++: 0, Fortran: 1).

Pierre-François Lavallée Thibaut Véry 51 / 77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is nested inside the data region it implicitly opens.]

Pierre-François Lavallée Thibaut Véry 52 / 77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Pierre-François Lavallée Thibaut Véry 52 / 77
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal?

Pierre-François Lavallée Thibaut Véry 53 / 77
Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée Thibaut Véry 53 / 77
Data on device

[Diagram, built up over several steps: two parallel regions, each with its own implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create).]

Naive version: each parallel region transfers and allocates its variables independently.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region:
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self):
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée Thibaut Véry 54 / 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée Thibaut Véry 55 / 77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée Thibaut Véry 55 / 77
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée Thibaut Véry 56 / 77
Data on device - Global data regions: enter data, exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée Thibaut Véry 57 / 77
Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée Thibaut Véry 58 / 77
Data on device - Time line

[Figure: time line of a data region for variables a, b, c, d. At entry, copyin(a,c) and create(b) allocate and transfer the data to the device, and copyout(d) allocates d. During execution, update self(b) copies b D→H while the device keeps computing. At exit of the data region, d is copied back to the host (copyout).]

Pierre-François Lavallée Thibaut Véry 59 / 77
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée Thibaut Véry 60 / 77
Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines — fully synchronous (kernel 1, data transfer, kernel 2 in sequence); kernel 1 and transfer async; everything async.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée Thibaut Véry 61 / 77
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line — kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée Thibaut Véry 62 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée Thibaut Véry 63 / 77
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Pierre-François Lavallée Thibaut Véry 64 / 77
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée Thibaut Véry 65 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée Thibaut Véry 66 / 77
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée Thibaut Véry 67 / 77
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel loop
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among the gangs.

Pierre-François Lavallée Thibaut Véry 68 / 77
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)              +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée Thibaut Véry 69 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée Thibaut Véry 70 / 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• OpenACC PARALLEL — GR/WS/VS mode: each gang executes the instructions redundantly. OpenMP TEAMS — creates teams of threads which execute redundantly.
• OpenACC SERIAL — only 1 thread executes the code. OpenMP TARGET — GPU offloading, initial thread only.
• OpenACC KERNELS — offloading + automatic parallelization. Does not exist in OpenMP.

Implicit transfers H→D or D→H:
• Handled for scalar and non-scalar variables in both models.

Explicit transfers H→D or D→H inside explicit parallel regions:
• OpenACC — clauses on PARALLEL/SERIAL/KERNELS. OpenMP — clauses on TARGET (data movement defined from the host perspective).
• copyin(): copy H→D when entering the construct — OpenMP map(to:)
• copyout(): copy D→H when leaving the construct — OpenMP map(from:)
• copy(): copy H→D when entering and D→H when leaving — OpenMP map(tofrom:)
• create(): created on the device, no transfers — OpenMP map(alloc:)
• present(), delete()

Pierre-François Lavallée Thibaut Véry 71 / 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region:
• OpenACC declare() — adds data to the global data region.
• OpenACC data / end data — inside the same programming unit. OpenMP TARGET DATA MAP(to/from:) / END TARGET DATA — inside the same programming unit.
• OpenACC enter data / exit data — opening and closing anywhere in the code. OpenMP TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code.
• OpenACC update self() — update D→H. OpenMP UPDATE FROM — copy D→H inside a data region.
• OpenACC update device() — update H→D. OpenMP UPDATE TO — copy H→D inside a data region.

Loop parallelization:
• OpenACC kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops. Does not exist in OpenMP.
• OpenACC loop gang/worker/vector — shares iterations among gangs, workers and vectors. OpenMP distribute — shares iterations among TEAMS.

Pierre-François Lavallée Thibaut Véry 72 / 77
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Routines:
• OpenACC routine [seq|vector|worker|gang] — compiles a device routine with the specified level of parallelism.

Asynchronism:
• OpenACC async()/wait — executes kernels/transfers asynchronously, with explicit sync. OpenMP nowait/depend — manages asynchronism and task dependencies.

Pierre-François Lavallée Thibaut Véry 73 / 77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée Thibaut Véry 74 / 77
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée Thibaut Véry 75 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée Thibaut Véry 76 / 77
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée Thibaut Véry 77 / 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Work sharing - Loops
Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism
10 $acc p a r a l l e l $acc loop gangdo i =1 nx
A( i )=10 _8end do
15 $acc loop gang reduc t ion ( + somme)do i =nx1minus1
somme=somme+A( i )end do $acc end p a r a l l e l
exemplesloop_pb_syncf90
bull Result sum = 97845213 (Wrong value)
10 $acc p a r a l l e l $acc loop worker vec to rdo i =1 nx
A( i )=10 _8end do
15 $acc loop worker vec to r reduc t ion ( + somme)do i =nx1minus1
somme=somme+A( i )end do $acc end p a r a l l e l
exemplesloop_pb_sync_corrf90
bull Result sum = 100000000 (Right value)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77
Parallelism - Merging kernelsparallel directives and loop
It is quite common to open a parallelregion just before a loop In this case itis possible to merge both directivesinto a kernels loop or parallel loopThe clauses available for this constructare those of both constructs
For example with kernels directive
program fusedi n t e g e r a (1000) b (1000) i$ACC kerne ls ltclauses gt$ACC loop ltclauses gt
5 do i =1 1000a ( i ) = b ( i )lowast2
enddo$ACC end kerne ls
The cons t ruc t above i s equ iva len t to10 $ACC kerne ls loop ltkerne ls and loop clauses gt
do i =1 1000a ( i ) = b ( i )lowast2
enddoend program fused
exemplesfusedf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
Parallelism - ReductionsA variable assigned to a reduction is privatised for each element of the parallelism level of the loopAt the end of the region an operation is executed among those described in table 1 to get the finalvalue of the variable
Operations
Operation Effect Language(s)+ Sum FortranC(++) Product FortranC(++)max Maximum FortranC(++)min Maximum FortranC(++)amp Bitwise and C(++)| Bitwise or C(++)ampamp Logical and C(++)|| Logical or C(++)
Operation Effect Language(s)iand Bitwise and Fortranior Bitwise ou Fortranieor Bitwise xor Fortranand Logical and Fortranor Logica or Fortran
Reduction operators
RestrictionsThe variable must be a scalar with numerical value C(char int float double _Complex) C++(charwchar_t int float double) Fortran(integer real double precision complex)In the OpenACC specification 27 reductions are possible on arrays but the implementation islacking in PGI (for the moment)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
Parallelism - Reductions example
$ACC p a r a l l e ldo g=1 generat ions$ACC loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddodo r =1 rows
40 do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) then45 world ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
50 enddoc e l l s = 0$ACC loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo60 $ACC end p a r a l l e l
exemplesgol_parallel_loop_reductionf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77
Parallelism - Memory accesses and coalescing
Principlesbull Contiguous access to memory by the threads of a worker can be merged This optimizes the
use of memory bandwidthbull This happens if thread i reaches memory location n and thread i + 1 reaches memory
location n + 1 and so onbull For loop nests the loop which has vector parallelism should have contiguous access to
memory
Example with laquoCoalescingraquo Example with laquonoCoalescingraquo
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77
Parallelism - Memory accesses and coalescing
!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90
            Without coalescing   With coalescing
Time (ms)          439                  16

The memory-coalescing version is about 27 times faster.
Parallelism - Memory accesses and coalescing

[Diagrams: memory-access patterns of threads T0..T7 for the three mappings. Without coalescing, gang(i)/vector(j): consecutive threads of a vector touch non-contiguous memory locations. With coalescing, gang(j)/vector(i) and gang-vector(i)/seq(j): consecutive threads touch consecutive locations, so the accesses are merged.]

Pierre-François Lavallée, Thibaut Véry 47 / 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it must be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level used inside the routine (seq, gang, worker or vector).
Without !$ACC routine:
program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines
Correct version, with the routine declared inside the subroutine:
program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: the offloading regions (serial, parallel, kernels) have a local data region associated with them.

Global region: an implicit data region is open for the lifetime of the program; it is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lasts for the lifetime of the procedure. To make data visible there, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which the data appear.
Data on device - Data clauses
H: Host, D: Device; "variable" means a scalar or an array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default these clauses check whether the variable is already present on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you give the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/array_shape.f90

C/C++
The array shape is specified in square brackets; you give the first index and the number of elements.

/* We copy the array a by giving the first element and the size of the array */
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
/* Copy columns 99 to 198 */
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/array_shape.c
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to their execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in an implicit data region.
Data on device - Parallel regions
program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• The compute region is reached 100 times.
• The data region is entered and exited 100 times (so potentially 200 data transfers).
Not optimal.
Data on device - Local data regions
It is possible to open a data region inside a procedure; the variables listed in its clauses become visible on the device. Use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• The compute region is reached 100 times.
• The data region is entered once (so potentially only 2 data transfers).
Optimal?
Data on device - Local data regions

Optimal? No: the beginning of the data region has to be moved before the first loop, so that the initial transfer is covered by it as well.
Data on device
Naive version: two successive parallel regions, each with its own implicit data region carrying copyin(A), copyout(B), copy(C) and create(D). Every clause acts at both regions:
• Transfers: 8
• Allocations: 2
Data on device
Optimizing data transfers: the two parallel regions still carry copyin(A), copyout(B), copy(C) and create(D). Let's add a data region around them.
Data on device
With the clauses moved to the enclosing data region, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once:
• Transfers: 4
• Allocation: 1
Data on device
Since the data clauses check for presence anyway, the inner compute regions can mark A, B, C and D with the present clause to make the code clearer. Where a fresh value is needed, insert an explicit update (update device for A, update self for C):
• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region, the update directive must be used to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self

With update (exemples/update_dir.f90): the same code, with the update directive active:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

The a array is now updated on the host as soon as the update self directive executes.
Data on device - update device
Inside a data region, the update directive must be used to transfer data; it cannot be used inside a parallel region.

device: the variables listed in a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the programming unit where it appears. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Diagram: host and device time lines for a region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D and b is allocated on the device. The device then updates its copies (b=26, d=158) while the host values evolve independently; update self(b) copies b back D→H during the region. On exit, only d is copied back (copyout); the other device values are discarded.]
Data on device - Important notes
• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since data clauses on kernels and parallel only check for presence, if an enclosing data region (opened with data, or with enter data) already holds the variables, you should use update to avoid unexpected behavior.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, maximize the overlap between:
• computation and data transfers;
• kernels, when they are independent.

Asynchronism is activated by adding the async(execution-thread-number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Diagram: three time lines. Fully synchronous: kernel 1, the data transfer and kernel 2 run one after the other. Kernel 1 and the transfer async: they overlap. Everything async: all three overlap.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, an asynchronous kernel often needs the completion of other ones. In that case, add a wait(execution-thread-number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the data transfer to complete before starting; kernel 1 runs concurrently.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
3 GPU Debugging
Debugging - PGI autocompare
• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
Debugging - PGI autocompare
• To activate this feature, add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition: [figure not reproduced in the transcript].
4 Annex: Optimization example
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ! ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution increases because the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs:

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

The loops are now distributed among the gangs.
Optimization - Game of Life

Let's optimize the data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• PARALLEL (OpenACC): gang-redundant mode, each gang executes the instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly.
• SERIAL (OpenACC): only one thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only.
• KERNELS (OpenACC): offloading + automatic parallelization ↔ no OpenMP equivalent.

Implicit transfers H→D or D→H
• Scalar variables and non-scalar variables, in both models.

Explicit transfers H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D on entry and D→H on exit ↔ map(tofrom:)
• create(): allocated on the device, no transfer ↔ map(alloc:)
• present(), delete(): no OpenMP counterpart listed here.

Data regions
• declare: adds data to the global data region.
• data / end data: within the same programming unit ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA, within the same programming unit.
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA, anywhere in the code.
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region.
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region.

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of loops ↔ no OpenMP equivalent.
• loop gang/worker/vector: shares iterations among gangs, workers and vectors ↔ distribute: shares iterations among TEAMS.

Routines
• routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism.

Asynchronism
• async()/wait: executes kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend: manages asynchronism and task dependencies.
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
6 Contacts
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, both directives can be merged into a single kernels loop or parallel loop construct. The clauses available on the combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in Table 1 is executed to obtain the final value of the variable.

Operations
Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators (Table 1)
Parallelism - Reductions example
$ACC p a r a l l e ldo g=1 generat ions$ACC loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddodo r =1 rows
40 do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) then45 world ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i fenddo
50 enddoc e l l s = 0$ACC loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo60 $ACC end p a r a l l e l
exemplesgol_parallel_loop_reductionf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77
Parallelism - Memory accesses and coalescing
Principlesbull Contiguous access to memory by the threads of a worker can be merged This optimizes the
use of memory bandwidthbull This happens if thread i reaches memory location n and thread i + 1 reaches memory
location n + 1 and so onbull For loop nests the loop which has vector parallelism should have contiguous access to
memory
Example with laquoCoalescingraquo Example with laquonoCoalescingraquo
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77
Parallelism - Memory accesses and coalescing
$acc p a r a l l e l10 $acc loop gang
do i =1 nx $acc loop vec to rdo j =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop_nocoalescingf90
$acc p a r a l l e l20 $acc loop gang
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_825 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
$acc p a r a l l e l30 $acc loop gang vec to r
do i =1 nx $acc loop seqdo j =1 nx
A( i j )=114_835 end do
end do $acc end p a r a l l e l
exemplesloop_coalescingf90
Without Coalescing With CoalescingTps (ms) 439 16
The memory coalescing version is 27 times faster
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)
With coalescing
gang(j)vector(i)
gang-vector(i)seq(j)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)
With coalescing
gang(j)vector(i)
gang-vector(i)seq(j)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
With coalescing
gang(j)vector(i)T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
T1
T1
X
X
T2
T2
X
X
T3
T3
X
X
T4
T4
X
X
T5
T5
X
X
T6
T6
X
X
T7
T7
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
X
X
T1
T1
X
X
X
X
T2
T2
X
X
X
X
T3
T3
X
X
X
X
T4
T4
X
X
X
X
T5
T5
X
X
X
X
T6
T6
X
X
X
X
T7
T7
X
X
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1:10,1)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1:10,1)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated to programming unit lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
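Two of the region kinds above can be sketched in C; the function names are hypothetical, and with a non-OpenACC compiler the pragmas are simply ignored and the code runs sequentially on the host:

```c
#include <stdlib.h>

/* Local (structured) data region: opens and closes in one code block. */
void structured(size_t n, float *a) {
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            a[i] += 1.0f;
    }
}

/* Global (unstructured) region: opened here with enter data... */
float *unstructured_alloc(size_t n) {
    float *a = calloc(n, sizeof *a);
    #pragma acc enter data create(a[0:n])
    return a;
}

/* ...and closed elsewhere with exit data. */
void unstructured_free(size_t n, float *a) {
    #pragma acc exit data delete(a[0:n])
    free(a);
}
```

The structured form suits a single block of code; the enter data / exit data pair suits allocation and deallocation split across functions, which is the C++ constructor/destructor use case mentioned later in the course.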
Data on device - Data clauses

H: Host, D: Device, variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Pierre-François Lavallée, Thibaut Véry 50/77
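A minimal C sketch of the movement clauses above (the function name is hypothetical): x is only read on the device, so copyin is enough; y is read and written back, so it needs copy. With a non-OpenACC compiler the pragma is ignored and the loop runs on the host with the same result.

```c
#include <stddef.h>

/* y[i] = a*x[i] + y[i]
 * copyin(x[0:n]): x is allocated on the device and copied H->D at entry
 * copy(y[0:n]):   y is copied H->D at entry and D->H at exit           */
void saxpy(size_t n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Had y been write-only on the device, copyout would have saved the H→D transfer; a device-only scratch array would use create.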
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it takes the default of the language (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77
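The C/C++ restriction can be illustrated with a short sketch (hypothetical function name): for a malloc'd pointer the compiler cannot infer the extent, so the element count in the clause is mandatory.

```c
#include <stdlib.h>

/* Returns a heap array of n elements set to v.
 * a[0:n] is required: with a bare `copyout(a)` the compiler would only
 * see a pointer, not the n elements behind it.                        */
float *make_filled(size_t n, float v) {
    float *a = malloc(n * sizeof *a);
    #pragma acc parallel loop copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = v;
    return a;
}
```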
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region carries its own implicit data region.

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you still have to move the beginning of the data region before the first loop, since the first parallel region keeps its own copyout of a.

Pierre-François Lavallée, Thibaut Véry 53/77
Data on device

[Diagram: two successive parallel regions, each with its implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create)]

• Naive version: each parallel region performs its own transfers and allocations: 8 transfers, 2 allocations.
• Optimize data transfers: enclose both parallel regions in a data region carrying the copyin/copyout/copy/create clauses. A, B and C are now transferred at entry and exit of the data region: 4 transfers, 1 allocation.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self): 6 transfers, 1 allocation.

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90 (exemples/update_dir_comment.f90 for the version without update)

Without the update directive, the a array is not initialized on the host before the end of the data region. With it, the a array is up to date on the host right after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
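A hedged C sketch combining update device with update self inside one data region (the function name is hypothetical): the host fills the array, pushes it H→D, the device increments it, and update self pulls the result back before the region closes. With a host-only compiler the pragmas are no-ops and the result is identical.

```c
#include <stddef.h>

void demo_update(size_t n, float *b) {
    #pragma acc data create(b[0:n])      /* device memory, no initial copy */
    {
        for (size_t i = 0; i < n; ++i)   /* host initializes its copy      */
            b[i] = (float)i;

        #pragma acc update device(b[0:n])  /* push host values H->D        */

        #pragma acc parallel loop          /* device works on its copy     */
        for (size_t i = 0; i < n; ++i)
            b[i] += 1.0f;

        #pragma acc update self(b[0:n])    /* pull device values D->H      */
    }
}
```

Without the update device, the device would compute on uninitialized memory; without the update self, the host would never see the result, since create performs no transfer at exit.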
Data on device - Global data regions: enter data / exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77
Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77
Data on device - Time line

[Diagram: timeline of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d)]

• Entering the region: a and c are copied H→D (copyin); b and d are allocated on the device without transfer (create, copyout).
• While the kernels update the device copies (e.g. b=26, d=158), the host copies keep their previous values.
• update self(b) copies the device value of b back to the host in the middle of the region.
• Leaving the region: only d is copied D→H (copyout); host-side writes made meanwhile are never pushed to the device unless update device is used.

Pierre-François Lavallée, Thibaut Véry 59/77
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77
Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronously, Kernel 1, the data transfer and Kernel 2 run back to back; with async the transfer overlaps Kernel 1, and if everything is async Kernel 2 overlaps as well]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77
Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer of b to complete while Kernel 1 runs asynchronously]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65/77
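A typical pattern that autocompare flags is a reduction race: on the GPU every thread updates the same scalar concurrently, so GPU and CPU results diverge. A hedged C sketch (function names hypothetical; with a host-only compiler both versions run sequentially and agree):

```c
#include <stddef.h>

/* BUG on the GPU: all vector lanes update s concurrently.
 * Compiled with -ta=tesla:...,autocompare, the mismatch between the
 * GPU and CPU results would stop the execution.                     */
float racy_sum(size_t n, const float *x) {
    float s = 0.0f;
    #pragma acc parallel loop          /* missing: reduction(+:s) */
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Fixed version: s is privatised per thread and combined at the end. */
float correct_sum(size_t n, const float *x) {
    float s = 0.0f;
    #pragma acc parallel loop reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}
```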
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading:
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads which execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (data movement defined from the host perspective):
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77
Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77
Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
Parallelism - Reductions

A variable assigned to a reduction is privatised for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions: the variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-François Lavallée, Thibaut Véry 43/77
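A minimal C sketch using one operator from the table (the function name is hypothetical): with an OpenACC compiler, m is privatised per thread and the private copies are combined with max at the end of the loop; with any other C compiler the pragma is ignored and the loop runs sequentially with the same result.

```c
#include <stddef.h>

/* Largest element of x[0..n-1]; m is the reduction variable. */
float max_val(size_t n, const float *x) {
    float m = x[0];
    #pragma acc parallel loop reduction(max:m)
    for (size_t i = 1; i < n; ++i)
        if (x[i] > m)
            m = x[i];
    return m;
}
```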
Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77
Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with «coalescing», example without «coalescing»]

Pierre-François Lavallée, Thibaut Véry 45/77
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)
With coalescing
gang(j)vector(i)
gang-vector(i)seq(j)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)
With coalescing
gang(j)vector(i)
gang-vector(i)seq(j)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
With coalescing
gang(j)vector(i)T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
T1
T1
X
X
T2
T2
X
X
T3
T3
X
X
T4
T4
X
X
T5
T5
X
X
T6
T6
X
X
T7
T7
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
X
X
T1
T1
X
X
X
X
T2
T2
X
X
X
X
T3
T3
X
X
X
X
T4
T4
X
X
X
X
T5
T5
X
X
X
X
T6
T6
X
X
X
X
T7
T7
X
X
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Without ACC routine
program r o u t i n ei n t e g e r s = 10000
3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s
8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins
13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s
18 ar ray ( i j ) = 2enddo
end subrou t ine f i l lend program r o u t i n e
exemplesroutine_wrongf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version
program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )
4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s
c a l l f i l l ( a r ray ( ) s i )9 enddo
$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )
14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s
19 ar ray ( i j ) = 2enddo
end subrou t ine f i l lend program r o u t i n e
exemplesroutinef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on devices by opening different kinds of dataregions
Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels
Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives
Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block
Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive
NotesThe actions taken for the data inside these regions depend on the clause in which they appear
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H Host D Device variable scalar or array
Data movementbull copyin The variable is copied HrarrD the
memory is allocated when entering theregion
bull copyout The variable is copied DrarrH thememory is allocated when entering theregion
bull copy copyin + copyout
No data movementbull create The memory is allocated when
entering the regionbull present The variable is already on the
devicebull delete Frees the memory allocated on the
device for this variable
By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility
Other clausesbull no_createbull deviceptr
bull attachbull detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
Data on device - Shape of arrays
To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198 included
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

(exemples/arrayshape.c)
Data on device - Shape of arrays
Restrictions
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
- The shape must be specified when using a slice.
- If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

(exemples/parallel_data.f90)

The parallel region implicitly opens a data region around it.
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

(exemples/parallel_data_multi.f90)

Notes
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.
Data on device - Local data regions
It is possible to open a data region inside a procedure; by doing so, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

(exemples/parallel_data_single.f90)

Notes
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
Naive version (figure): two successive parallel regions, each with its own implicit data region. In each region the arrays A, B, C and D appear in copyin, copyout, copy and create clauses respectively, so every transfer and allocation is performed twice.
- Transfers: 8
- Allocations: 2
Optimize data transfers (figure): the two parallel regions still perform their own copyin, copyout, copy and create for A, B, C and D. Let's add a data region.
(Figure) With a data region enclosing the two parallel regions, A, B and C are now transferred only at entry and exit of the data region.
- Transfers: 4
- Allocations: 1
(Figure) The data clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are actually done, use the update directive (update device for H→D, update self for D→H).
- Transfers: 6
- Allocations: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

(exemples/update_dir_comment.f90)

Without update: the a array is not initialized on the host before the end of the data region.
With update (exemples/update_dir.f90): the same code with the three update lines active:

print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)

The a array is initialized on the host right after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

(exemples/enter_data.f90)
Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: module; scope: the program
Zone: function; scope: the function
Zone: subroutine; scope: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

(exemples/declare.f90)
Data on device - Time line
(Figure) Timeline for a region opened with copyin(a,c) create(b) copyout(d):
- Host, before the region: a=0, b=0, c=0, d=0; then a=10, b=-4, c=42, d=0.
- Entering the region: a and c are copied H→D; b and d are only allocated on the device (their device values are undefined).
- A kernel runs: on the device a=10, b=26, c=42, d=158; the host copy is unchanged.
- update self(b) copies b D→H: the host now has b=26.
- Later kernels may change the device values (a=42, b=42, c=42, d=158) without affecting the host.
- Exiting the region: only d is copied D→H (copyout), so the host ends with a=10, b=26, c=42, d=158.
Data on device - Important notes
- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

(exemples/update.f90)
Parallelism - Asynchronism
By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is nevertheless able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.

(Figure: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three run concurrently.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

(exemples/async_slides.f90)
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels will likely need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
- The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

(Figure: Kernel 2 waits for the data transfer of b to complete.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

(exemples/async_wait_slides.f90)
Debugging - PGI autocompare
- Automatic detection of result differences between GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
- To activate this feature, just add autocompare to the target options at compilation (-ta=tesla:cc60,autocompare).
- Example of a code that has a race condition.
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

(exemples/optigol_opti1.f90)

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

(exemples/optigol_opti2.f90)

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among the gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

(exemples/optigol_opti3.f90)

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
- OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) vs OpenMP TEAMS (create teams of threads that execute redundantly)
- OpenACC SERIAL (only 1 thread executes the code) vs OpenMP TARGET (GPU offloading, initial thread only)
- OpenACC KERNELS (offloading + automatic parallelization) vs OpenMP: does not exist

Implicit transfer H→D or D→H
- OpenACC: scalar and non-scalar variables vs OpenMP: scalar and non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (data movement defined from the host perspective)
- OpenACC: clauses on PARALLEL/SERIAL/KERNELS vs OpenMP: clauses on TARGET
- copyin() (copy H→D when entering the construct) vs map(to:)
- copyout() (copy D→H when leaving the construct) vs map(from:)
- copy() (copy H→D when entering and D→H when leaving) vs map(tofrom:)
- create() (created on device, no transfers) vs map(alloc:)
- present(), delete()
OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data region
- OpenACC declare() (add data to the global data region) vs OpenMP: no equivalent listed
- OpenACC data / end data (inside the same programming unit) vs OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
- OpenACC enter data / exit data (opening and closing anywhere in the code) vs OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- OpenACC update self() (update D→H) vs OpenMP UPDATE FROM (copy D→H inside a data region)
- OpenACC update device() (update H→D) vs OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
- OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) vs OpenMP: does not exist
- OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) vs OpenMP distribute (share iterations among TEAMS)
OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Routines
- OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
- OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) vs OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)
Contacts
- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Data on device - Opening a data region
There are several ways of making data visible on devices by opening different kinds of dataregions
Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels
Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives
Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block
Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive
NotesThe actions taken for the data inside these regions depend on the clause in which they appear
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H Host D Device variable scalar or array
Data movementbull copyin The variable is copied HrarrD the
memory is allocated when entering theregion
bull copyout The variable is copied DrarrH thememory is allocated when entering theregion
bull copy copyin + copyout
No data movementbull create The memory is allocated when
entering the regionbull present The variable is already on the
devicebull delete Frees the memory allocated on the
device for this variable
By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility
Other clausesbull no_createbull deviceptr
bull attachbull detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
FortranThe array shape is to be specified inparentheses You must specify the first and lastindex
5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000
do j =1 100010 a ( i j ) = 0
enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )
15do i =1 1000do j =100199
a ( i j ) = 42enddo
enddo
exemplesarrayshapef90
CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements
5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )
f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0
Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )
f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42
exemplesarrayshapec
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified
Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0
Fortran1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)i n t e g e r i l
5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = i10 enddo
$ACC end p a r a l l e lend program para
exemplesparallel_dataf90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)
3 i n t e g e r i l
$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
8 a ( i ) = ienddo$ACC end p a r a l l e l
do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )
$ACC loopdo i =1 1000
a ( i ) = a ( i ) + 1enddo
18 $ACC end p a r a l l e lenddo
end program para
exemplesparallel_data_multif90
Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and
exit so potentially 200 data transfers)
Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device

Naive version: two successive parallel regions (each with its own implicit data region), both using arrays A (copyin), B (copyout), C (copy) and D (create).
- Transfers: 8
- Allocations: 2
Data on device

Optimize data transfers: with the clauses copyin (A), copyout (B), copy (C) and create (D) repeated on each of the two parallel regions, every region pays its own transfers. Let's add a data region around both.
Data on device

Optimize data transfers: with a data region enclosing the two parallel regions, A, B and C are now transferred at entry and exit of the data region only.
- Transfers: 4
- Allocations: 1
Data on device

The data clauses check for data presence, so to make the code clearer you can use the present clause on the enclosed parallel regions. To make sure the updates are actually done, use the update directive (update device for H→D, update self for D→H).
- Transfers: 6
- Allocations: 1
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

With update: same example, with the update lines active this time:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

exemples/update_dir.f90

The a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
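A C analogue of this pattern (hypothetical names; the point is that allocation, computation and deallocation happen in three different places). Unlike the slide's Fortran example, which deletes the device copy, this sketch copies the result back so it can be checked on the host:

```c
#include <stdlib.h>

enum { S = 10000 };
static double *vec;

/* The kernel only asserts presence: the data was put on the device
   elsewhere, by the enter data directive below. */
static void compute(void)
{
    #pragma acc parallel loop present(vec[0:S])
    for (int i = 0; i < S; ++i)
        vec[i] = 1.2;
}

double run(void)
{
    vec = malloc(S * sizeof *vec);
    #pragma acc enter data create(vec[0:S])

    compute();

    /* Unstructured exit: copy the result back and free device memory. */
    #pragma acc exit data copyout(vec[0:S])
    double first = vec[0];
    free(vec);
    return first;
}
```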
Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Diagram: host and device copies of a, b, c, d along a data region opened with copyin(a,c) create(b) copyout(d)]
- Data transfer to the device: a and c are copied H→D (a=10, c=42); b and d are allocated on the device but not initialized.
- Kernels update the device copies (b=26, d=158) while the host copies keep their old values.
- update self(b): only b is copied back D→H; the other host copies are unchanged.
- Exit of the data region: d is copied back D→H (copyout(d)).
Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
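A small C illustration of update inside a data region (the function name is hypothetical; with the pragmas ignored the code runs sequentially and the host/device distinction disappears, but the directive placement shows the intent):

```c
#define N 1000

/* Returns the value of a[42] observed *inside* the data region:
   without the 'update self' it would be stale on a real device,
   since copyout only happens when the region ends. */
int fill_and_probe(int *a)
{
    int probe;

    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 2 * i;

        /* D->H: refresh one element of the host copy. */
        #pragma acc update self(a[42:1])
        probe = a[42];

        /* Host-side fix-up, pushed back H->D so the final copyout
           does not overwrite it with the old device value. */
        a[42] = -1;
        #pragma acc update device(a[42:1])
    }
    return probe;
}
```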
Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is nevertheless able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: "Kernel 1", "Data transfer" and "Kernel 2" run one after the other by default; with "Kernel 1 and transfer async" the first kernel and the transfer overlap; with everything async all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
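The same three-queue example can be sketched in C (assuming a, b and c are already present on the device, e.g. through an enclosing enter data; the function name is hypothetical):

```c
#define N 100000

void triple_async(float *a, float *b, float *c)
{
    /* b is initialized on the host */
    for (int i = 0; i < N; ++i)
        b[i] = (float)i;

    /* Three independent operations on three execution threads:
       the two kernels and the transfer may overlap on the device. */
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42.0f;

    #pragma acc update device(b[0:N]) async(2)

    #pragma acc parallel loop async(3)
    for (int i = 0; i < N; ++i)
        c[i] = 11.0f;

    /* Block until every queue has completed. */
    #pragma acc wait
}
```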
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
- The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1, Data transfer, Kernel 2; Kernel 2 waits for the transfer.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
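And the wait variant in C: the transfer on queue 2 must finish before the last kernel may read b (hypothetical function name; sequential semantics when the pragmas are ignored):

```c
#define N 100000

void async_with_wait(float *a, float *b)
{
    /* b is initialized on the host */
    for (int i = 0; i < N; ++i)
        b[i] = (float)i;

    /* Independent of the transfer below: free to overlap. */
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42.0f;

    #pragma acc update device(b[0:N]) async(2)

    /* wait(2): this kernel must not start before execution
       thread 2 (the transfer of b) has completed. */
    #pragma acc parallel loop wait(2)
    for (int i = 0; i < N; ++i)
        b[i] = b[i] + 11.0f;

    #pragma acc wait
}
```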
Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition. [screenshot]
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/optigol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
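The final structure (one data region around the generation loop, work-shared loops, and a reduction for the population count) can be condensed into a C sketch. The 64x64 grid, the absence of wrap-around and the function name are simplifications introduced here, not taken from the slides:

```c
#define ROWS 64
#define COLS 64

/* Optimized Game of Life structure: the arrays stay on the device
   for all generations; only interior cells are updated. */
int life(int world[ROWS][COLS], int generations)
{
    static int oworld[ROWS][COLS];
    int cells = 0;

    #pragma acc data copy(world[0:ROWS][0:COLS], oworld)
    for (int g = 0; g < generations; ++g) {
        cells = 0;

        #pragma acc parallel loop
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                oworld[r][c] = world[r][c];

        #pragma acc parallel loop
        for (int r = 1; r < ROWS - 1; ++r)
            for (int c = 1; c < COLS - 1; ++c) {
                int neigh = oworld[r-1][c-1] + oworld[r][c-1] + oworld[r+1][c-1]
                          + oworld[r-1][c]                    + oworld[r+1][c]
                          + oworld[r-1][c+1] + oworld[r][c+1] + oworld[r+1][c+1];
                if (oworld[r][c] == 1 && (neigh < 2 || neigh > 3))
                    world[r][c] = 0;
                else if (neigh == 3)
                    world[r][c] = 1;
            }

        #pragma acc parallel loop reduction(+:cells)
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                cells += world[r][c];
    }
    return cells;
}
```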
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading:
- PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) | TEAMS (creates teams of threads that execute redundantly)
- SERIAL (only 1 thread executes the code) | TARGET (GPU offloading, initial thread only)
- KERNELS (offloading + automatic parallelization) | does not exist in OpenMP

Implicit transfers H→D or D→H:
- scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
- copyin() (copy H→D when entering the construct) | map(to:)
- copyout() (copy D→H when leaving the construct) | map(from:)
- copy() (copy H→D when entering and D→H when leaving) | map(tofrom:)
- create() (created on the device, no transfers) | map(alloc:)
- present(), delete()
Features | OpenACC 2.7 | OpenMP 5.0

Data regions:
- declare() (adds data to the global data region)
- data / end data (inside the same programming unit) | TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
- enter data / exit data (opening and closing anywhere in the code) | TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
- update self() (update D→H) | UPDATE FROM (copy D→H inside a data region)
- update device() (update H→D) | UPDATE TO (copy H→D inside a data region)

Loop parallelization:
- kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) | does not exist in OpenMP
- loop gang/worker/vector (share iterations among gangs, workers and vectors) | distribute (share iterations among TEAMS)
Features | OpenACC 2.7 | OpenMP 5.0

Routines:
- routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
- async() / wait (execute kernels and transfers asynchronously, explicit synchronization) | nowait / depend (manage asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 114_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 114_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 114_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 114_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

[Figure: example with coalescing / example without coalescing]
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.
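The same experiment transposed to C. Careful: C arrays are row-major (the opposite of Fortran), so here it is A[i][j] and A[i][j+1] that are contiguous, and the vector loop should run over the last index. Function names are illustrative:

```c
#define NX 512

float A[NX][NX];

/* Coalesced in C: consecutive vector threads access consecutive j,
   i.e. contiguous memory locations. */
void coalesced(void)
{
    #pragma acc parallel
    {
        #pragma acc loop gang
        for (int i = 0; i < NX; ++i) {
            #pragma acc loop vector
            for (int j = 0; j < NX; ++j)
                A[i][j] = 11.4f;
        }
    }
}

/* Not coalesced: consecutive vector threads access elements one
   full row (NX floats) apart. */
void not_coalesced(void)
{
    #pragma acc parallel
    {
        #pragma acc loop gang
        for (int j = 0; j < NX; ++j) {
            #pragma acc loop vector
            for (int i = 0; i < NX; ++i)
                A[i][j] = 11.4f;
        }
    }
}
```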
Parallelism - Memory accesses and coalescing

[Diagram, shown in several build steps: access pattern of threads T0..T7 on a 2D Fortran array]
- Without coalescing, gang(i)/vector(j): consecutive vector threads access elements of the same row, i.e. memory locations separated by a stride; the accesses cannot be merged.
- With coalescing, gang(j)/vector(i): consecutive vector threads access consecutive elements of a column, i.e. contiguous memory; the accesses are merged.
- gang-vector(i)/seq(j): each thread walks its row sequentially while consecutive threads cover contiguous elements; the accesses are also merged.
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
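The routine directive looks like this in C (the doubled/fill names are hypothetical; routine seq requests a sequential device version of the callee):

```c
#define N 100

/* Called from device code: the compiler must also generate a
   device version of this function, here a sequential one. */
#pragma acc routine seq
static int doubled(int v)
{
    return 2 * v;
}

void fill(int *out)
{
    #pragma acc parallel loop copyout(out[0:N])
    for (int i = 0; i < N; ++i)
        out[i] = doubled(i);
}
```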
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading:
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime:
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clauses in which they appear.
Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region.
- copyout: the variable is copied D→H; the memory is allocated when entering the region.
- copy: copyin + copyout.

No data movement:
- create: the memory is allocated when entering the region.
- present: the variable is already on the device.
- delete: frees the memory allocated on the device for this variable.

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p (for OpenACC 2.0 compatibility).

Other clauses: no_create, deviceptr, attach, detach.
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
- The shape must be specified when using a slice.
- If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)i n t e g e r i l
5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = i10 enddo
$ACC end p a r a l l e lend program para
exemplesparallel_dataf90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)
3 i n t e g e r i l
$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
8 a ( i ) = ienddo$ACC end p a r a l l e l
do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )
$ACC loopdo i =1 1000
a ( i ) = a ( i ) + 1enddo
18 $ACC end p a r a l l e lenddo
end program para
exemplesparallel_data_multif90
Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and
exit so potentially 200 data transfers)
Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel constructs, if an enclosing data region is opened with data, you should use update to avoid unexpected behavior. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Diagram: executed synchronously, Kernel 1, the data transfer and Kernel 2 run back to back; with "Kernel 1 and transfer async" the first two overlap; with everything async, all three overlap.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits until completion of execution threads 1 and 2.

(Diagram: Kernel 2 waits for the transfer to complete before starting.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65/77
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77
| Feature | OpenACC 2.7 | OpenMP 5.0 |
|---|---|---|
| GPU offloading | PARALLEL: GR/WS/VS mode, each gang executes instructions redundantly | TEAMS: create teams of threads that execute redundantly |
| | SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only |
| | KERNELS: offloading + automatic parallelization | X: does not exist in OpenMP |
| Implicit transfer H→D or D→H | scalar variables, non-scalar variables | scalar variables, non-scalar variables |
| Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | clause on TARGET (data movement defined from the host perspective) |
| | copyin(): copy H→D when entering the construct | map(to:) |
| | copyout(): copy D→H when leaving the construct | map(from:) |
| | copy(): copy H→D when entering and D→H when leaving | map(tofrom:) |
| | create(): created on device, no transfers | map(alloc:) |
| | present(), delete() | |

Pierre-François Lavallée, Thibaut Véry 71/77
| Feature | OpenACC 2.7 | OpenMP 5.0 |
|---|---|---|
| Data region | declare(): add data to the global data region | |
| | data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit |
| | enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code |
| | update self(): update D→H | UPDATE FROM: copy D→H inside a data region |
| | update device(): update H→D | UPDATE TO: copy H→D inside a data region |
| Loop parallelization | kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of loops | X: does not exist in OpenMP |
| | loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS |

Pierre-François Lavallée, Thibaut Véry 72/77
| Feature | OpenACC 2.7 | OpenMP 5.0 |
|---|---|---|
| Routines | routine [seq, vector, worker, gang]: compile a device routine with the specified level of parallelism | |
| Asynchronism | async()/wait: execute kernels/transfers asynchronously, with explicit synchronization | nowait/depend: manage asynchronism and task dependencies |

Pierre-François Lavallée, Thibaut Véry 73/77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms):                439 | 16

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77
Parallelism - Memory accesses and coalescing

(Diagram: memory-access pattern of threads T0-T7 for the three variants, A being stored column-major.)
• Without coalescing, gang(i)/vector(j): consecutive threads access elements that are far apart in memory.
• With coalescing, gang(j)/vector(i): consecutive threads access contiguous elements.
• gang-vector(i)/seq(j): at each step of the sequential j loop, the threads access contiguous elements (also coalesced).

Pierre-François Lavallée, Thibaut Véry 47/77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it must be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  integer :: i
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77
Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  integer :: i
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloaded regions have a local data region associated with them:
• serial
• parallel
• kernels

Global region
An implicit data region is opened for the lifetime of the program. It is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called and lasts for the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50/77
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses: you give the first and last index.

! We copy a 2D array to the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets: you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region holding the variables needed for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

(Diagram: the parallel region is enclosed in its implicit data region.)

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Parallel regions

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing so, you make the variables listed in the clauses visible on the device. Use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53/77
Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization kernel is also covered by it.
Data on device

(Diagram: two parallel regions, each with its own implicit data region, using arrays A (copyin), B (copyout), C (copy) and D (create).)
• Naive version: each parallel region transfers its data at entry and exit: 8 transfers, 2 allocations.
• Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region only: 4 transfers, 1 allocation.
• Since the clauses check for data presence, you can use the present clause on the parallel regions to make the code clearer. To make sure the updates are done where needed, use the update directive (update device for A, update self for C): 6 transfers, 1 allocation.

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77
With update (exemples/update_dir.f90), the three commented lines are active:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

The a array is updated on the host right after the update directive.
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0-T7 onto the elements of a 2D array for three loop configurations]
• Without coalescing: gang(i), vector(j) — consecutive threads access non-contiguous elements
• With coalescing: gang(j), vector(i) — consecutive threads access contiguous elements
• With coalescing: gang-vector(i), seq(j)

Pierre-François Lavallée, Thibaut Véry 47/77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel
    !$ACC loop
    do i=1, s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel copyout(array)
    !$ACC loop
    do i=1, s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77
Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: the offloading regions (serial, parallel, kernels) each have an associated local data region.
• Global region: an implicit data region is open during the lifetime of the program; it is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of that procedure; to make data visible there, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which the data appear.

Pierre-François Lavallée, Thibaut Véry 49/77
Data on device - Data clauses

H: Host, D: Device; "variable" means a scalar or an array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may still see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Pierre-François Lavallée, Thibaut Véry 50/77
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

  ! We copy a 2D array on the GPU for matrix a.
  ! In Fortran we can omit the shape: copy(a)
  !$ACC parallel loop copy(a(1:1000,1:1000))
  do i=1, 1000
    do j=1, 1000
      a(i,j) = 0
    enddo
  enddo
  ! We copy columns 100 to 199 (included) back to the host
  !$ACC parallel loop copy(a(:,100:199))
  do i=1, 1000
    do j=100, 199
      a(i,j) = 42
    enddo
  enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

  // We copy the array a by giving the first element and the size of the array
  #pragma acc parallel loop copy(a[0:1000][0:1000])
  for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
      a[i][j] = 0;

  // Copy columns 99 to 198
  #pragma acc parallel loop copy(a[0:1000][99:100])
  for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77
Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program para

exemples/parallel_data.f90

[Diagram: the parallel region implicitly opens a data region around it]

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = i
    enddo
    !$ACC end parallel

    do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
        a(i) = a(i) + 1
      enddo
      !$ACC end parallel
    enddo
  end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77
Data on device - Local data regions

It is possible to open a data region inside a procedure. Doing so makes the variables listed in the clauses visible on the device. You have to use the data directive:

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  !$ACC data copy(a(1:1000))
  do l=1, 100
    !$ACC parallel
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
  !$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you still have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77
Data on device

[Diagram: two successive parallel regions, each with its own implicit data region, using four arrays: A (copyin), B (copyout), C (copy) and D (create)]

Naive version — each parallel region transfers its own data:
• Transfers: 8
• Allocations: 2

Optimize data transfers — let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region:
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device for A, update self for C in the diagram):
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

Without update:

  !$ACC data copyout(a)
  !$ACC parallel loop
  do i=1, 1000
    a(i) = 0
  enddo

  do j=1, 42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1, 1000
      a(i) = a(i) + rng
    enddo
  enddo
  ! print *, "before update self", a(42)
  ! !$ACC update self(a(42:42))
  ! print *, "after update self", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data", a(42)
  !$ACC end data
  print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update, the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

With update:

  !$ACC data copyout(a)
  !$ACC parallel loop
  do i=1, 1000
    a(i) = 0
  enddo

  do j=1, 42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1, 1000
      a(i) = a(i) + rng
    enddo
  enddo
  print *, "before update self", a(42)
  !$ACC update self(a(42:42))
  print *, "after update self", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data", a(42)
  !$ACC end data
  print *, "after end data", a(42)

exemples/update_dir.f90

With update, the a array is initialized on the host right after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers to the device can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

  program enterdata
    integer :: s = 10000
    real*8, allocatable, dimension(:) :: vec
    allocate(vec(s))
    !$ACC enter data create(vec(1:s))
    call compute
    !$ACC exit data delete(vec(1:s))
  contains
    subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
        vec(i) = 12
      enddo
    end subroutine
  end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77
Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

  Zone         Scope
  module       the program
  function     the function
  subroutine   the subroutine

  module declare
    integer :: rows = 1000
    integer :: cols = 1000
    integer :: generations = 100
    !$ACC declare copyin(rows, cols, generations)
  end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77
Data on device - Time line

[Diagram: timeline of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d)]
• When entering the region, a and c are copied H→D; b and d are allocated on the device without transfer.
• While the region is open, the host and device copies may diverge; update self(b) copies the device value of b back to the host.
• When leaving the region, d is copied D→H (copyout); a, b and c are freed on the device without any transfer.

Pierre-François Lavallée, Thibaut Véry 59/77
Data on device - Important notes

• Data transfers between the host and the device are costly: minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

The update directive is used to update data either on the host or on the device:

  ! Update the value of a located on host with the value on device
  !$ACC update self(a)
  ! Update the value of b located on device with the value on host
  !$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, when they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases async is optional; it is possible to specify a number inside the clause to create several execution threads.

[Diagram: timelines for Kernel 1, the data transfer and Kernel 2 — fully synchronous, then "Kernel 1 and transfer async", then everything async]

  ! b is initialized on host
  do i=1, s
    b(i) = i
  enddo
  !$ACC parallel loop async(1)
  do i=1, s
    a(i) = 42
  enddo
  !$ACC update device(b) async(2)
  !$ACC parallel loop async(3)
  do i=1, s
    c(i) = 11
  enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the data transfer before starting]

  ! b is initialized on host
  do i=1, s
    b(i) = i
  enddo
  !$ACC parallel loop async(1)
  do i=1, s
    a(i) = 42
  enddo
  !$ACC update device(b) async(2)
  !$ACC parallel loop wait(2)
  do i=1, s
    b(i) = b(i) + 11
  enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

  do g=1, generations
    cells = 0
    !$ACC parallel
    do r=1, rows
      do c=1, cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1, rows
      do c=1, cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1, rows
      do c=1, cols
        ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77
Optimization - Game of Life

Let's share the work among the active gangs:

  do g=1, generations
    cells = 0
    !$ACC parallel loop
    do r=1, rows
      do c=1, cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    !$ACC parallel loop
    do r=1, rows
      do c=1, cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do r=1, rows
      do c=1, cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life

Let's optimize data transfers:

  cells = 0
  !$ACC data copy(oworld, world)
  do g=1, generations
    cells = 0
    !$ACC parallel loop
    do c=1, cols
      do r=1, rows
        oworld(r,c) = world(r,c)
      enddo
    enddo
    !$ACC parallel loop
    do c=1, cols
      do r=1, rows
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do c=1, cols
      do r=1, rows
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading:
• OpenACC PARALLEL — GRWSVS mode; each gang executes the instructions redundantly ↔ OpenMP TEAMS — creates teams of threads that execute redundantly
• OpenACC SERIAL — only 1 thread executes the code ↔ OpenMP TARGET — GPU offloading, initial thread only
• OpenACC KERNELS — offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H: scalar variables and non-scalar variables (both models).

Explicit transfers H→D or D→H inside explicit parallel regions — clauses of PARALLEL/SERIAL/KERNELS (OpenACC) and of TARGET (OpenMP); data movement is defined from the host's perspective:
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on the device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77
Features: OpenACC 2.7 vs OpenMP 5.0

Data regions:
• declare() — add data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, synchronization at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77
Features: OpenACC 2.7 vs OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism:
• async()/wait — execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend — manage asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

  !$acc parallel
  !$acc loop gang worker vector
  do i=1, nx
    A(i) = 114_8*i
  end do
  !$acc end parallel

exemples/loop1D.f90

  !$acc parallel
  !$acc loop gang worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j) = 114_8
    end do
  end do
  !$acc end parallel

exemples/loop2D.f90

OpenMP:

  !$OMP TARGET
  !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
  do i=1, nx
    A(i) = 114_8*i
  end do
  !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
  !$OMP END TARGET

exemples/loop1DOMP.f90

  !$OMP TARGET TEAMS DISTRIBUTE
  do j=1, nx
    !$OMP PARALLEL DO SIMD
    do i=1, nx
      A(i,j) = 114_8
    end do
    !$OMP END PARALLEL DO SIMD
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC:

  !$acc parallel
  !$acc loop gang
  do k=1, nx
    !$acc loop worker
    do j=1, nx
      !$acc loop vector
      do i=1, nx
        A(i,j,k) = 114_8
      end do
    end do
  end do
  !$acc end parallel

exemples/loop3D.f90

OpenMP:

  !$OMP TARGET TEAMS DISTRIBUTE
  do k=1, nx
    !$OMP PARALLEL DO
    do j=1, nx
      !$OMP SIMD
      do i=1, nx
        A(i,j,k) = 114_8
      end do
      !$OMP END SIMD
    end do
    !$OMP END PARALLEL DO
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts

• Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
• Thibaut Véry — thibaut.very@idris.fr
• Rémi Lacroix — remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Parallelism - Memory accesses and coalescing

Without coalescing, with gang(i)/vector(j): consecutive threads of a vector access array elements separated by a stride, so each access needs its own memory transaction.

With coalescing, with gang(j)/vector(i) or gang-vector(i)/seq(j): consecutive threads access contiguous elements (in Fortran the first index is contiguous), so the accesses are grouped into a single memory transaction.

[Figure: access patterns of threads T0-T7 over a 2D array for gang(i)/vector(j) (strided, no coalescing), gang(j)/vector(i) (contiguous, coalesced) and gang-vector(i)/seq(j) (contiguous, coalesced)]

Pierre-François Lavallée, Thibaut Véry 47/77
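The same rule applies in C, with the indices reversed: C arrays are row-major, so the last index is the contiguous one and should be mapped to the vector threads. A minimal sketch (not from the slides; names are illustrative):

```c
#define N 8

static double a[N][N];

/* In C, a[i][j] and a[i][j+1] are adjacent in memory. Mapping the vector
 * threads to j makes consecutive threads touch consecutive addresses,
 * so the hardware can coalesce the accesses into one transaction. */
void fill_coalesced(void) {
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```

Swapping the two `loop` directives (vector on i, gang on j) would give the strided, non-coalesced pattern of the left-hand figure.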
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated to programming unit lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
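To make the main clauses concrete, here is a hedged sketch (not an example from the slides; function and variable names are invented): b is only read on the device, c is only produced there, and tmp never needs to exist on the host.

```c
#define N 1000

/* copyin(b):  b is copied H->D on entry, never copied back.
 * copyout(c): c is allocated on entry and copied D->H on exit.
 * create(tmp): tmp is allocated on the device only, no transfer. */
void scale_plus_one(const double *b, double *c) {
    double tmp[N];
    #pragma acc parallel loop copyin(b[0:N]) copyout(c[0:N]) create(tmp)
    for (int i = 0; i < N; ++i) {
        tmp[i] = 2.0 * b[i];
        c[i]   = tmp[i] + 1.0;
    }
}
```

Without an OpenACC compiler the pragma is ignored and the loop simply runs on the host, which makes the sketch easy to check.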
Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the convention of the language (C/C++: 0, Fortran: 1).
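For the C/C++ restriction above, a minimal hypothetical sketch (names invented): the compiler cannot know the extent of a malloc'ed block, so the count in the data clause is mandatory.

```c
#include <stdlib.h>

/* a is dynamically allocated: its size is invisible to the compiler,
 * so the transfer shape a[0:n] (first index, element count) is required. */
double *make_ones(int n) {
    double *a = malloc(n * sizeof *a);
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.0;
    return a;
}
```

For a fixed-size array such as `double a[1000]`, the shape could be omitted entirely.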
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! opens a parallel region and its data region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data on device - Parallel regions

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No! You have to move the beginning of the data region before the first loop.
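The fix asked for above can be sketched as follows (a hypothetical C translation, not the slides' Fortran file): the data region opens before the first compute construct, so the initialization also works on device data and the extra copy of the intermediate values disappears.

```c
#define N 1000

/* Data region opened before the FIRST kernel: a is allocated once on the
 * device, filled there, updated in place 100 times, and copied back once. */
void iterate(int *a) {
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = i;

        for (int l = 0; l < 100; ++l) {
            #pragma acc parallel loop present(a[0:N])
            for (int i = 0; i < N; ++i)
                a[i] += 1;
        }
    }
}
```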
Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so A (copyin), B (copyout), C (copy) and D (create) are handled by each region.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing the two parallel regions. A, B and C are now transferred at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

With update: the same code with the update lines active,

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

exemples/update_dir.f90

The a array is initialized on the host right after the update directive.
Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.
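The slide gives no code for this clause, so here is a hedged sketch (names invented): b is modified on the host while a data region is open, and update device pushes the new values H→D before the next kernel reads them.

```c
#define N 100

void offset_after_update(double *b, double *c) {
    #pragma acc data copyin(b[0:N]) copyout(c[0:N])
    {
        /* Host-side modification: the device copy of b is now stale. */
        for (int i = 0; i < N; ++i)
            b[i] *= 2.0;

        /* Push the fresh host values H->D before the kernel uses them. */
        #pragma acc update device(b[0:N])

        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            c[i] = b[i] + 1.0;
    }
}
```

Without the update directive, the kernel would read the values of b captured at copyin time.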
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). The kernels update the device copies; update self(b) brings b back D→H during the region; when the region ends, only d is copied back to the host (copyout), while the host copies of a, b and c keep their last updated values.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async on Kernel 1 and the transfer, the two overlap; with async everywhere, all three overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the data transfer before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• The slides show an example of code that has a race condition.
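The slides' race-condition example is only a screenshot, so here is a minimal hypothetical sketch of the kind of bug autocompare catches (not the slides' code): a reduction variable updated by every iteration without a reduction clause.

```c
#define N 1000

/* BUG when offloaded: every GPU thread updates sum concurrently because the
 * reduction(+:sum) clause is missing, so GPU and CPU results diverge and
 * autocompare stops the run. The sequential CPU result is deterministic. */
double sum_with_race(const double *v) {
    double sum = 0.0;
    #pragma acc parallel loop   /* missing: reduction(+:sum) */
    for (int i = 0; i < N; ++i)
        sum += v[i];
    return sum;
}
```

Adding `reduction(+:sum)` to the loop directive makes the GPU and CPU results agree again.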
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading:
• PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS: create teams of threads, execute redundantly
• SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
• KERNELS: offloading + automatic parallelization | no OpenMP equivalent

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• clauses on PARALLEL/SERIAL/KERNELS | clause on TARGET (data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on device, no transfers | map(alloc:)
• present(), delete() | (no entry in the table)

Data region:
• declare(): add data to the global data region | (no entry in the table)
• data / end data: inside the same programming unit | TARGET DATA MAP(...) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization:
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | no OpenMP equivalent
• loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS

Routines:
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism | (no entry in the table)

Asynchronism:
• async()/wait: execute kernels and transfers asynchronously, with explicit synchronization | nowait/depend: manages asynchronism and task dependencies
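To make the first rows of the table concrete, here is a small hypothetical C sketch (not from the slides) writing the same loop in both models; with a compiler that supports neither extension, both pragmas are ignored and the loops run sequentially with identical results.

```c
#define N 1000

/* OpenACC spelling: parallel loop with an explicit copyout clause. */
void fill_acc(double *a) {
    #pragma acc parallel loop copyout(a[0:N])
    for (int i = 0; i < N; ++i)
        a[i] = 1.14 * i;
}

/* OpenMP 5.0 spelling of the same pattern, per the table:
 * target (offload) + teams distribute parallel for simd + map(from:). */
void fill_omp(double *a) {
    #pragma omp target teams distribute parallel for simd map(from: a[0:N])
    for (int i = 0; i < N; ++i)
        a[i] = 1.14 * i;
}
```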
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
With coalescing
gang(j)vector(i)T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
T1
T1
X
X
T2
T2
X
X
T3
T3
X
X
T4
T4
X
X
T5
T5
X
X
T6
T6
X
X
T7
T7
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Memory accesses and coalescing
Without coalescing
gang(i)vector(j)T0
T0
X
X
X
X
T1
T1
X
X
X
X
T2
T2
X
X
X
X
T3
T3
X
X
X
X
T4
T4
X
X
X
X
T5
T5
X
X
X
X
T6
T6
X
X
X
X
T7
T7
X
X
X
X
With coalescing
gang(j)vector(i)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T1
T2
T3
T4
T5
T6
T7
T0
T1
T2
T3
T4
T5
T6
T7
gang-vector(i)seq(j)X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
T0
T0
T1
T1
T2
T2
T3
T3
T4
T4
T5
T5
T6
T6
T7
T7
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Without ACC routine
program r o u t i n ei n t e g e r s = 10000
3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s
8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins
13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s
18 ar ray ( i j ) = 2enddo
end subrou t ine f i l lend program r o u t i n e
exemplesroutine_wrongf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version
program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )
4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s
c a l l f i l l ( a r ray ( ) s i )9 enddo
$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )
14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s
19 ar ray ( i j ) = 2enddo
end subrou t ine f i l lend program r o u t i n e
exemplesroutinef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on devices by opening different kinds of dataregions
Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels
Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives
Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block
Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive
NotesThe actions taken for the data inside these regions depend on the clause in which they appear
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H Host D Device variable scalar or array
Data movementbull copyin The variable is copied HrarrD the
memory is allocated when entering theregion
bull copyout The variable is copied DrarrH thememory is allocated when entering theregion
bull copy copyin + copyout
No data movementbull create The memory is allocated when
entering the regionbull present The variable is already on the
devicebull delete Frees the memory allocated on the
device for this variable
By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility
Other clausesbull no_createbull deviceptr
bull attachbull detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
FortranThe array shape is to be specified inparentheses You must specify the first and lastindex
5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000
do j =1 100010 a ( i j ) = 0
enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )
15do i =1 1000do j =100199
a ( i j ) = 42enddo
enddo
exemplesarrayshapef90
CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements
5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )
f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0
Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )
f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42
exemplesarrayshapec
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified
Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0
Fortran1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)i n t e g e r i l
5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = i10 enddo
$ACC end p a r a l l e lend program para
exemplesparallel_dataf90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions
It is possible to open a data region inside a procedure. Doing so makes the variables listed in the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device

Naive version: two successive parallel regions (each with its own implicit data region) operate on four arrays, A (copyin), B (copyout), C (copy) and D (create); each region performs its own transfers and allocations.
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers: with the two parallel regions alone, each still carries its own copyin (A), copyout (B), copy (C) and create (D) clauses. Let's add a data region.
Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device

Optimize data transfers: inside the data region the clauses only check for data presence, so to make the code clearer you can use the present clause in the parallel regions. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables listed in a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90
Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables listed in a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90
With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables listed in a device clause are updated H→D.
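Unlike the update self slides, this slide shows no listing, so here is a minimal C sketch of update device (the course examples are Fortran; function and variable names here are illustrative, not from the course). A compiler without OpenACC support ignores the pragmas and runs the code serially on the host, producing the same result.

```c
#include <stddef.h>

/* The host modifies b while a device copy already exists, so
   "update device" pushes the new values H->D before the kernel uses them. */
double scale_with_update(double *b, size_t n, double factor)
{
    double total = 0.0;
    #pragma acc data copyin(b[0:n])
    {
        /* host-side change made after the device copy was created */
        for (size_t i = 0; i < n; ++i)
            b[i] = (double)i;

        /* refresh the device copy (H->D) */
        #pragma acc update device(b[0:n])

        #pragma acc parallel loop reduction(+:total)
        for (size_t i = 0; i < n; ++i)
            total += b[i] * factor;
    }
    return total;
}
```

Without the update device line, a real OpenACC run would sum the stale device values instead of the freshly written ones.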
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that the data transfers to and from the device can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec

  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))

contains

  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine

end program enterdata
exemples/enter_data.f90
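A hedged C counterpart of the Fortran example above: with enter data / exit data, device allocation and release live in different functions than the kernel that uses the data, the same decoupling that makes these directives useful in C++ constructors and destructors. Names are illustrative; pragmas are ignored by non-OpenACC compilers, so the logic also runs serially on the host.

```c
#include <stdlib.h>

/* Unstructured data region around a heap array: the lifetime of the device
   copy is tied to vec_create/vec_destroy, not to any lexical block. */
double *vec_create(size_t n)
{
    double *vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])   /* allocate on device, no copy */
    return vec;
}

void vec_fill(double *vec, size_t n, double value)
{
    #pragma acc parallel loop present(vec[0:n])
    for (size_t i = 0; i < n; ++i)
        vec[i] = value;
    #pragma acc update self(vec[0:n])         /* D->H so the host sees it */
}

void vec_destroy(double *vec, size_t n)
{
    #pragma acc exit data delete(vec[0:n])    /* release the device copy */
    free(vec);
}
```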
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

Host before the region: a=0, b=0, c=0, d=0, then a=10, b=-4, c=42, d=0 just before entry.
Entry, copyin(a,c) create(b) copyout(d): the device receives a=10 and c=42; b and d are allocated but undefined.
The device computes: a=10, b=26, c=42, d=158.
update self(b) (D→H): the host now sees a=10, b=26, c=42, d=0.
The device then sets a=42, b=42 (c=42, d=158).
Exit of the data region, copyout(d): the host ends with a=10, b=26, c=42, d=158; the last device-side values of a and b are lost.
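The time line above can be sketched in code. This is a hedged C illustration (the course does not give this listing): the clause choices and final values follow the slide, while the arithmetic producing d=158 is invented for the sketch. With the pragmas ignored by a plain C compiler, host and device memory coincide and the final host values match the slide.

```c
/* a and c are copied in, b is created on the device, d is copied out.
   "update self(b)" is the only point where the host sees the device value
   of b before the region ends. */
void timeline_demo(int *a, int *b, int *c, int *d)
{
    /* host before the region: a=10, b=-4, c=42, d=0 */
    #pragma acc data copyin(a[0:1], c[0:1]) create(b[0:1]) copyout(d[0:1])
    {
        #pragma acc serial
        {
            b[0] = 26;                       /* device-side value of b */
            d[0] = a[0] + b[0] + c[0] + 80;  /* 10 + 26 + 42 + 80 = 158 */
        }
        #pragma acc update self(b[0:1])      /* host b becomes 26 (D->H) */
    }
    /* after the region, copyout(d): host d = 158; a and c are unchanged */
}
```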
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.
Directive updateThe update directive is used to update data either on the host or on the device
! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Figure: three execution time lines; fully synchronous (Kernel 1, then the data transfer, then Kernel 2); Kernel 1 and the transfer asynchronous; everything asynchronous]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that has to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) means: wait until the completion of execution threads 1 and 2.
[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
exemples/opti/gol_opti1.f90

Info
• Size: 10000 × 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation", g, cells
enddo
exemples/opti/gol_opti2.f90

Info
• Size: 10000 × 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
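The cell-counting loop above relies on a sum reduction shared among gangs. Here is a hedged C sketch of the same idea (not from the course; the flattened row-major layout is an assumption of this sketch). Without reduction(+:cells), each gang would accumulate into its own copy of cells and the final count would be wrong.

```c
/* Sum reduction over a 2-D grid stored row-major in a flat array. */
long count_alive(const int *world, int rows, int cols)
{
    long cells = 0;
    #pragma acc parallel loop collapse(2) reduction(+:cells) \
                copyin(world[0:rows*cols])
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            cells += world[r * cols + c];
    return cells;
}
```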
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data
exemples/opti/gol_opti3.f90

Info
• Size: 10000 × 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()
Data regions
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops; synchronization at the end of the loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)
Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (executes kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)
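One row of the table can be illustrated side by side in C (a hedged sketch, not from the course): the same loop offloaded with an OpenACC directive and with its OpenMP 5.0 counterpart. Compilers without OpenACC/OpenMP support ignore the pragmas and run the loops serially, so both functions compute the same values either way.

```c
/* OpenACC: parallel loop + copy clause. */
void fill_openacc(double *a, int n)
{
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}

/* OpenMP 5.0 equivalent: target teams distribute parallel for + map. */
void fill_openmp(double *a, int n)
{
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}
```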
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Parallelism - Memory accesses and coalescing

[Figure (slide 47/77): memory-access patterns of threads T0 to T7 on a 2-D array for three loop mappings: "Without coalescing: gang(i), vector(j)"; "With coalescing: gang(j), vector(i)"; and "gang-vector(i), seq(j)"]
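The mappings in the coalescing figure can be sketched in C (a hedged illustration, not from the course; array name and size are made up). In C the arrays are row-major, so the vector loop should walk the contiguous last index to get coalesced accesses; the pragmas are ignored by plain C compilers and both versions then run serially with identical results.

```c
#define N 256

/* Vector loop walks j, the contiguous index in C:
   consecutive threads touch consecutive memory (coalesced). */
void init_coalesced(double a[N][N])
{
    #pragma acc parallel loop gang copyout(a[0:N][0:N])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.0;
    }
}

/* Vector loop walks i, the strided index:
   consecutive threads are N doubles apart (not coalesced). */
void init_strided(double a[N][N])
{
    #pragma acc parallel loop gang copyout(a[0:N][0:N])
    for (int j = 0; j < N; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            a[i][j] = 1.0;
    }
}
```

Both functions produce the same array; on a GPU only the first achieves coalesced memory accesses.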
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).
Without !$ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)

  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)

contains

  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill

end program routine

exemples/routine_wrong.f90
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).
Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)

  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)

contains

  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill

end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device; if so, no action is taken. You may encounter clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
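The clauses above can be combined on a single data region. Here is a hedged C sketch (function and array names are illustrative, not from the course): "in" is only read (copyin), "out" is only written (copyout), and "tmp" never needs a host copy during the region (create); present() in the kernels documents that the enclosing data region already placed everything on the device. Pragmas are ignored by non-OpenACC compilers, so the code also runs serially on the host.

```c
/* Two-stage computation staged through a device-only temporary. */
void scale_then_shift(const float *in, float *out, float *tmp,
                      int n, float alpha)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop present(in[0:n], tmp[0:n])
        for (int i = 0; i < n; ++i)
            tmp[i] = alpha * in[i];

        #pragma acc parallel loop present(tmp[0:n], out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;
    }
}
```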
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
FortranThe array shape is to be specified inparentheses You must specify the first and lastindex
5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000
do j =1 100010 a ( i j ) = 0
enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )
15do i =1 1000do j =100199
a ( i j ) = 42enddo
enddo
exemplesarrayshapef90
CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements
5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )
f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0
Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )
f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42
exemplesarrayshapec
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified
Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0
Fortran1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)i n t e g e r i l
5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = i10 enddo
$ACC end p a r a l l e lend program para
exemplesparallel_dataf90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)
3 i n t e g e r i l
$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
8 a ( i ) = ienddo$ACC end p a r a l l e l
do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )
$ACC loopdo i =1 1000
a ( i ) = a ( i ) + 1enddo
18 $ACC end p a r a l l e lenddo
end program para
exemplesparallel_data_multif90
Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and
exit so potentially 200 data transfers)
Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel and kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: with synchronous execution, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, the two overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently on threads 1 and 2; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
• To activate this feature, just add autocompare to the target options at compilation (e.g. -ta=tesla:cc60,autocompare).
• Example of use: code that has a race condition (the slide's code listing was not captured in the transcript).
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel loop
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features - OpenACC 2.7 vs. OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (create teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (data movement defined from the host perspective)
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()
Features - OpenACC 2.7 vs. OpenMP 5.0

Data region
• declare() (add data to the global data region)
• data/end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:)/END TARGET DATA (inside the same programming unit)
• enter data/exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA/TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)
Features - OpenACC 2.7 vs. OpenMP 5.0

• routines: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 11.4_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 11.4_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 11.4_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 11.4_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 11.4_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 11.4_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Parallelism - Memory accesses and coalescing

[Figure: three mappings of threads T0..T7 onto the elements of a 2D array.
• Without coalescing (gang(i), vector(j)): consecutive threads access elements that are far apart in memory.
• With coalescing (gang(j), vector(i)): consecutive threads access consecutive elements (i is the contiguous index in Fortran), which is the efficient pattern.
• gang-vector(i), seq(j): consecutive threads stay on consecutive elements while each thread walks the j direction sequentially, which also coalesces.]
Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with the programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its own implicit data region.]
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal?
Data on device - Local data regions

Answer: no. You have to move the beginning of the data region before the first loop, so that a is transferred only once instead of being copied out by the first parallel region and then copied in again.
Data on device

[Figure: two parallel regions (each with its implicit data region) using arrays A (copyin), B (copyout), C (copy) and D (create).
• Naive version: each parallel region transfers its own data — 8 transfers, 2 allocations.
• With an enclosing data region, A, B and C are now transferred only at entry and exit of the data region — 4 transfers, 1 allocation.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the regions; to make sure the updates are done, use the update directive (update device for H→D, update self for D→H) — 6 transfers, 1 allocation.]
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90
Pierre-François Lavallée, Thibaut Véry — 48/77
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:
program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
Data on device - Opening a data region
There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with the programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
Data on device - Data clauses
H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.
Other clauses:
• no_create
• deviceptr
• attach
• detach
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.
! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90
C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.
// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c
Data on device - Shape of arrays
Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel construct implicitly opens a data region around the parallel region.]
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.
program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90
Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90
Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device

Naive version
[Diagram: two successive parallel regions, each with its own implicit data region; at each region A is copyin, B is copyout, C is copy and D is create.]
• Transfers: 8
• Allocations: 2
Data on device

Optimize data transfers
[Diagram: the same two parallel regions, each still transferring A (copyin), B (copyout), C (copy) and allocating D (create).]
Let's add a data region.
Data on device

Optimize data transfers
[Diagram: an enclosing data region now carries A (copyin), B (copyout), C (copy) and D (create); the two parallel regions reuse them.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1
Data on device

Optimize data transfers
[Diagram: with the data region in place, the parallel regions use present for A, B, C and D; update device and update self refresh the copies when needed.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90
Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90
With update: the a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions enter data exit data
Important notes: one benefit of the enter data and exit data directives is that the data transfers can be managed at a different place in the code than where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions declare
Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where the directive appears. For example:
• Module: the program
• Function: the function
• Subroutine: the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Diagram: host/device time line for a region opened with copyin(a,c) create(b) copyout(d):
• entering the region, a and c are copied H→D while b and d are allocated on the device;
• the device computes (b becomes 26, d becomes 158) while the host copies stay unchanged;
• update self(b) copies b D→H during the region;
• leaving the region, d is copied D→H (copyout) and the other device values are discarded.]
Data on device - Important notes
• Data transfers between the host and the device are costly: it is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers inside the clause creates several execution threads.
[Diagram: three schedules —
• synchronous: Kernel 1, then the data transfer, then Kernel 2, one after the other;
• Kernel 1 and the transfer async: they overlap;
• everything async: Kernel 1, the transfer and Kernel 2 all overlap.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution-thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution-thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.
[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/optigol_opti1.f90
Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/optigol_opti2.f90
Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90
Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features — OpenACC 2.7 (directives/clauses, notes) — OpenMP 5.0 (directives/clauses, notes)

GPU offloading:
• PARALLEL — GRWSVS mode: each gang executes the instructions redundantly ↔ TEAMS — creates teams of threads that execute redundantly
• SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization ↔ X — does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables (both standards)

Explicit transfers H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET — data movement defined from the host perspective):
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on the device, no transfers ↔ map(alloc:)
• present(), delete()
Features — OpenACC 2.7 (directives/clauses, notes) — OpenMP 5.0 (directives/clauses, notes)

Data region:
• declare() — add data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS
Features — OpenACC 2.7 (directives/clauses, notes) — OpenMP 5.0 (directives/clauses, notes)

Routines:
• routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism:
• async() / wait — execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait / depend — manage asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Parallelism - Routines
If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version
program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )
4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s
c a l l f i l l ( a r ray ( ) s i )9 enddo
$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )
14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s
19 ar ray ( i j ) = 2enddo
end subrou t ine f i l lend program r o u t i n e
exemplesroutinef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
Data on device - Opening a data region
There are several ways of making data visible on devices by opening different kinds of dataregions
Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels
Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives
Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block
Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive
NotesThe actions taken for the data inside these regions depend on the clause in which they appear
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77
Data on device - Data clauses
H Host D Device variable scalar or array
Data movementbull copyin The variable is copied HrarrD the
memory is allocated when entering theregion
bull copyout The variable is copied DrarrH thememory is allocated when entering theregion
bull copy copyin + copyout
No data movementbull create The memory is allocated when
entering the regionbull present The variable is already on the
devicebull delete Frees the memory allocated on the
device for this variable
By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility
Other clausesbull no_createbull deviceptr
bull attachbull detach
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
FortranThe array shape is to be specified inparentheses You must specify the first and lastindex
5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000
do j =1 100010 a ( i j ) = 0
enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )
15do i =1 1000do j =100199
a ( i j ) = 42enddo
enddo
exemplesarrayshapef90
CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements
5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )
f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0
Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )
f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42
exemplesarrayshapec
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
     a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90
Data region
Parallel region
Data on device - Parallel regions
Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
     a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
     !$ACC parallel copy(a(1:1000))
     !$ACC loop
     do i=1,1000
        a(i) = a(i) + 1
     enddo
     !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entered and exited, so potentially 200 data transfers)
Not optimal
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables in its clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entered and exited, so potentially 2 data transfers)

Optimal?
Optimal? No! You have to move the beginning of the data region before the first loop.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version: two successive parallel regions (each with its own implicit data region) operate on four arrays A, B, C and D, each region carrying the clauses copyin(A), copyout(B), copy(C) and create(D).
• Transfers: 8
• Allocations: 2
Data on device
Optimize data transfers: the same two parallel regions (each with its own implicit data region) still carry the clauses copyin(A), copyout(B), copy(C) and create(D) on the arrays A, B, C and D. Let's add a data region.
Data on device
Optimize data transfers: a data region now encloses the two parallel regions and carries the clauses copyin(A), copyout(B), copy(C) and create(D). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1
Data on device
Optimize data transfers: the enclosing data region keeps the clauses copyin(A), copyout(B), copy(C) and create(D); inside it, the parallel regions mark A, B, C and D as present, with update device and update self directives where fresh values are needed.
The data clauses already check for data presence, so the present clause mainly makes the code clearer. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
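A minimal C sketch of this pattern (our own illustration, with hypothetical names): one enclosing data region, present on the compute constructs, and update self where the host needs fresh values. Without an OpenACC compiler the pragmas are ignored and the host run gives the same results.

```c
/* One data region; the two compute regions only assert presence.
   update self refreshes the host copy between the two kernels. */
void twice_plus_one(double *c, int n)
{
    #pragma acc data copy(c[0:n])
    {
        #pragma acc parallel loop present(c)
        for (int i = 0; i < n; ++i)
            c[i] *= 2.0;

        /* the host wants to read the intermediate values: D->H */
        #pragma acc update self(c[0:n])

        #pragma acc parallel loop present(c)
        for (int i = 0; i < n; ++i)
            c[i] += 1.0;
    }   /* copy: final values are transferred D->H here */
}
```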
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
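A minimal C sketch combining update device and update self inside one data region (a hypothetical helper of ours, not from the course files). With pragmas ignored everything runs on the host and the results are identical.

```c
/* b is copied in once; a host-side change is pushed with update device,
   and the device results are brought back with update self. */
void offset_all(double *b, int n)
{
    #pragma acc data copyin(b[0:n])
    {
        b[0] = 42.0;                       /* modified on the host...      */
        #pragma acc update device(b[0:n])  /* ...so push it to the device  */

        #pragma acc parallel loop present(b)
        for (int i = 0; i < n; ++i)
            b[i] += 1.0;

        #pragma acc update self(b[0:n])    /* bring the results back D->H  */
    }   /* copyin: nothing is transferred back at exit */
}
```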
Data on device - Global data regions enter data exit data
Important notes
One benefit of the enter data and exit data directives is that data transfers to the device can be managed at a different place from where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
       vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
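The same idea can be paired with allocation and deallocation helpers, the way one would from C++ constructors and destructors. A hypothetical C sketch (function names are ours):

```c
#include <stdlib.h>

/* Device lifetime is managed where the memory is managed,
   not where the data are used. */
double *vec_alloc(int n)
{
    double *v = malloc((size_t)n * sizeof *v);
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_free(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```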
Data on device - Global data regions declare
Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Time-line diagram: Host vs Device]
• Entry of the data region with copyin(a,c) create(b) copyout(d): the host holds a=10, b=-4, c=42, d=0; a and c are copied to the device, while b and d are allocated there uninitialized.
• The device computes until it holds a=10, b=26, c=42, d=158; update self(b) copies b=26 back to the host (host d is still 0).
• The device continues (a=42, b=42, c=42, d=158); at exit of the data region, copyout(d) transfers d=158, so the host ends with a=10, b=26, c=42, d=158.
Data on device - Important notes
• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. Specifying different numbers in the clause creates several execution threads.
[Diagram] Synchronous execution: Kernel 1, then the data transfer, then Kernel 2, one after the other. With Kernel 1 and the transfer async, they overlap. With everything async, Kernel 1, the transfer and Kernel 2 all run concurrently.
! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, some asynchronous kernels will likely need the completion of others. In that case, add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause (or wait directive) is a list of integers identifying execution threads. For example, wait(1,2) waits until completion of execution threads 1 and 2.
[Diagram] Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.
! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the target accelerator options at compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
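The slide shows the example as an image; as a stand-in, here is a hypothetical C kernel with the kind of race autocompare is meant to catch (several iterations may increment the same counter concurrently, so GPU results can differ from the sequential CPU run):

```c
/* Histogram over 10 bins: a data race on count[...] when run in
   parallel; deterministic when the pragma is ignored on the host. */
void histogram(const int *key, int *count, int n)
{
    #pragma acc parallel loop copyin(key[0:n]) copy(count[0:10])
    for (int i = 0; i < n; ++i)
        count[key[i]]++;   /* concurrent writes to the same element */
}
```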
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution increases because the code inside the parallel regions is executed sequentially, and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
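The reduction(+:cells) loop above can be sketched in C as follows (a hypothetical helper of ours, not one of the course files); each gang accumulates a private partial sum and the partial sums are combined at the end of the loop:

```c
/* Sum reduction over a flattened world array, mirroring the
   cells count in the Game of Life example. */
long count_alive(const int *world, int n)
{
    long cells = 0;
    #pragma acc parallel loop reduction(+:cells) copyin(world[0:n])
    for (int i = 0; i < n; ++i)
        cells += world[i];
    return cells;
}
```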
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• OpenACC declare() (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM() (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• OpenACC routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels and transfers asynchronously, with explicit synchronization) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Contacts
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
FortranThe array shape is to be specified inparentheses You must specify the first and lastindex
5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000
do j =1 100010 a ( i j ) = 0
enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )
15do i =1 1000do j =100199
a ( i j ) = 42enddo
enddo
exemplesarrayshapef90
CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements
5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )
f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0
Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )
f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42
exemplesarrayshapec
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified
Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0
Fortran1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)i n t e g e r i l
5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = i10 enddo
$ACC end p a r a l l e lend program para
exemplesparallel_dataf90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)
3 i n t e g e r i l
$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
8 a ( i ) = ienddo$ACC end p a r a l l e l
do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )
$ACC loopdo i =1 1000
a ( i ) = a ( i ) + 1enddo
18 $ACC end p a r a l l e lenddo
end program para
exemplesparallel_data_multif90
Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and
exit so potentially 200 data transfers)
Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Timeline: Kernel 1 and the data transfer start concurrently; Kernel 2 waits for the transfer to complete before it runs]
! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
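The same pattern can be sketched in C; the following is a hypothetical analogue of the Fortran example (the function name async_wait_demo is invented, not from the course files). Compiled without an OpenACC compiler the pragmas are ignored and the code runs sequentially with identical results; with OpenACC, the loop on queue 1 and the update of b on queue 2 overlap, and wait(2) guarantees the updated b is on the device before the last loop.

```c
/* Hypothetical sketch of async/wait: queue 1 carries a kernel that does
   not touch b, queue 2 carries the host-to-device update of b, and the
   final loop waits on queue 2 because it reads the new b. */
void async_wait_demo(int n, double *a, double *b) {
    #pragma acc data copy(b[0:n]) copyout(a[0:n])
    {
        #pragma acc parallel loop async(1)   /* independent of b */
        for (int i = 0; i < n; ++i)
            a[i] = 42.0;

        #pragma acc update device(b[0:n]) async(2)

        #pragma acc parallel loop wait(2)    /* needs the updated b */
        for (int i = 0; i < n; ++i)
            b[i] = b[i] + 1.0;
    }
}
```

Note that, as in the slide, the first kernel needs no wait clause: it reads and writes only a, so it can run concurrently with the transfer of b.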
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
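The slide's example code did not survive extraction. As an illustration only, here is a hypothetical C kernel with the kind of race condition autocompare catches (the function name racy_histogram is invented). On the CPU, where the pragmas are ignored, the result is deterministic; offloaded, the unguarded increment races between gangs, so GPU and CPU results diverge and the autocompare runtime stops at the first mismatch.

```c
/* Hypothetical race condition: several gangs may increment the same
   histogram bin concurrently once the loop is offloaded, so the GPU
   result differs from the sequential CPU result. */
void racy_histogram(int n, const int *data, int *hist) {
    #pragma acc parallel loop copyin(data[0:n]) copy(hist[0:10])
    for (int i = 0; i < n; ++i)
        hist[data[i]]++;   /* unguarded read-modify-write: race on GPU */
}
```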
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/optigol_opti1.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• PARALLEL (OpenACC): GR/WS/VS mode — each gang executes the instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on the device, no transfers ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data region
• declare(): add data to the global data region
• data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS
Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend: manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
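As a compact illustration of the movement clauses (a hypothetical C sketch, not one of the course's exemples files; the function name clause_demo is invented): in is only read, so copyin suffices; out is only written, so copyout avoids the useless H→D copy; io is read and written, so it needs copy. Without an OpenACC compiler the pragma is ignored and the code runs sequentially with the same results.

```c
/* Sketch of copyin/copyout/copy on a single compute construct. */
void clause_demo(int n, const double *in, double *out, double *io) {
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n]) copy(io[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = 2.0 * in[i];   /* out: allocated on D, copied back D->H */
        io[i] += in[i];         /* io: copied H->D on entry and D->H on exit */
    }
}
```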
Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c
Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is nested inside an implicit data region]
program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
No! You have to move the beginning of the data region before the first loop.
Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated at each region.
• Transfers: 8
• Allocations: 2
Optimize data transfers: the same clauses (A: copyin, B: copyout, C: copy, D: create) appear on both parallel regions. Let's add a data region.
With an enclosing data region, A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1
[Figure: inside the data region, each variable is marked present in the compute regions — A: copyin, then present (+ update device); B: copyout, then present; C: copy, then present (+ update self); D: create, then present]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
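The pattern in the figure can be sketched as follows (a hypothetical C example; the function name present_update_demo is invented): one enclosing data region, compute regions that assert presence, and an update directive pushing a host-side change to the device between the two kernels. Without an OpenACC compiler the pragmas are ignored and the sequential result is identical.

```c
/* Sketch: data region + present + update device. */
void present_update_demo(int n, double *a, double *d) {
    #pragma acc data copyin(a[0:n]) copyout(d[0:n])
    {
        #pragma acc parallel loop present(a[0:n], d[0:n])
        for (int i = 0; i < n; ++i)
            d[i] = a[i];

        for (int i = 0; i < n; ++i)   /* a is modified on the host... */
            a[i] = 2.0;
        #pragma acc update device(a[0:n])   /* ...so sync it H->D */

        #pragma acc parallel loop present(a[0:n], d[0:n])
        for (int i = 0; i < n; ++i)
            d[i] += a[i];
    }
}
```

Without the update directive, the second kernel would still see the old device copy of a — exactly the "unexpected behavior" warned about earlier.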
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

Without update (exemples/update_dir_comment.f90, where the update directive is commented out), the a array is not initialized on the host before the end of the data region. With update, the a array is initialized on the host right after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
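A hypothetical C analogue of the Fortran program above (the function names enterdata_demo and compute are invented): the allocation site owns the device lifetime with enter data / exit data, while the compute routine only asserts presence. An update self is added before exit data so the host sees the results; without OpenACC the pragmas are no-ops and the behavior is the same.

```c
#include <stdlib.h>

/* Device lifetime of vec is managed here, not in compute(). */
static double *vec;
static int s;

static void compute(void) {
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

double enterdata_demo(int n) {
    s = n;
    vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])
    compute();
    #pragma acc update self(vec[0:n])     /* bring results back D->H */
    #pragma acc exit data delete(vec[0:n])
    double first = vec[0];
    free(vec);
    return first;
}
```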
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
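In C, the equivalent of the module case is declare at file scope (a hypothetical sketch, not from the course files; the function name grid_cells is invented): the variables get a device copy whose lifetime is the whole program, so kernels can use them without per-region transfers.

```c
/* declare at file scope: device lifetime = program lifetime. */
static int rows = 1000;
static int cols = 1000;
static int generations = 100;
#pragma acc declare copyin(rows, cols, generations)

long grid_cells(void) {
    long cells = 0;
    /* rows and cols are already resident thanks to declare */
    #pragma acc parallel loop reduction(+:cells)
    for (int r = 0; r < rows; ++r)
        cells += cols;
    return cells;
}
```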
Data on device - Time line

[Timeline figure: the host enters a data region with copyin(a,c) create(b) copyout(d); kernels then modify the variables on the device while the host copies keep their old values; update self(b) copies b back to the host in the middle of the region; when the region ends, copyout(d) returns d to the host, while a, b and c keep their host-side values]
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - Shape of arrays
To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++
FortranThe array shape is to be specified inparentheses You must specify the first and lastindex
5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000
do j =1 100010 a ( i j ) = 0
enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )
15do i =1 1000do j =100199
a ( i j ) = 42enddo
enddo
exemplesarrayshapef90
CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements
5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )
f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0
Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )
f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42
exemplesarrayshapec
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Shape of arrays
Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified
Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0
Fortran1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Here the data region has the same extent as the parallel region.
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?
No! You have to move the beginning of the data region before the first loop.
Data on device

Naive version: [diagram of two parallel regions, each with its own implicit data region; the four arrays A, B, C and D are moved at each region with copyin, copyout, copy and create respectively]
- Transfers: 8
- Allocations: 2
Data on device

Optimize data transfers: [same diagram, the two parallel regions still carry their own copyin, copyout, copy and create clauses] Let's add a data region!
Data on device

Optimize data transfers: [diagram with an enclosing data region carrying the copyin, copyout, copy and create clauses] A, B and C are now transferred at entry and exit of the data region.
- Transfers: 4
- Allocations: 1
Data on device

Optimize data transfers: [diagram where the data region keeps the transfer clauses and the parallel regions use present, with update device and update self inserted where the host and device copies must be synchronized] The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
- Transfers: 6
- Allocations: 1
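The present pattern above can be sketched in C. This is an assumption-laden sketch (hypothetical arrays `a`, `b` and function name `add_one`): the transfers happen at the data construct, and the compute construct declares the arrays as already present. Without an OpenACC compiler the pragmas are ignored and the code runs sequentially on the host.

```c
#define N 4

/* b = a + 1 with an enclosing data region: copyin/copyout happen at
   the data construct; present(a, b) documents that no transfer takes
   place at the parallel construct itself. */
void add_one(const double *a, double *b)
{
    #pragma acc data copyin(a[0:N]) copyout(b[0:N])
    {
        #pragma acc parallel loop present(a, b)
        for (int i = 0; i < N; ++i)
            b[i] = a[i] + 1.0;
    }
}
```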
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
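Since this slide has no listing, here is a minimal C sketch of the device clause (hypothetical names `rescale`, `v`, `s_init`): a scalar is changed on the host inside the data region, so the device copy is stale until update device refreshes it. With the pragmas ignored, the function simply runs sequentially.

```c
#define N 8

/* s is modified on the host after the data region opened, so the
   device copy must be refreshed (H->D) before the kernel uses it. */
void rescale(double *v, double s_init)
{
    double s = s_init;
    #pragma acc data copy(v[0:N]) copyin(s)
    {
        s = 2.0 * s_init;            /* host-side modification */
        #pragma acc update device(s)
        #pragma acc parallel loop present(v, s)
        for (int i = 0; i < N; ++i)
            v[i] *= s;
    }
}
```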
Data on device - Global data regions: enter data/exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
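The constructor/destructor analogy can be sketched in plain C with a pair of hypothetical lifecycle functions (`vec_create`/`vec_destroy`): the device copy is created in one function and deleted in another, independently of where it is used. Sequential if the pragmas are ignored.

```c
#include <stdlib.h>

/* Unstructured data lifetime: enter data in the "constructor",
   exit data in the "destructor"; the compute routine only states
   that the data is already present. */
double *vec_create(int n)
{
    double *v = malloc((size_t)n * sizeof *v);
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v)
    for (int i = 0; i < n; ++i)
        v[i] = x;
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```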
Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Time line for a region opened with copyin(a,c) create(b) copyout(d); host values at entry: a=10, b=-4, c=42, d=0]
- Data transfer to the device: a and c are copied H→D; b and d are allocated on the device without transfer. The device kernels then compute b=26 and d=158 while the host copies keep their old values.
- Update data D→H with update self(b): the host copy of b becomes 26 while d is still 0 on the host.
- Exit of the data region: only d=158 is copied back (copyout); later device-side values (a=42, b=42) are discarded.
Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers inside the clause creates several execution threads.

[Diagram: executed synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90
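The Fortran example above can be sketched in C with the same three queues and a final join (hypothetical names; with the pragmas ignored, the three operations simply run one after the other on the host):

```c
#define N 1000

/* Three independent operations placed on three execution threads
   (async queues 1, 2 and 3); the wait directive joins all queues
   before returning. */
void pipeline(double *a, double *b, double *c)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:N]) async(2)

    #pragma acc parallel loop async(3)
    for (int i = 0; i < N; ++i)
        c[i] = 11.0;

    #pragma acc wait
}
```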
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
- The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs asynchronously while b is transferred; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts
Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.
Debugging - PGI autocompare

- To activate this feature, just add autocompare to the target accelerator flag at compilation (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition.
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ! ... (listing truncated on the slide)

exemples/optigol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel loop
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 95 s

Loops are distributed among the gangs.
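The final loop nest (the cell count) can be sketched in C with the same gang-level reduction. This is a hypothetical standalone helper (`count_alive`, with the grid flattened to one dimension); with the pragma ignored it is an ordinary sequential count.

```c
/* Count the living cells with a reduction: each gang accumulates a
   private copy of cells, and the copies are combined with + at the
   end of the loop. */
int count_alive(const int *world, int rows, int cols)
{
    int cells = 0;
    #pragma acc parallel loop reduction(+:cells) copyin(world[0:rows*cols])
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            cells += world[r * cols + c];
    return cells;
}
```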
Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.
Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes
---|---|---|---|---
GPU offloading | PARALLEL | GR/WS/VS mode: each gang executes the instructions redundantly | TEAMS | creates teams of threads that execute redundantly
 | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only
 | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP
Implicit transfer H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables
Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | | clauses on TARGET | data movement defined from the host perspective
 | copyin() | copy H→D when entering the construct | map(to:) |
 | copyout() | copy D→H when leaving the construct | map(from:) |
 | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) |
 | create() | created on device, no transfers | map(alloc:) |
 | present(), delete() | | |
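The clause correspondence above can be sketched on a single kernel in C (hypothetical functions; either pragma set is ignored by a non-offloading compiler, so both versions run sequentially and produce the same result):

```c
#define N 16

/* OpenACC version: b is only read (copyin), a is only written (copyout). */
void scale_acc(double *a, const double *b)
{
    #pragma acc parallel loop copyin(b[0:N]) copyout(a[0:N])
    for (int i = 0; i < N; ++i)
        a[i] = 2.0 * b[i];
}

/* OpenMP 5.0 version: the same data movement expressed with map clauses,
   copyin <-> map(to:), copyout <-> map(from:). */
void scale_omp(double *a, const double *b)
{
    #pragma omp target teams distribute parallel for map(to: b[0:N]) map(from: a[0:N])
    for (int i = 0; i < N; ++i)
        a[i] = 2.0 * b[i];
}
```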
Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes
---|---|---|---|---
Data region | declare | add data to the global data region | |
 | data / end data | inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit
 | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code
 | update self() | update D→H | UPDATE FROM() | copy D→H inside a data region
 | update device() | update H→D | UPDATE TO() | copy H→D inside a data region
Loop parallelization | kernels | offloading + automatic parallelization of suitable loops, sync at the end of loops | X | does not exist in OpenMP
 | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS
Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes
---|---|---|---|---
Routines | routine [seq|vector|worker|gang] | compile a device routine with the specified level of parallelism | |
Asynchronism | async()/wait | execute kernels/transfers asynchronously, with explicit sync | nowait/depend | manages asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
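For comparison, the same triple nest can be written in C with OpenMP offload directives: teams over the outer loop, threads over the middle loop, SIMD lanes over the inner one. A sketch with hypothetical names (`fill3d`, `NX`); without an offloading compiler the pragmas are ignored and the nest runs sequentially.

```c
#define NX 8

/* Triple loop nest: TEAMS DISTRIBUTE over k, PARALLEL FOR over j,
   SIMD over i — mirroring the Fortran loop3DOMP example. */
void fill3d(double A[NX][NX][NX])
{
    #pragma omp target teams distribute
    for (int k = 0; k < NX; ++k) {
        #pragma omp parallel for
        for (int j = 0; j < NX; ++j) {
            #pragma omp simd
            for (int i = 0; i < NX; ++i)
                A[k][j][i] = 1.14;
        }
    }
}
```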
Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - Shape of arrays
Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified
Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0
Fortran1)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)i n t e g e r i l
5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = i10 enddo
$ACC end p a r a l l e lend program para
exemplesparallel_dataf90
Data region
Parallel region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)
3 i n t e g e r i l
$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
8 a ( i ) = ienddo$ACC end p a r a l l e l
do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )
$ACC loopdo i =1 1000
a ( i ) = a ( i ) + 1enddo
18 $ACC end p a r a l l e lenddo
end program para
exemplesparallel_data_multif90
Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and
exit so potentially 200 data transfers)
Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77
Features: OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

GPU offloading
• PARALLEL: GR/WS/VS mode (gang-redundant, worker-single, vector-single); each gang executes the instructions redundantly | TEAMS: creates teams of threads that execute redundantly
• SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
• KERNELS: offloading + automatic parallelization | X: does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
(clauses of PARALLEL/SERIAL/KERNELS | clauses of TARGET; data movement defined from the host perspective)
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on the device, no transfers | map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77
Features: OpenACC 2.7 | OpenMP 5.0

Data region
• declare(): add data to the global data region
• data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | X: does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72 / 77
Features: OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism
• asynchronism: async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend: manage asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73 / 77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77
OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

(Figure: the construct is annotated both as a data region and as a parallel region.)

Pierre-François Lavallée, Thibaut Véry 52 / 77
Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77
Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77
Data on device

Naive version: two successive parallel regions, each carrying its own implicit data region. The clauses are applied to each region: A copyin, B copyout, C copy, D create.
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77
Data on device

Optimize data transfers: the same two parallel regions (+ implicit data regions), each still carrying A copyin, B copyout, C copy and D create. Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54 / 77
Data on device

Optimize data transfers: the two parallel regions are now enclosed in an explicit data region that carries the copyin, copyout, copy and create clauses. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77
Data on device

Optimize data transfers: inside the data region, the clauses on the parallel regions only check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for the copyin array, update self for the copy array).
• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77
Data on device - Time line

(Figure: host and device time lines for a data region opened with copyin(a,c) create(b) copyout(d). At entry, a=10 and c=42 are copied H→D, while b is only allocated on the device and d will only be copied back. The device kernels then compute b=26 and d=158; the host copies keep their old values. An update self(b) in the middle of the region copies b=26 back to the host. Later device writes (a=42, b=42) stay on the device. At exit of the data region, only d=158 is copied D→H.)

Pierre-François Lavallée, Thibaut Véry 59 / 77
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; it is possible to specify a number inside the clause to create several execution threads.

(Figure: three schedules of Kernel 1, a data transfer and Kernel 2: everything synchronous; Kernel 1 and the transfer async; everything async.)

! b is initialized on host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

(Figure: Kernel 2 waits for the transfer before starting.)

! b is initialized on host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - Parallel regions
Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution
program parai n t e g e r a (1000)
3 i n t e g e r i l
$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
8 a ( i ) = ienddo$ACC end p a r a l l e l
do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )
$ACC loopdo i =1 1000
a ( i ) = a ( i ) + 1enddo
18 $ACC end p a r a l l e lenddo
end program para
exemplesparallel_data_multif90
Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and
exit so potentially 200 data transfers)
Not optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
Optimal
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, the host holds a=10, b=-4, c=42, d=0; the device receives a=10 and c=42, while b and d are allocated on the device but not transferred. Device kernels then compute b=26 and d=158; update self(b) copies b=26 back to the host while the host copy of d is still 0. When the region ends, copyout(d) transfers d=158 to the host.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)
exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is nevertheless able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify different numbers inside the clause to create several execution threads.
[Figure: execution time lines. Synchronous: kernel 1, the data transfer and kernel 2 run one after the other. With async on kernel 1 and the transfer, the two overlap. With async everywhere, kernel 1, the data transfer and kernel 2 all run concurrently.]
! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 11
enddo
exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.
[Figure: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer to complete before starting.]
! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 11
enddo
exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g = 1, generations
  cells = 0
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      ! ... (snippet truncated on the slide)
exemples/optigol_opti1.f90
Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution increases dramatically because the code inside the parallel regions is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo
exemples/optigol_opti2.f90
Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

The loop iterations are now distributed among the gangs.
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c = 1, cols
    do r = 1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data
exemples/optigol_opti3.f90
Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.
Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL: G/W/V/S mode; each gang executes the instructions redundantly | TEAMS: creates teams of threads that execute redundantly
• SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
• KERNELS: offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (data movement defined from the host perspective)
• clauses of PARALLEL/SERIAL/KERNELS | clauses of TARGET
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on the device, no transfers | map(alloc:)
• present(), delete() |
Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare(): add data to the global data region |
• data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops; synchronization at the end of the loops | does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Features | OpenACC 2.7 | OpenMP 5.0

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism |

Asynchronism
• async() / wait: execute kernels and transfers asynchronously, with explicit synchronization | nowait / depend: manage asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
  A(i) = 1.14_8 * i
end do
!$acc end parallel
exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel
exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
  A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET
exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
  !$OMP PARALLEL DO SIMD
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE
exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k = 1, nx
  !$acc loop worker
  do j = 1, nx
    !$acc loop vector
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel
exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
  !$OMP PARALLEL DO
  do j = 1, nx
    !$OMP SIMD
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE
exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Data on device - Local data regions
It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.
!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i = 1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l = 1, 100
  !$ACC parallel
  !$ACC loop
  do i = 1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data
exemples/parallel_data_single.f90
Notes
• Compute region reached 100 times
• Data region reached 1 time (entered and exited once, so potentially only 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
Data on device
[Figure: naive version. Two successive parallel regions, each with its own implicit data region moving A (copyin), B (copyout), C (copy) and D (create).]
• Transfers: 8
• Allocations: 2
Data on device - Optimize data transfers
[Figure: the same two parallel regions, still with their per-region clauses copyin(A), copyout(B), copy(C) and create(D).]
Let's add a data region.
Data on device - Optimize data transfers
[Figure: a data region now encloses both parallel regions and carries copyin(A), copyout(B), copy(C) and create(D).]
A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocation: 1
Data on device - Optimize data transfers
[Figure: the enclosing data region keeps the copyin/copyout/copy/create clauses; inside the parallel regions the arrays are marked present, with update device and update self where fresh values are needed.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host)
The variables included within a self clause are updated D→H.
!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo
do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self ", a(42)
!!$ACC update self(a(42:42))    ! directive commented out in this version
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)
exemples/update_dir_comment.f90
Without update
The a array is not initialized on the host before the end of the data region.
Data on device - update self
!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo
do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)
exemples/update_dir.f90
With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - Local data regions
It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive
5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000
a ( i ) = ienddo
10 $ACC end p a r a l l e l
$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100
$ACC p a r a l l e l15 $ACC loop
do i =1 1000a ( i ) = a ( i ) + 1
enddo$ACC end p a r a l l e l
20 enddo$ACC end data
exemplesparallel_data_singlef90
Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit
so potentially 2 data transfers)
OptimalNo You have to move the beginning of the dataregion before the first loop
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
   a(i) = 0
enddo

do j = 1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i = 1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
Variables listed in the device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i = 1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

  Zone         Scope
  Module       the program
  Function     the function
  Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). Computation updates the device copies; an update self(b) copies b D→H in the middle of the region; d is copied back D→H only when the region ends.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers inside the clause creates several execution threads.

[Figure: three time lines - fully synchronous (Kernel 1, data transfer, Kernel 2 in sequence); Kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i = 1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i = 1, s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause (or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line - Kernel 2 waits for the transfer on execution thread 2 before starting.]

! b is initialized on host
do i = 1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i = 1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:
Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
   cells = 0
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life

Let's share the work among the active gangs.

do g = 1, generations
   cells = 0
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.
Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
   cells = 0
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c = 1, cols
      do r = 1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
Clauses on PARALLEL/SERIAL/KERNELS (OpenACC) and on TARGET (OpenMP); data movement is defined from the host perspective.
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present() / delete()
Data region
• OpenACC declare() (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ no OpenMP equivalent
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)
Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels and transfers asynchronously, with explicit synchronization) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 114_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
!$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 114_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
!$acc loop worker
   do j = 1, nx
!$acc loop vector
      do i = 1, nx
         A(i,j,k) = 114_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
!$OMP PARALLEL DO
   do j = 1, nx
!$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 114_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Data on device
Naive version Parallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin
copyout
copy
create
copyin
copyout
copy
create
bull Transfers 8bull Allocations 2
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)
A
B
C
D
copyin copyin
copyout copyout
copy copy
create create
Lets add a data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not updated on the host before the end of the data region.
Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is updated on the host after the update directive.
Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
       vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Module → Scope: the program
Zone: function → Scope: the function
Zone: subroutine → Scope: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated (uninitialized) on the device; the host and device copies then evolve independently. An update self(b) in the middle of the region copies the device value of b back to the host. At exit, only d is copied D→H (copyout); for a, b and c the host keeps whatever it last computed.]
Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers inside the clause creates several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; "Kernel 1 and transfer async" overlaps the first kernel with the transfer; "Everything is async" overlaps all three.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait (clause or directive) is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs asynchronously while b is transferred; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)           +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)           +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)           +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading:
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads; they execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Data regions:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of the loops) ↔ X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)
Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async() / wait (execute kernels and transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manages asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin copyincopyin
copyout copyout copyout
copy copy copy
create create create
AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device
Optimize data transfersParallel region(+ Data region)
Parallel region(+ Data region)Data region
A
B
C
D
copyin present presentupdate device
copyout present present
copy present presentupdate self
create present present
The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective
bull Transfers 6bull Allocation 1
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
   a(i) = 0
enddo
do j = 1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i = 1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90
With update
The a array is initialized on the host after the update directive.
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update in a parallel region.
device
The variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data
Important notes
One benefit of the enter data and exit data directives is that the data transfers to and from the device can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.
program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i = 1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
Data on device - Global data regions: declare
Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:
Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine
module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line
[Figure: host/device timeline across a data region opened with copyin(a,c) create(b) copyout(d). The host holds a=10, b=-4, c=42, d=0 at entry; copyin transfers a and c H→D, b is allocated on the device without any transfer, and d is not transferred. Device kernels then compute a=10, b=26, c=42, d=158 without changing the host copies. An update self(b) copies b D→H, so the host now sees a=10, b=26, c=42, d=0; later kernels can keep modifying the device copies. When the region ends, copyout(d) transfers d D→H and the host finishes with a=10, b=26, c=42, d=158.]
Data on device - Important notes
• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent
Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases, the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.
[Figure: execution timelines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. Kernel 1 and transfer async: Kernel 1 and the data transfer overlap, Kernel 2 runs afterwards. Everything async: Kernel 1, the data transfer and Kernel 2 all overlap.]
! b is initialized on host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
   c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.
Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits until execution threads 1 and 2 have completed.
[Figure: execution timeline. Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete before starting.]
! b is initialized on host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the target accelerator options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
   cells = 0
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
...

exemples/optigol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s
The time to solution increases because the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
   cells = 0
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s
Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
   cells = 0
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c = 1, cols
      do r = 1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s
Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

GPU offloading
• OpenACC PARALLEL: G/W/V/S mode; each gang executes the instructions redundantly. OpenMP TEAMS: creates teams of threads that execute redundantly.
• OpenACC SERIAL: only 1 thread executes the code. OpenMP TARGET: GPU offloading, initial thread only.
• OpenACC KERNELS: offloading + automatic parallelization. OpenMP: does not exist.

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables. OpenMP: scalar variables, non-scalar variables.

Explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS. OpenMP: clauses on TARGET; data movement is defined from the host perspective.)
• copyin(): copy H→D when entering the construct. OpenMP: map(to:)
• copyout(): copy D→H when leaving the construct. OpenMP: map(from:)
• copy(): copy H→D when entering and D→H when leaving. OpenMP: map(tofrom:)
• create(): created on device, no transfers. OpenMP: map(alloc:)
• present(), delete()
Data region
• declare(): adds data to the global data region.
• data/end data: inside the same programming unit. OpenMP: TARGET DATA MAP(tofrom:)/END TARGET DATA, inside the same programming unit.
• enter data/exit data: opening and closing anywhere in the code. OpenMP: TARGET ENTER DATA/TARGET EXIT DATA, opening and closing anywhere in the code.
• update self(): update D→H. OpenMP: UPDATE FROM, copy D→H inside a data region.
• update device(): update H→D. OpenMP: UPDATE TO, copy H→D inside a data region.

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops. OpenMP: does not exist.
• loop gang/worker/vector: share iterations among gangs, workers and vectors. OpenMP: distribute, share iterations among TEAMS.
Routines
• routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism.

Asynchronism
• OpenACC async()/wait: execute kernels and transfers asynchronously, with explicit synchronization. OpenMP nowait/depend: manage asynchronism and task dependencies.
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
!$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
!$acc loop worker
   do j = 1, nx
!$acc loop vector
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
!$OMP PARALLEL DO
   do j = 1, nx
!$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dir_commentf90
Without update
The a array is not initialized on the host beforethe end of the data region
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region
self (or host)Variables included within self are updated DrarrH
$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000
a ( i ) = 010 enddo
do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)
15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000
a ( i ) = a ( i ) + rngenddo
20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l
25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)
exemplesupdate_dirf90
With update
The a array is initialized on the host after theupdate directive
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77
Data on device - update device
Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region
deviceThe variables included inside a device clause areupdated HrarrD
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77
Data on device - Global data regions enter data exit data
Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors
1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )
6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute
i n t e g e r i11 $ACC p a r a l l e l loop
do i =1 svec ( i ) = 12
enddoend subrou t ine
16end program enterdata
exemplesenter_dataf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
Data on device - Global data regions declare
Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example
Zone ScopeModule the programfunction the functionsubroutine the subroutine
module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000
4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )
end module
exemplesdeclaref90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77
Data on device - Time line
Host
Device
Data transfer to the device Update data DrarrH Exit data region
a=0b=0c=0d=0
a=10b=-4c=42d=0
a=10b=c=42d=
copyin(ac) create(b) copyout(d)
a=10b=26c=42
d=158
a=10b=26c=42d=0
update self(b)
a=42b=42c=42
d=158
a=10b=26c=42
d=158
copyout(d)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Diagram: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with "Kernel 1 and transfer async", Kernel 1 and the transfer overlap; with "everything async", Kernel 1, the transfer and Kernel 2 all overlap.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo
exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Diagram: Kernel 1 and the data transfer run concurrently on their own execution threads; Kernel 2 waits for the transfer to complete before starting.]
! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo
exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ! (listing truncated on the slide)
exemples/optigol_opti1.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel loop
  print *, "Cells alive at generation ", g, cells
enddo
exemples/optigol_opti2.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s
Loops are distributed among gangs
Optimization - Game of Life
Let's optimize data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data
exemples/optigol_opti3.f90
Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s
Transfers H→D and D→H have been optimized.
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfers H→D or D→H
• scalar variables and non-scalar variables (in both models)

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes, continued)

Data regions
• declare (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ X (does not exist in OpenMP)
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)
Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes, continued)

• Routines: routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)
• Asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 114_8*i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 114_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 114_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 114_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 114_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Data on device - update device
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.
device: the variables included inside a device clause are updated H→D.
Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at a different place than where the data are used. It is especially useful when using C++ constructors and destructors.
program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata
exemples/enter_data.f90
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
-----------|---------------
Module     | the program
Function   | the function
Subroutine | the subroutine
module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90
Data on device - Time line

[Figure: timeline of a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D, while b and d are allocated on the device without any transfer. During the region, update self(b) copies the device value of b back to the host. The host value of d is only updated when the data region is exited (copyout).]
Data on device - Important notes
• Data transfers between the host and the device are costly: it is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.
Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
Parallelism - Asynchronism
By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.
[Figure: execution timelines — synchronous execution (Kernel 1, data transfer, Kernel 2 in sequence); Kernel 1 and the transfer made asynchronous; everything asynchronous.]
! b is initialized on host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
  c(i) = 1
enddo

exemples/async_slides.f90
Parallelism - Asynchronism - wait
To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
[Figure: execution timeline — Kernel 1 runs asynchronously; Kernel 2 waits for the transfer to complete before starting.]
! b is initialized on host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
Debugging - PGI autocompare
• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
Debugging - PGI autocompare
• To activate this feature, just add autocompare to the target accelerator options at compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition. [Listing not preserved in the transcript.]
Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.
do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
Optimization - Game of Life
Let's share the work among the active gangs.
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel loop
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.
Optimization - Game of Life
Let's optimize the data transfers.
cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
Features | OpenACC 2.7 | OpenMP 5.0
---------|-------------|-----------
GPU offloading | PARALLEL: GR-WS-VS mode, each gang executes the instructions redundantly | TEAMS: create teams of threads that execute redundantly
 | SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
 | KERNELS: offloading + automatic parallelization | X: does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables | scalar variables, non-scalar variables
Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | clause on TARGET (data movement defined from the host perspective)
copy H→D when entering the construct | copyin() | map(to:)
copy D→H when leaving the construct | copyout() | map(from:)
copy H→D when entering and D→H when leaving | copy() | map(tofrom:)
created on device, no transfers | create() | map(alloc:)
 | present() / delete() | 
Features | OpenACC 2.7 | OpenMP 5.0
---------|-------------|-----------
Data region | declare(): add data to the global data region | 
 | data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
 | enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
 | update self(): update D→H | UPDATE FROM: copy D→H inside a data region
 | update device(): update H→D | UPDATE TO: copy H→D inside a data region
Loop parallelization | kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops | X: does not exist in OpenMP
 | loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS
Features | OpenACC 2.7 | OpenMP 5.0
---------|-------------|-----------
Routines | routine [seq/vector/worker/gang]: compile a device routine with the specified level of parallelism | 
Asynchronism | async() / wait: execute kernels and transfers asynchronously, explicit synchronization | nowait / depend: manage asynchronism and task dependencies
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 114_8 * i
end do
!$acc end parallel

exemples/loop1D.f90
!$acc parallel
!$acc loop gang worker
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 114_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90
OpenMP
!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 114_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90
!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
  !$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 114_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90
OpenACC
!$acc parallel
!$acc loop gang
do k=1, nx
  !$acc loop worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j,k) = 114_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90
OpenMP
!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
  !$OMP PARALLEL DO
  do j=1, nx
    !$OMP SIMD
    do i=1, nx
      A(i,j,k) = 114_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr
Data on device - Important notes
bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance
bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data
Directive updateThe update directive is used to update data either on the host or on the device
Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )
4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )
exemplesupdatef90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77
Parallelism - Asynchronism
By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant
Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads
Kernel 1
Data transfer
Kernel 2
Kernel 1
Data transfer
Kernel 2
Kernel 1 and transfer async
Kernel 1
Data transfer
Kernel 2
Everything isasync
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop async ( 3 )do i =1 s
c ( i ) = 114 enddo
exemplesasync_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of Life
Let's share the work among the active gangs.

    do g = 1, generations
       cells = 0
       !$ACC parallel loop
       do r = 1, rows
          do c = 1, cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r = 1, rows
          do c = 1, cols
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)   + oworld(r+1,c) + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r = 1, rows
          do c = 1, cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, ":", cells
    enddo

exemples/optigol_opti2.f90

Info:
bull Size: 10000x5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 9.5 s

Loops are distributed among gangs.
Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life
Let's optimize the data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g = 1, generations
       cells = 0
       !$ACC parallel loop
       do c = 1, cols
          do r = 1, rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c = 1, cols
          do r = 1, rows
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)   + oworld(r+1,c) + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c = 1, cols
          do r = 1, rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, ":", cells
    enddo
    !$ACC end data

exemples/optigol_opti3.f90

Info:
bull Size: 10000x5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 5 s

The H→D and D→H transfers have been optimized.
Pierre-François Lavallée, Thibaut Véry 69/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 70/77
| Features | OpenACC 2.7 directives/clauses | Notes | OpenMP 5.0 directives/clauses | Notes |
|---|---|---|---|---|
| GPU offloading | PARALLEL | GR/WS/VS mode: each gang executes the instructions redundantly | TEAMS | creates teams of threads that execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfer H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | | clause on TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on the device, no transfers | map(alloc:) | |
| | present() / delete() | | | |

Pierre-François Lavallée, Thibaut Véry 71/77
| Features | OpenACC 2.7 directives/clauses | Notes | OpenMP 5.0 directives/clauses | Notes |
|---|---|---|---|---|
| Data region | declare() | add data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM() | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO() | copy H→D inside a data region |
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops, sync at the end of loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS |

Pierre-François Lavallée, Thibaut Véry 72/77
| Features | OpenACC 2.7 directives/clauses | Notes | OpenMP 5.0 directives/clauses | Notes |
|---|---|---|---|---|
| Routines | routine [seq/vector/worker/gang] | compile a device routine with the specified level of parallelism | | |
| Asynchronism | async() / wait | execute kernels/transfers asynchronously, with explicit synchronization | nowait / depend | manage asynchronism and task dependencies |

Pierre-François Lavallée, Thibaut Véry 73/77
Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC:

    !$acc parallel
    !$acc loop gang worker vector
    do i = 1, nx
       A(i) = 1.14_8 * i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j = 1, nx
       !$acc loop vector
       do i = 1, nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i = 1, nx
       A(i) = 1.14_8 * i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j = 1, nx
       !$OMP PARALLEL DO SIMD
       do i = 1, nx
          A(i,j) = 1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC:

    !$acc parallel
    !$acc loop gang
    do k = 1, nx
       !$acc loop worker
       do j = 1, nx
          !$acc loop vector
          do i = 1, nx
             A(i,j,k) = 1.14_8
          end do
       end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k = 1, nx
       !$OMP PARALLEL DO
       do j = 1, nx
          !$OMP SIMD
          do i = 1, nx
             A(i,j,k) = 1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts
Pierre-François Lavallée, Thibaut Véry 76/77
Contacts
bull Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
bull Thibaut Véry: thibaut.very@idris.fr
bull Rémi Lacroix: remi.lacroix@idris.fr
Pierre-François Lavallée, Thibaut Véry 77/77
Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
bull computation and data transfers
bull kernels, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.

(Timeline figure: in the synchronous case, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", the first kernel overlaps the transfer; when everything is async, all three overlap.)

    ! b is initialized on the host
    do i = 1, s
       b(i) = i
    enddo

    !$ACC parallel loop async(1)
    do i = 1, s
       a(i) = 42
    enddo

    !$ACC update device(b) async(2)

    !$ACC parallel loop async(3)
    do i = 1, s
       c(i) = 1
    enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77
Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
bull The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

(Timeline figure: Kernel 1 runs on thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer to complete.)

    ! b is initialized on the host
    do i = 1, s
       b(i) = i
    enddo

    !$ACC parallel loop async(1)
    do i = 1, s
       a(i) = 42
    enddo

    !$ACC update device(b) async(2)

    !$ACC parallel loop wait(2)
    do i = 1, s
       b(i) = b(i) + 1
    enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Parallelism - Asynchronism - wait
To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available
Important notesbull The argument of wait or wait is a list of
integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2
Kernel 1
Data transfer
Kernel 2
Kernel 2 waitsfor transfer
b i s i n i t i a l i z e d on hostdo i =1 s
b ( i ) = i4 enddo
$ACC p a r a l l e l loop async ( 1 )do i =1 s
a ( i ) = 42enddo
9 $ACC update device ( b ) async ( 2 )
$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s
b ( i ) = b ( i ) + 114 enddo
exemplesasync_wait_slidesf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77
Debugging - PGI autocompare
bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution
is stopped
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC (exemples/loop1D.f90):

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

OpenACC (exemples/loop2D.f90):

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

OpenMP (exemples/loop1D_OMP.f90):

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

OpenMP (exemples/loop2D_OMP.f90):

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

Pierre-François Lavallée, Thibaut Véry 74/77
OpenACC (exemples/loop3D.f90):

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

OpenMP (exemples/loop3D_OMP.f90):

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

Pierre-François Lavallée, Thibaut Véry 75/77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77
Contacts

• Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
• Thibaut Véry — thibaut.very@idris.fr
• Rémi Lacroix — remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77
Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77
Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive (exemples/opti/gol_opti1.f90):

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77
Optimization - Game of Life

Let's share the work among the active gangs (exemples/opti/gol_opti2.f90):

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77
Optimization - Game of Life

Let's optimize the data transfers (exemples/opti/gol_opti3.f90):

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized: the data region around the time loop keeps world and oworld resident on the device, so only the first entry and the final exit pay a copy.

Pierre-François Lavallée, Thibaut Véry 69/77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Debugging - PGI autocompare
bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)
bull Example of code that has a race condition
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
copy() copy HrarrD whenentering and DrarrHwhen leaving
map(tofrom)
create() created on device no transfers
map(alloc)
present()delete()
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
Data region
declare() Add data to theglobal data region
dataend data inside the sameprogramming unit
TARGET DATAMAP(tofrom)ENDTARGET DATA
inside the sameprogramming unit
enter dataexit data opening and clos-ing anywhere in thecode
TARGET ENTERDATATARGETEXIT DATA
opening and clos-ing anywhere in thecode
update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region
update device () update HrarrD UPDATE TO Copy HrarrD insidea data region
Loopparallelization
kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops
X Does not exist inOpenMP
loop gangwork-ervector
Share iterationsamong gangsworkers and vec-tors
distribute Share iterationsamong TEAMS
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
routines routine [seqvec-torworkergang]
compile a deviceroutine with speci-fied level of paral-lelism
asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync
nowaitdepend manages asyn-chronism et taskdependancies
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC
$acc p a r a l l e l10 $acc loop gang worker vec to r
do i =1 nxA( i )=114_8lowast i
end do $acc end p a r a l l e l
exemplesloop1Df90
$acc p a r a l l e l10 $acc loop gang worker
do j =1 nx $acc loop vec to rdo i =1 nx
A( i j )=114_815 end do
end do $acc end p a r a l l e l
exemplesloop2Df90
OpenMP
$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD
do i =1 nxA( i )=114_8lowast i
end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR
15 $OMP END TARGET
exemplesloop1DOMPf90
$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx
$OMP PARALLEL DO SIMDdo i =1 nx
A( i j )=114_8end do
15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop2DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77
OpenACC
$acc p a r a l l e l10 $acc loop gang
do k=1 nx $acc loop workerdo j =1 nx
$acc loop vec to r15 do i =1 nx
A( i j )=114_8end do
end doend do
20 $acc end p a r a l l e l
exemplesloop3Df90
OpenMP
$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx
$OMP PARALLEL DOdo j =1 nx
$OMP SIMDdo i =1 nx
15 A( i j k )=114_8end do$OMP END SIMD
end do$OMP END PARALLEL DO
20 end do$OMP END TARGET TEAMS DISTRIBUTE
exemplesloop3DOMPf90
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77
Contacts
Contacts
bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77
- Introduction
-
- Definitions
- Short history
- Architecture
- Execution models
- Porting strategy
- PGI compiler
- Profiling the code
-
- Parallelism
-
- Offloaded Execution
-
- Add directives
- Compute constructs
- Serial
- KernelsTeams
- Parallel
-
- Work sharing
-
- Parallel loop
- Reductions
- Coalescing
-
- Routines
-
- Routines
-
- Data Management
- Asynchronism
-
- GPU Debugging
- Annex Optimization example
- Annex OpenACC 27OpenMP 50 translation table
- Contacts
-
Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC end p a r a l l e l
40 $ACC p a r a l l e ldo r =1 rows
do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
50 end i fenddo
enddo$ACC end p a r a l l e l$ACC p a r a l l e l
55do r =1 rowsdo c=1 co ls
exemplesoptigol_opti1f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s
The time to solution is increased since the codeis executed sequentially and redundantly by allgangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77
Optimization - Game of LifeLets share the work among the active gangs
do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows
35 do c=1 co lsoworld ( r c ) = wor ld ( r c )
enddoenddo$ACC p a r a l l e l loop
40do r =1 rowsdo c=1 co ls
neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) thenworld ( r c ) = 1
end i f50 enddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows
do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )
enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
60enddo
exemplesoptigol_opti2f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s
Loops are distributed among gangs
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77
Optimization - Game of LifeLets optimize data transfers
c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0
5 $ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsoworld ( r c ) = wor ld ( r c )
enddo10enddo
$ACC p a r a l l e l loopdo c=1 co ls
do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp
15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)
i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0
else i f ( neigh == 3) then20 world ( r c ) = 1
end i fenddo
enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )
25 do c=1 co lsdo r =1 rows
c e l l s = c e l l s + wor ld ( r c )enddo
enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s
enddo$ACC end data
exemplesoptigol_opti3f90
Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s
Transfers HrarrD and DrarrH have been optimized
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex Optimization example
5 Annex OpenACC 27OpenMP 50 translation table
6 Contacts
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77
Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes
GPU offloadingPARALLEL GRWSVS mode
- each team exe-cutes instructionsredundantly
TEAMS create teams ofthreads executeredundantly
SERIAL Only 1 thread exe-cutes the code
TARGET GPU offloadinginitial thread only
KERNELS Offloading + au-tomatic paralleliza-tion
X Does not exist inOpenMP
Implicit transferHrarrD or DrarrH
scalar variables non-scalar vari-ables
scalar variablesnon-scalar vari-ables
Explicit transferHrarrD od DrarrHinside explicitparallel regions
Clauses PARAL-LELSERIALKER-NEL
Clause TARGET Data movementdefined from Hostperspective
copyin() copy HrarrD whenentering construct
map(to)
copyout() copy DrarrH whenleaving construct
map(from)
Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
   cells = 0
   !$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

(exemples/opti/gol_opti2.f90)

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

The loops are distributed among the gangs.
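The cell count above uses a `reduction(+:cells)` clause. A minimal sketch of the same pattern in C (our own example, not the course code — function name `count_alive` is ours) shows how the clause is spelled; without an OpenACC compiler the pragma is simply ignored and the loop runs serially with the same result:

```c
#include <assert.h>

/* OpenACC sum reduction, like the "cells" count in the slide above.
 * Each gang/worker/vector lane accumulates a private copy of `cells`,
 * which are combined with + at the end of the parallel region. */
long count_alive(int n, const int *world)
{
    long cells = 0;
    #pragma acc parallel loop reduction(+:cells)
    for (int i = 0; i < n; i++)
        cells += world[i];
    return cells;
}
```

Compiled without OpenACC support, the pragma is a no-op and the function still returns the correct serial count.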
Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
   cells = 0
   !$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c = 1, cols
      do r = 1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.
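The key change above is hoisting the data region out of the generations loop: the arrays stay resident on the device, so there is one H→D transfer on entry and one D→H on exit instead of transfers around every kernel launch. A small C sketch of the same idea (hypothetical helper `relax_iters`, not from the course):

```c
#include <assert.h>

/* The `acc data` region keeps a[] on the device for the whole iteration
 * loop; the inner kernels use present() instead of re-transferring.
 * A plain C compiler ignores the pragmas, so the function also runs
 * correctly (serially) without any offloading support. */
void relax_iters(double *a, int n, int iters)
{
    #pragma acc data copy(a[0:n])
    for (int it = 0; it < iters; it++) {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= 0.5;
    }
}
```

Moving the `copy` from the inner kernels to an enclosing `data` region is exactly what took the Game of Life run from 95 s down to 5 s.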
Annex: OpenACC 2.7 / OpenMP 5.0 translation table
Features: GPU offloading and data transfers (OpenACC 2.7 ↔ OpenMP 5.0)

GPU offloading (compute constructs):
• OpenACC PARALLEL — GR/WS/VS mode; each gang executes the instructions redundantly ↔ OpenMP TEAMS — creates teams of threads that execute redundantly
• OpenACC SERIAL — only one thread executes the code ↔ OpenMP TARGET — GPU offloading, initial thread only
• OpenACC KERNELS — offloading + automatic parallelization ↔ no OpenMP equivalent

Implicit transfers (H→D or D→H):
• Both models implicitly transfer scalar and non-scalar variables.

Explicit transfers (H→D or D→H) inside explicit parallel regions — clauses on PARALLEL/SERIAL/KERNELS (OpenACC) or on TARGET (OpenMP); data movement is defined from the host's perspective:
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D on entry and D→H on exit ↔ map(tofrom:)
• create() — allocated on the device, no transfers ↔ map(alloc:)
• present() / delete()
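The clause mapping in the table can be seen on one kernel written in both dialects (a sketch with our own function names, not course code): `copyin()` corresponds to `map(to:)` and `copy()` to `map(tofrom:)`. Without an offloading compiler both pragmas are ignored and the loops run serially, which is enough to check the numerics.

```c
#include <assert.h>

/* y := y + a*x, OpenACC spelling: x is read-only (copyin), y is
 * read-modify-write (copy). */
void saxpy_acc(int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same kernel, OpenMP 5.0 spelling: map(to:) and map(tofrom:). */
void saxpy_omp(int n, double a, const double *x, double *y)
{
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Note that in both models the direction names (`in`/`out`, `to`/`from`) are from the host's perspective, as the table states.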
Features: data regions and loop parallelization (OpenACC 2.7 ↔ OpenMP 5.0)

Data regions:
• declare() — add data to the global data region
• data / end data — within the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — within the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, with synchronization at the end of each loop ↔ no OpenMP equivalent
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS
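The unstructured directives in the table (`enter data`/`exit data`, `update`) can open and close a device data lifetime anywhere in the code, unlike `data`/`end data`. A sketch of a full round trip in OpenACC spelling (hypothetical function, not course code; the OpenMP equivalents would be `target enter data`, `target update from()` and `target exit data`):

```c
#include <assert.h>

/* Open a device lifetime for a[], run a kernel on it, refresh the host
 * copy, then close the lifetime. Serially (pragmas ignored) the host
 * array is updated directly, so the result is the same. */
void bump_on_device(double *a, int n)
{
    #pragma acc enter data copyin(a[0:n])      /* H->D, lifetime opens  */

    #pragma acc parallel loop present(a[0:n])  /* no transfer: present  */
    for (int i = 0; i < n; i++)
        a[i] += 1.0;

    #pragma acc update self(a[0:n])            /* D->H before host read */

    #pragma acc exit data delete(a[0:n])       /* lifetime closes       */
}
```

The `update self()` is what makes the device results visible to the host mid-lifetime; without it, the host copy would be stale until the data is copied out.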
Features: routines and asynchronism (OpenACC 2.7 ↔ OpenMP 5.0)

• Routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• Asynchronism: async() / wait — execute kernels and transfers asynchronously, with explicit synchronization ↔ nowait / depend — manage asynchronism and task dependencies
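A toy illustration of the asynchronism row, in OpenACC spelling (our own example; OpenMP would use `nowait`/`depend` plus a `taskwait`): two independent kernels are launched on different async queues and an explicit `wait` blocks until both have finished. With the pragmas ignored, execution is serial and the result is identical.

```c
#include <assert.h>

/* Two independent updates that could overlap on the device: each
 * `async(q)` enqueues its kernel on queue q without blocking the host,
 * and the bare `wait` drains all queues before the host continues. */
void add_two_arrays(double *a, double *b, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; i++) a[i] += 1.0;

    #pragma acc parallel loop async(2)
    for (int i = 0; i < n; i++) b[i] += 2.0;

    #pragma acc wait   /* explicit synchronization point */
}
```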
Some examples of loop nests parallelized with OpenACC and OpenMP
OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j = 1, nx
   !$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1D_OMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
   !$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2D_OMP.f90)
OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
   !$acc loop worker
   do j = 1, nx
      !$acc loop vector
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
   !$OMP PARALLEL DO
   do j = 1, nx
      !$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3D_OMP.f90)
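For deep nests like the one above, both models also allow collapsing the loops into a single iteration space instead of assigning one level per parallelism tier. A C variant of the triple nest using `collapse(3)` (our own sketch, not a course example; the OpenACC analogue would be `!$acc parallel loop collapse(3)`):

```c
#include <assert.h>

/* Fill a flattened nx*nx*nx array. collapse(3) merges the three loops
 * into one iteration space shared by teams/threads/SIMD lanes. Without
 * -fopenmp the pragma is ignored and the nest runs serially. */
void fill3d(int nx, double *A)
{
    #pragma omp target teams distribute parallel for simd collapse(3) map(from: A[0:nx*nx*nx])
    for (int k = 0; k < nx; k++)
        for (int j = 0; j < nx; j++)
            for (int i = 0; i < nx; i++)
                A[(k * nx + j) * nx + i] = 1.14;
}
```

Collapsing exposes nx³ iterations to the scheduler at once, which is useful when a single loop level is too short to fill the device.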
Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr