Porting Telemac–Mascaret to OpenPower and
experimenting GPU offloading to accelerate
the Tomawac module
TUC 2019, 16-17 October, CERFACS, Toulouse, France
Judicael Grasset(1), Stephen Longshaw(1), Charles Moulinec(1), David R. Emerson(1)
Yoann Audouin(2), Pablo Tassi(2)
October 17, 2019
(1) STFC, Daresbury Laboratory, Warrington, United Kingdom
(2) EDF R&D, Chatou, France
Computing used
OpenPower architecture in a
nutshell:
• IBM POWER processors
• NVIDIA GPUs
• NVIDIA NVLink
The machine used for this work: Paragon
In our case, each node of the machine used consists of:
• 2 IBM POWER8 processors, with 8 cores each
• Each core has simultaneous multithreading (SMT) capability
• In this case the cores are able to run either 1 thread (SMT1), 2
threads (SMT2), 4 threads (SMT4) or 8 threads (SMT8) at the
same time
• 4 NVIDIA P100 GPUs
• NVIDIA NVLink for GPU–GPU and GPU–CPU interconnections
Porting to OpenPower
• Why? Summit and Sierra, the two most powerful clusters in the world,
are based on an OpenPower architecture (Top500, June 2019)
• Porting to a different architecture may reveal bugs in the code
(increased robustness)
Status of the port:
| Version | PGI 18.10 | GCC 9.1 | XL 16.1.1.1 |
|---|---|---|---|
| v8p0r2 | compiles | compiles | does not compile* |
| trunk (Oct. 2019) | does not compile* | compiles | does not compile* |

*Problem known and solved: it compiles once a small patch is applied.
All tests were done with the Spectrum MPI library.
Experimenting with GPUs
Or trying to port Telemac to the architecture of the ~~future~~ present
The test case
Test case used: tomawac/fetch_limited/tom_test6.cas
• This is a limited test with a small mesh: 75k elements, 32k points.
• It spends all of its time in a single Fortran subroutine: qnlin3.f
• This subroutine was reported to be a bottleneck by some users during
the annual TELEMAC User Conference (2018).
qnlin3.f
In a nutshell:
• do loop
  • init some variables
  • do loop
    • init some variables
    • do loop
      • init some variables
      • do loop
        • tmp_array(x,y,z) = tmp_array(x,y,z) + k
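The nest outlined above can be written as a small stand-alone Fortran program. This is an illustrative stand-in, not the actual qnlin3.f: the loop bounds, the scalars `a`, `b`, `c` and the formula for `k` are invented for the example. The outermost index is deliberately left out of the array subscript, on the assumption that several iterations of the real nest accumulate into the same element.

```fortran
! Illustrative stand-in for the loop structure of qnlin3.f.
! Bounds, variable names and the formula for k are hypothetical.
program qnlin3_structure_sketch
  implicit none
  integer, parameter :: n1 = 2, n2 = 3, n3 = 4, n4 = 5
  double precision :: tmp_array(n4, n3, n2)
  double precision :: a, b, c, k, total
  integer :: i1, i2, i3, i4

  tmp_array = 0.0d0
  do i1 = 1, n1
    a = dble(i1)                      ! init some variables
    do i2 = 1, n2
      b = a + dble(i2)                ! init some variables
      do i3 = 1, n3
        c = b + dble(i3)              ! init some variables
        do i4 = 1, n4
          k = c + dble(i4)
          ! i1 is not part of the subscript, so both i1 iterations
          ! accumulate into the same element.
          tmp_array(i4, i3, i2) = tmp_array(i4, i3, i2) + k
        end do
      end do
    end do
  end do

  total = sum(tmp_array)
  print *, 'sum of tmp_array =', total
  ! Summing (i1+i2+i3+i4) over the 2*3*4*5 iterations gives 1080.
  if (abs(total - 1080.0d0) > 1.0d-9) error stop 'unexpected sum'
end program qnlin3_structure_sketch
```

Run serially, the accumulation is safe. Once iterations run concurrently, two of them may update the same element at the same time, so the update has to be protected.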
Porting to GPUs, methods
Different solutions exist:
• Pragma based: OpenMP, OpenACC
• Library based: Magma, cuBLAS...
• Language extension: CUDA, OpenCL
MPI+OpenACC (PGI compiler) on GPU
Move data to GPU and execute the loop on it.
• !$acc data copy(array)
  • !$acc parallel loop collapse(4)
  • do loop
    • do loop
      • do loop
        • do loop
          • !$acc atomic
          • array(x,y,z) = array(x,y,z) + k
  • ...
• !$acc end data
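A compilable version of this pattern might look as follows. It is a sketch under assumptions, not the authors' code: the array sizes and the value of `k` are invented, and `atomic update` and `private(k)` are spelled out explicitly. Because `collapse(4)` requires the four loops to be tightly nested, any per-level initialisation has to move into the innermost body, which is why `k` is computed there.

```fortran
! Hypothetical OpenACC sketch of the collapsed nest with an atomic update.
! Without an OpenACC compiler the directives are plain comments and the
! program runs serially with the same result.
program qnlin3_openacc_sketch
  implicit none
  integer, parameter :: n1 = 2, n2 = 3, n3 = 4, n4 = 5
  double precision :: array(n4, n3, n2)
  double precision :: k, total
  integer :: i1, i2, i3, i4

  array = 0.0d0

  ! Copy array to the device on entry and back to the host on exit.
  !$acc data copy(array)
  !$acc parallel loop collapse(4) private(k)
  do i1 = 1, n1
    do i2 = 1, n2
      do i3 = 1, n3
        do i4 = 1, n4
          k = dble(i1 + i2 + i3 + i4)
          ! Iterations that differ only in i1 hit the same element,
          ! so the read-modify-write must be atomic.
          !$acc atomic update
          array(i4, i3, i2) = array(i4, i3, i2) + k
        end do
      end do
    end do
  end do
  !$acc end data

  total = sum(array)
  print *, 'sum of array =', total
  if (abs(total - 1080.0d0) > 1.0d-9) error stop 'unexpected sum'
end program qnlin3_openacc_sketch
```

With the PGI compiler, OpenACC is typically enabled with the `-acc` flag. The values here are small integers, so the result does not depend on the order in which the atomic updates are applied.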
Elsewhere during the initialisation of the code, we have linked each MPI
task to a specific GPU.
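One common way to link each MPI task to a GPU is to number the ranks within a node and map that number onto the GPUs visible to the process. The sketch below assumes this round-robin scheme; the slides do not say how the binding was actually implemented, and the program and variable names are hypothetical. It needs an MPI library and an OpenACC compiler to build.

```fortran
! Hypothetical sketch: bind each MPI rank to one GPU with OpenACC.
program bind_rank_to_gpu_acc
  use mpi
  use openacc
  implicit none
  integer :: ierr, node_comm, local_rank, ngpus, my_gpu

  call MPI_Init(ierr)

  ! Group the ranks that share a node, then get the rank within that group.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm, ierr)
  call MPI_Comm_rank(node_comm, local_rank, ierr)

  ngpus = acc_get_num_devices(acc_device_nvidia)
  if (ngpus > 0) then
    ! Round-robin: with 4 GPUs, local ranks 0..3 get GPUs 0..3, rank 4 gets 0.
    my_gpu = mod(local_rank, ngpus)
    call acc_set_device_num(my_gpu, acc_device_nvidia)
    print *, 'local rank', local_rank, 'uses GPU', my_gpu
  else
    print *, 'local rank', local_rank, 'found no NVIDIA GPU'
  end if

  call MPI_Comm_free(node_comm, ierr)
  call MPI_Finalize(ierr)
end program bind_rank_to_gpu_acc
```

On a node with 4 GPUs, running 4 MPI tasks per node gives each task its own device; running more tasks than GPUs makes several tasks share a device.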
MPI+OpenMP (IBM compiler) on GPU
Move data to GPU and execute the loop on it.
• !$omp target data map(array)
  • !$omp target teams distribute parallel do collapse(4)
  • do loop
    • do loop
      • do loop
        • do loop
          • !$omp atomic
          • array(x,y,z) = array(x,y,z) + k
  • ...
• !$omp end target data
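The OpenMP offloading variant can be sketched the same way. As before, this is an assumption-laden illustration rather than the authors' code: sizes and the value of `k` are invented, and the `tofrom` map type (the default for `map`) is written out to make the data movement explicit.

```fortran
! Hypothetical OpenMP target-offload sketch of the collapsed nest.
! Without OpenMP enabled the directives are plain comments and the
! program runs serially with the same result.
program qnlin3_openmp_sketch
  implicit none
  integer, parameter :: n1 = 2, n2 = 3, n3 = 4, n4 = 5
  double precision :: array(n4, n3, n2)
  double precision :: k, total
  integer :: i1, i2, i3, i4

  array = 0.0d0

  ! Map array to the device on entry and back to the host on exit.
  !$omp target data map(tofrom: array)
  !$omp target teams distribute parallel do collapse(4) private(k)
  do i1 = 1, n1
    do i2 = 1, n2
      do i3 = 1, n3
        do i4 = 1, n4
          k = dble(i1 + i2 + i3 + i4)
          ! Iterations that differ only in i1 hit the same element.
          !$omp atomic update
          array(i4, i3, i2) = array(i4, i3, i2) + k
        end do
      end do
    end do
  end do
  !$omp end target data

  total = sum(array)
  print *, 'sum of array =', total
  if (abs(total - 1080.0d0) > 1.0d-9) error stop 'unexpected sum'
end program qnlin3_openmp_sketch
```

The combined `target teams distribute parallel do` construct spreads the collapsed iteration space first across teams and then across the threads of each team, which is the usual way to fill a GPU from a single loop nest.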
Elsewhere during the initialisation of the code, we have linked each MPI
task to a specific GPU.
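With OpenMP, the binding can be expressed through the default target device. The sketch below assumes a round-robin mapping of node-local ranks onto devices; the slides do not describe the actual implementation, and all names are hypothetical. It needs an MPI library and a compiler with OpenMP offloading support.

```fortran
! Hypothetical sketch: bind each MPI rank to one GPU with OpenMP.
program bind_rank_to_gpu_omp
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, node_comm, local_rank, ngpus, my_gpu

  call MPI_Init(ierr)

  ! Rank within the set of MPI tasks that share this node.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm, ierr)
  call MPI_Comm_rank(node_comm, local_rank, ierr)

  ngpus = omp_get_num_devices()
  if (ngpus > 0) then
    my_gpu = mod(local_rank, ngpus)
    ! Later target regions without a device clause run on this device.
    call omp_set_default_device(my_gpu)
    print *, 'local rank', local_rank, 'uses device', my_gpu
  else
    print *, 'local rank', local_rank, 'found no target device'
  end if

  call MPI_Comm_free(node_comm, ierr)
  call MPI_Finalize(ierr)
end program bind_rank_to_gpu_omp
```

Setting the default device once during initialisation means the offloaded loops themselves need no `device` clause.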
Somme test-case
• Somme 7 days
• Telemac2d-Tomawac-Sisyphe
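[Pie chart. Slices as extracted: 20.8%, 6%, 6%, 6.6%, 6.9%, 9.6%, 11.4%, 11.6%, 21.1%. Legend as extracted: other subroutines, semimp, qwind1, propa, fremoy, schar41_per_4d, log, qnlin1, bief_interp. The pairing of slices with legend entries is not recoverable.]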
Inclusion in the codebase
• OpenACC and OpenMP redundancy
• Could be solved with pragma in this case
• But might not always be possible
• Usage of the optional directory
Conclusion
Results achieved:
• Telemac-Mascaret ported to OpenPower
• The port revealed bugs in Telemac-Mascaret and in some compilers
• Good improvement when using GPUs for the qnlin3 subroutine
• Work is ongoing, but will be more difficult for real-world test cases
Acknowledgements
• This work is supported by the Hartree Centre through the Innovation
Return on Research (IROR) programme.
Thank you for your attention
If you think the code is too slow, or uses too much memory for you
(partel, Telemac, Tomawac...)
Please contact us.
Contact:
[email protected] [email protected]