1 © 2016 ANSYS, Inc. November, 2016
Accelerating the ANSYS Fluent R18.0 Radiation Solver with OpenACC
Sunil Sathe, Lead Software Developer
Outline
• Fluent heterogeneous computing (HTC) infrastructure
• PGI/OpenACC for the HTC infrastructure
• Build and execution model
• Discrete ordinate (DO) radiation solver options
• CPU/GPU co-computing
• Sample OpenACC pragma in HTC
• Performance
• Summary
Fluent HTC infrastructure
[Figure: Domain decomposition of the case across MPI Rank 0, MPI Rank 1, MPI Rank 2 and MPI Rank 3, each shown with a GPU; cells are marked as CPU Cell or GPU Cell]
[Figure: Algorithms (OpenACC code) and data structure]
• Abstract Domain, Abstract Cells, Abstract Faces
• CPU Domain, CPU Cells, CPU Faces
• GPU Domain, GPU Cells, GPU Faces
PGI/OpenACC for the HTC infrastructure
● Hardware portability
○ NVIDIA GPUs
○ Intel x86
○ Multi-core
○ OpenPower
○ ARM
● OS portability
○ Windows
○ Linux
● Performance portability
○ Competitive performance with best-in-class compilers and programming models on individual platforms
● Ease of programming model
○ Simple pragma directives
Build and execution model
[Diagram: build and execution flow]
• Fluent Native Source Code: compiled into fluent_mpi.x.y.z, which is executed
• Fluent HTC Source Code: compiled with pgc++ with OpenACC support into libhtc.so, which is dynamically loaded
Discrete ordinate radiation solver options
[Diagram: four configurations, each showing the Fluent Native Solver, the Fluent-HTC DO Solver, a CPU and a GPU]
• CPU Computation in Fluent Native DO Solver
• CPU Computation in Fluent-HTC DO Solver
• GPU Computation in Fluent-HTC DO Solver
• CPU/GPU Hybrid Computation in Fluent-HTC DO Solver
CPU/GPU co-computing
[Timeline: launch GPU kernels asynchronously, then compute on CPU simultaneously, then wait for GPU to finish]
● Divide work between CPU and GPU
● Run the same code on both CPU and GPU
● Use OpenACC “loop” pragma to accelerate the GPU work
Sample OpenACC pragma in HTC
#pragma acc parallel loop async(0) present(doi,ap,s)
for(c=0;c<nc;c++){
    doi[c] = s[c]/ap[c];
    if(doi[c]<0.0) doi[c] = 0.0;
}
● Use asynchronous kernel calls to allow co-computing on CPU
● Use explicit memory upload/download so that the “present” clause can always be used
OpenACC performance in HTC
Problem setup:
• Head lamp simulation of 1.4M and 11.5M cases
• Volume Monitor on Incident Radiation
• Convergence criterion of 1e-3 on volume monitor
CPU Hardware: (Haswell EP) Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, 2 socket x 14 = 28 cores
GPGPU Hardware: Tesla K80 12+12 GB, Driver 346.46
OpenACC performance in HTC (cont)
[Chart: performance results; bar labels 7.8x, 6.1x, 3.6x, 3.1x]
OpenACC performance in HTC (cont)
Problem setup:
• 0.58M case
• 6x6 DO Resolution with 3 Bands
• Flow + Energy + DO
• Single Precision
• 200 iterations
CPU Hardware: (Haswell EP) Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, 2 socket x 14 = 28 cores
GPGPU Hardware: Tesla K80 12+12 GB, Driver 346.46
OpenACC performance in HTC (cont)
[Chart: performance results; bar labels 6.7x, 6.3x, 4.3x, 2.4x]
Summary and future work
• Achievements
− Effectively using OpenACC for heterogeneous computing in Fluent
− Impressive performance achieved in Fluent with the OpenACC programming model
• Future work
− Extend the heterogeneous computing framework to more models
− Investigate more platforms like OpenPower and multi-core