© 2016 ANSYS, Inc. November 2016

Accelerating the ANSYS Fluent R18.0 Radiation Solver with OpenACC

Sunil Sathe, Lead Software Developer


Outline

• Fluent heterogeneous computing (HTC) infrastructure

• PGI/OpenACC for the HTC infrastructure

• Build and execution model

• Discrete ordinate (DO) radiation solver options

• CPU/GPU co-computing

• Sample OpenACC pragma in HTC

• Performance

• Summary


Fluent HTC infrastructure

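[Diagram: domain decomposition of the mesh across MPI ranks 0 to 3, each rank paired with a GPU; cells are marked as CPU cells or GPU cells]

[Diagram: data structure and algorithms. A case holds a CPU domain (CPU cells, CPU faces) and a GPU domain (GPU cells, GPU faces). The algorithms, written as OpenACC code, operate on an abstract domain with abstract cells and abstract faces]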
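One way to picture the abstract-domain idea is a cell range that records which device owns it, with the algorithm written once against that range. The sketch below is only an illustration: `Device`, `CellRange`, `Case`, `split_cells`, and `sum_over` are hypothetical names, not Fluent types, and the contiguous GPU-first layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; these are not Fluent's actual types.
enum class Device { CPU, GPU };

// "Abstract cells": the algorithm only sees a contiguous range of cell indices.
struct CellRange {
    Device device;
    std::size_t begin;
    std::size_t count;
};

// One MPI rank's share of the case, divided into a GPU part and a CPU part.
struct Case {
    CellRange cpu_cells;
    CellRange gpu_cells;
};

// Assumed layout: GPU cells first, CPU cells after them.
Case split_cells(std::size_t n_cells, double gpu_fraction) {
    assert(gpu_fraction >= 0.0 && gpu_fraction <= 1.0);
    const std::size_t n_gpu = static_cast<std::size_t>(n_cells * gpu_fraction);
    return Case{{Device::CPU, n_gpu, n_cells - n_gpu},
                {Device::GPU, 0, n_gpu}};
}

// An algorithm written once against the abstract range; it does not care
// whether the range belongs to the CPU domain or the GPU domain.
double sum_over(const CellRange& r, const std::vector<double>& values) {
    double total = 0.0;
    for (std::size_t c = r.begin; c < r.begin + r.count; ++c) {
        total += values[c];
    }
    return total;
}
```

In this picture, the same `sum_over` call serves both `cpu_cells` and `gpu_cells`; only the range passed in changes.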


PGI/OpenACC for the HTC infrastructure

● Hardware portability

○ NVIDIA GPUs

○ Intel x86

○ Multi-core

○ OpenPower

○ ARM

● OS portability

○ Windows

○ Linux

● Performance portability

○ Competitive performance with best-in-class compilers and programming models on individual platforms

● Ease of programming model

○ Simple pragma directives


Build and execution model

● Fluent native source code: compiled into fluent_mpi.x.y.z, which is executed

● Fluent HTC source code: compiled with pgc++ with OpenACC support into libhtc.so, which is dynamically loaded


Discrete ordinate radiation solver options

[Diagram: four configurations, each showing the Fluent native solver and the Fluent-HTC DO solver over CPU and GPU]

● CPU Computation in Fluent Native DO Solver

● CPU Computation in Fluent-HTC DO Solver

● GPU Computation in Fluent-HTC DO Solver

● CPU/GPU Hybrid Computation in Fluent-HTC DO Solver


CPU/GPU co-computing

[Timeline diagram: launch GPU kernels asynchronously, compute on CPU simultaneously, then wait for GPU to finish]

● Divide work between CPU and GPU

● Run the same code on both CPU and GPU

● Use OpenACC “loop” pragma to accelerate the GPU work
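The three steps above can be sketched as follows. This is a minimal illustration, not Fluent's code: the function names `clamped_ratio` and `co_compute`, the split point `n_gpu`, and the use of `copyin`/`copyout` on the compute construct are assumptions made to keep the example self-contained. Built without OpenACC support, the pragmas are ignored and both loops run on the host.

```cpp
#include <cassert>
#include <vector>

// Per-cell update shared by the GPU loop and the CPU loop
// (hypothetical helper, mirroring "run the same code on both").
#pragma acc routine seq
inline double clamped_ratio(double s, double ap) {
    const double v = s / ap;
    return v < 0.0 ? 0.0 : v;
}

// Cells [0, n_gpu) go to the GPU, cells [n_gpu, nc) stay on the CPU.
void co_compute(double* doi, const double* s, const double* ap,
                int nc, int n_gpu) {
    assert(n_gpu > 0 && n_gpu <= nc);

    // 1. Launch the GPU share asynchronously on queue 0; control returns
    //    to the host as soon as the work has been enqueued.
    #pragma acc parallel loop async(0) \
        copyin(s[0:n_gpu], ap[0:n_gpu]) copyout(doi[0:n_gpu])
    for (int c = 0; c < n_gpu; ++c) {
        doi[c] = clamped_ratio(s[c], ap[c]);
    }

    // 2. Compute the CPU share while the GPU is busy.
    for (int c = n_gpu; c < nc; ++c) {
        doi[c] = clamped_ratio(s[c], ap[c]);
    }

    // 3. Wait for queue 0, so the GPU results are back in host memory.
    #pragma acc wait(0)
}
```

The split point `n_gpu` is left to the caller here; how Fluent chooses the division between CPU and GPU is not stated in the slides.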


Sample OpenACC pragma in HTC

#pragma acc parallel loop async(0) present(doi,ap,s)
for(c=0;c<nc;c++){
  doi[c] = s[c]/ap[c];
  if(doi[c]<0.0) doi[c] = 0.0;
}

● Use asynchronous kernel calls to allow co-computing on CPU

● Use explicit memory upload/download so that the “present” clause can always be used
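A minimal sketch of how explicit upload and download can bracket the kernel so that `present` holds. The function name `solve_on_device` and the choice to upload, compute, and download in a single call are assumptions for illustration; in a real solver the data would more likely stay on the device across many kernels. Without OpenACC support the pragmas are ignored and the loop runs on the host.

```cpp
#include <cassert>
#include <vector>

// Hypothetical wrapper: upload inputs, run the kernel, download the result.
void solve_on_device(double* doi, const double* s, const double* ap, int nc) {
    assert(nc > 0);

    // Explicit upload: copy the inputs to the device and allocate the output.
    #pragma acc enter data copyin(s[0:nc], ap[0:nc]) create(doi[0:nc])

    // The data is now resident on the device, so "present" succeeds and
    // no implicit transfer is attached to the kernel launch.
    #pragma acc parallel loop async(0) present(doi, ap, s)
    for (int c = 0; c < nc; ++c) {
        doi[c] = s[c] / ap[c];
        if (doi[c] < 0.0) doi[c] = 0.0;
    }

    // Explicit download: wait for queue 0, then copy the result to the host.
    #pragma acc wait(0)
    #pragma acc update self(doi[0:nc])

    // Release the device copies.
    #pragma acc exit data delete(s[0:nc], ap[0:nc], doi[0:nc])
}
```

Keeping the transfers in `enter data`, `update`, and `exit data` directives means the kernel itself never moves memory, which is what lets it be launched asynchronously without hidden synchronization on data copies.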


OpenACC performance in HTC

Problem setup:

• Head lamp simulation of 1.4M and 11.5M cases

• Volume Monitor on Incident Radiation

• Convergence criterion of 1e-3 on volume monitor

CPU Hardware: (Haswell EP) Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, 2 socket x 14 = 28 cores

GPGPU Hardware: Tesla K80 12+12 GB, Driver 346.46


OpenACC performance in HTC (cont)

[Chart labels: 7.8x, 6.1x, 3.6x, 3.1x]



Problem setup:

• 0.58M case

• 6x6 DO Resolution with 3 Bands

• Flow + Energy + DO

• Single Precision

• 200 iterations

CPU Hardware: (Haswell EP) Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, 2 socket x 14 = 28 cores

GPGPU Hardware: Tesla K80 12+12 GB, Driver 346.46



[Chart labels: 6.7x, 6.3x, 4.3x, 2.4x]


Summary and future work

• Achievements

− Effectively using OpenACC for heterogeneous computing in Fluent

− Impressive performance achieved in Fluent with the OpenACC programming model

• Future work

− Extend the heterogeneous computing framework to more models

− Investigate more platforms like OpenPower and multi-core
