occa: a unified approach to multi-threading …...occa-enabled applications • discontinuous...

OCCA

We gratefully acknowledge grants from Shell, the Department of Energy, and Argonne National Laboratory, together with support from AMD and the Shell Fellowship.

David Medina*, Amik St-Cyr†, & Tim Warburton*
*Department of Computational and Applied Mathematics, Rice University; †Computation & Modeling, Shell

OCCA: A Unified Approach to Multi-Threading Languages

Finite Difference OCCA Kernel

    // External variables are defined as compiler directives
    occaKernel void fd2d(occaKernelInfoArg,
                         occaPointer double *u1,
                         occaPointer double *u2,
                         occaPointer double *u3){
      occaOuterFor1{
        occaOuterFor0{
          occaShared double Lu[Gj + 2*r][Gi + 2*r]; // Shared storage (tile plus halo)
          occaPrivate(double, r_u2);
          occaPrivate(double, r_u3);

          occaInnerFor1{
            occaInnerFor0{
              const int Li = occaInnerId0;
              const int Lj = occaInnerId1;

              const int i = occaGlobalId0;
              const int j = occaGlobalId1;

              const int id = j*w + i; // w = nodes in x direction
                                      // h = nodes in y direction
              if( (i < w) && (j < h) ){ // Bounds check

                r_u2 = u2[id]; // Global to register memory
                r_u3 = u3[id];

                // Periodic (wrapped) neighbor indices, computed before use
                const int nX1 = (i - r + w) % w;
                const int nY1 = (j - r + h) % h;

                const int nX2 = (i + Gi - r + w) % w;
                const int nY2 = (j + Gj - r + h) % h;

                Lu[Lj][Li] = u2[nY1*w + nX1]; // Global to shared memory

                if(Lj < 2*r)
                  Lu[Lj + Gj][Li] = u2[nY2*w + nX1];
                ...
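The shared-memory load in the kernel above can be illustrated with a small serial sketch. It is not the OCCA code: `load_tile`, `block_i`, and `block_j` are hypothetical names, and the sketch assumes a Gj x Gi work-group tile padded by a halo of width r on every side, filled from a flat periodic array with the same wrapped indexing (`% w`, `% h`) the kernel uses.

```python
def load_tile(u, w, h, block_i, block_j, Gi, Gj, r):
    """Copy one (Gj + 2r) x (Gi + 2r) tile, halo included, out of the flat array u.

    u is stored row by row (index = j*w + i) on a periodic w x h grid.
    block_i, block_j select the work-group (hypothetical names).
    """
    tile = [[0.0] * (Gi + 2 * r) for _ in range(Gj + 2 * r)]
    for Lj in range(Gj + 2 * r):
        for Li in range(Gi + 2 * r):
            # Shift by the halo width and wrap around the domain edges.
            gi = (block_i * Gi + Li - r + w) % w
            gj = (block_j * Gj + Lj - r + h) % h
            tile[Lj][Li] = u[gj * w + gi]
    return tile


# Usage: a 4 x 4 grid, 2 x 2 tiles, stencil radius 1.
grid = [float(k) for k in range(16)]
tile = load_tile(grid, 4, 4, 0, 0, 2, 2, 1)
```

On a GPU each work-item would copy only a few of these entries, which is why the kernel guards its extra halo loads with tests such as `Lj < 2*r`.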

Summary

RiDG

Discontinuous Galerkin seismic forward modeling

Conservation law:

$\frac{\partial q}{\partial t} + \nabla \cdot F = S$

Semi-discrete form:

$\frac{d q_h}{dt} = -\nabla_h \cdot F_h - L_h \left( n_h \cdot \left( F_h^* - F_h \right) \right)$

Volume Kernel

For each element:
  For each volume node:
    Compute divergence at volume node

Surface Flux Kernel

For each element:
  For each surface node:
    Compute flux at surface node
  For each volume node:
    Calculate flux contribution at volume node
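The volume and surface kernel loops can be sketched serially as follows. This is a minimal illustration, not RiDG code: the differentiation matrix `D`, the lift matrix `LIFT`, and the layout of `F` and `flux_jump` (one list per element) are assumptions made for the example, and the source term is omitted as in the semi-discrete form.

```python
def volume_kernel(D, F):
    """For each element and each volume node, compute the discrete divergence D*F."""
    out = []
    for Fe in F:  # one entry per element
        out.append([sum(D[n][m] * Fe[m] for m in range(len(Fe)))
                    for n in range(len(D))])
    return out


def surface_kernel(LIFT, flux_jump):
    """For each element, lift the surface values n.(F* - F) onto the volume nodes."""
    out = []
    for jump in flux_jump:  # jump[f] holds n.(F* - F) at surface node f
        out.append([sum(LIFT[n][f] * jump[f] for f in range(len(jump)))
                    for n in range(len(LIFT))])
    return out


def rhs(D, LIFT, F, flux_jump):
    """dq_h/dt = -div_h F_h - L_h (n_h . (F_h* - F_h)), element by element."""
    vol = volume_kernel(D, F)
    surf = surface_kernel(LIFT, flux_jump)
    return [[-v - s for v, s in zip(ve, se)] for ve, se in zip(vol, surf)]


# Usage: one two-node element on [-1, 1] with flux values 0 and 2.
D = [[-0.5, 0.5], [-0.5, 0.5]]
LIFT = [[1.0, 0.0], [0.0, 1.0]]  # placeholder lift matrix for illustration
dqdt = rhs(D, LIFT, [[0.0, 2.0]], [[0.5, 0.0]])
```

The two kernels are independent per element, which is the parallelism a threaded implementation would map onto work-groups.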

[Figure: estimated GFLOPS versus polynomial order (2 to 6), three panels. GPU Platforms (axis 0 to 1000): OpenCL (Tahiti), OpenCL (Titan), CUDA (Titan). CPU Platforms (axis 0 to 180): OpenCL-AMD, OpenCL-Intel, OpenMP. CPU Platforms (Vec) (axis 0 to 180): OpenCL-AMD, OpenCL-Intel, OpenMP (g++), OpenMP (icpc).]

General Conservative-Law Form

[Figure: platform and hardware pairing diagram. Labels: OCCA; CUDA-x86; hardware targets (CPU processors, NVIDIA GPUs, FPGAs, AMD GPUs, Xeon Phi), listed twice; translation routes (PTX assembly, GPU Ocelot, CU2CL, SWAN, PGI CUDA-x86).]

OCCA-Enabled Applications

• Discontinuous Galerkin seismic forward modeling

RiDG

• Fully accelerated aggregation-based algebraic multigrid

ALMOND3


• Matrix-free preconditioned conjugate gradient
  - ALMOND-enabled
  - Spectral overlapping additive Schwarz preconditioner

High-order Finite Elements for Elliptic Problems
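A matrix-free preconditioned conjugate gradient iteration can be sketched as below. This is an illustration only: a simple diagonal (Jacobi) preconditioner stands in for the ALMOND and additive Schwarz preconditioners named above, and the 1D Laplacian operator is a hypothetical test problem. "Matrix-free" here means the solver only ever calls `apply_A(p)` and never stores a matrix.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcg(apply_A, apply_M_inv, b, tol=1e-10, max_iter=200):
    """Solve A x = b for a symmetric positive definite operator given as a function."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x with x = 0
    z = apply_M_inv(r)          # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = apply_A(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        z = apply_M_inv(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


def apply_laplacian(v):
    """Hypothetical test operator: 1D Laplacian stencil (-1, 2, -1), zero boundaries."""
    n = len(v)
    return [2.0 * v[i] - (v[i - 1] if i > 0 else 0.0) - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def apply_jacobi(r):
    """Stand-in preconditioner: divide by the diagonal entry (2) of the operator."""
    return [ri / 2.0 for ri in r]


# Usage: the right-hand side is the operator applied to a known solution.
x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
x = pcg(apply_laplacian, apply_jacobi, apply_laplacian(x_true))
```

Swapping in a stronger preconditioner only changes `apply_M_inv`; the iteration itself is unchanged.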

Existing Approaches to Cross-Platform CUDA

• Current approaches translate CUDA for non-NVIDIA devices

• Source-level tools translate CUDA to OpenCL, relying on the similarity between the two languages

• GPU Ocelot translates at the GPU assembly level (PTX, CAL, …)

OCCA: Platform and Hardware Pairing

• Unifies heterogeneous platforms by abstracting language APIs

• Takes advantage of the similarity in platform optimization techniques

• Macro-based kernel language masks the differences between the supported languages
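The idea of a macro layer that masks the backend language can be illustrated with a toy text substitution. This is a sketch under assumptions, not OCCA's implementation: the table below is a hypothetical mapping chosen for illustration, although the right-hand sides (`__global__`, `__shared__`, `threadIdx.x`, `__kernel`, `__local`, `get_local_id(0)`) are standard CUDA and OpenCL spellings.

```python
import re

# Hypothetical macro table: one kernel keyword, one spelling per backend.
MACROS = {
    "cuda": {
        "occaKernel": "__global__",
        "occaShared": "__shared__",
        "occaInnerId0": "threadIdx.x",
    },
    "opencl": {
        "occaKernel": "__kernel",
        "occaShared": "__local",
        "occaInnerId0": "get_local_id(0)",
    },
}


def expand(source, backend):
    """Replace each known keyword in source with its backend-specific spelling."""
    table = MACROS[backend]
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in table) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], source)


# Usage: the same kernel line expanded for two backends.
line = "const int Li = occaInnerId0;"
cuda_line = expand(line, "cuda")
opencl_line = expand(line, "opencl")
```

In a real macro-based layer the C preprocessor would do this substitution at kernel compile time, so a single kernel source serves every backend.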

Acoustic Wave Equation

• Simplest wave model

$p_{tt} - c^2 \Delta p = 0$
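One explicit time step for this equation can be sketched with second-order centered differences in time and space. This is a serial illustration under assumptions, not the poster's kernel: it uses a radius-1 stencil, a periodic grid (matching the wrapped `% w` indexing of the finite difference kernel), and hypothetical names `wave_step`, `u_prev`, and `u_curr`.

```python
def wave_step(u_prev, u_curr, c, dt, dx):
    """Leapfrog update: u_next = 2*u_curr - u_prev + (c*dt/dx)^2 * (5-point Laplacian)."""
    h = len(u_curr)      # nodes in y direction
    w = len(u_curr[0])   # nodes in x direction
    coef = (c * dt / dx) ** 2
    u_next = [[0.0] * w for _ in range(h)]
    for j in range(h):
        for i in range(w):
            # 5-point Laplacian with periodic wrap-around at the edges
            lap = (u_curr[j][(i - 1) % w] + u_curr[j][(i + 1) % w]
                   + u_curr[(j - 1) % h][i] + u_curr[(j + 1) % h][i]
                   - 4.0 * u_curr[j][i])
            u_next[j][i] = 2.0 * u_curr[j][i] - u_prev[j][i] + coef * lap
    return u_next


# Usage: a unit spike at node (1, 1) on a 4 x 4 grid, initially at rest at zero.
zeros = [[0.0] * 4 for _ in range(4)]
spike = [[0.0] * 4 for _ in range(4)]
spike[1][1] = 1.0
u_next = wave_step(zeros, spike, c=1.0, dt=0.5, dx=1.0)
```

Larger stencil sizes, as on the benchmark axis, widen the Laplacian to more neighbors per direction but keep the same update structure.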

Parallelism*:

• The programmer has to expose parallelism in memory access and operations.

Performance*:

• Naively porting OpenMP to CUDA or OpenCL will likely yield low performance.

• Manufacturers devote resources to enhancing specific APIs.

Uncertainty:

• Code life cycles are measured in decades.

• Architecture and API life cycles are measured in Moore doubling periods.

• Example: if you coded for the IBM Cell processor API…

Portability:

• CUDA, OpenCL, OpenMP, OpenACC, Intel TBB, etc., are not code compatible.

• Not all APIs are installed on any given system.

[Figure: estimated MNodes/s versus (1D) stencil size (5, 7, 9, 11, 13), two panels. CPU Platforms (axis 0 to 700): OpenCL-AMD, OpenCL-Intel, OpenMP-icpc, OpenMP-g++. GPU Platforms (axis 0 to 13000): OpenCL (Tahiti), OpenCL (Titan), CUDA (Titan).]

Project References

1. J. Hesthaven & T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, vol. 54, Springer, 2008.

2. R. Gandham, K. Esler, & Y. Zhang, A GPU accelerated aggregation algebraic multigrid method, Computers & Mathematics with Applications.

3. D. Medina, R. Gandham, & T. Warburton, gNek: A GPU accelerated spectral-element Navier-Stokes solver, (in progress).
