
Post on 07-Jun-2020


OCCA: A Unified Approach to Multi-Threading Languages

David Medina*, Amik St-Cyr†, & Tim Warburton*
*Department of Computational and Applied Mathematics, Rice University; †Computation & Modeling, Shell

We gratefully acknowledge grants from Shell, the Department of Energy, and Argonne National Laboratory, together with support from AMD and the Shell Fellowship.

Finite Difference OCCA Kernel

    // External variables are defined as compiler directives
    occaKernel void fd2d(occaKernelInfoArg,
                         occaPointer double *u1,
                         occaPointer double *u2,
                         occaPointer double *u3){
      occaOuterFor1{
        occaOuterFor0{
          occaShared double Lu[Gi + 2*r][Gj + 2*r]; // Shared storage
          occaPrivate(double, r_u2);
          occaPrivate(double, r_u3);

          occaInnerFor1{
            occaInnerFor0{
              const int Li = occaInnerId0;
              const int Lj = occaInnerId1;

              const int i = occaGlobalId0;
              const int j = occaGlobalId1;

              const int id = j*w + i; // w = nodes in x direction
                                      // h = nodes in y direction
              if( (i < w) && (j < h) ){ // Bounds check

                r_u2 = u2[id]; // Global to register memory
                r_u3 = u3[id];

                // Periodic neighbor indices (declared before first use)
                const int nX1 = (i - r + w) % w;
                const int nY1 = (j - r + h) % h;

                const int nX2 = (i + Gi - sr + w) % w;
                const int nY2 = (j + Gj - sr + h) % h;

                Lu[Lj][Li] = u2[nY1*w + nX1]; // Global to shared memory

                if(Lj < 2*r)
                  Lu[Lj + Gy][Li] = u2[nY2*w + nX1];

                // ... remainder of the kernel is not shown
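The index arithmetic in the kernel above wraps neighbor lookups periodically, so a work-group can load its tile plus a halo of width r without special cases at the domain edge. The serial C sketch below illustrates that idea on the host. The helper names `wrap_index` and `load_tile` are hypothetical (they are not part of OCCA), and the row-major tile layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Periodic wrap, generalizing the kernel's (i - r + w) % w to any integer i. */
int wrap_index(int i, int n) {
  assert(n > 0);
  const int m = i % n;
  return (m < 0) ? m + n : m;
}

/* Copy a (gi + 2r) x (gj + 2r) tile, whose interior starts at global node
   (i0, j0), from a periodic w x h grid u (row-major, x fastest) into tile. */
void load_tile(const double *u, int w, int h,
               int i0, int j0, int gi, int gj, int r,
               double *tile) {
  const int tw = gi + 2 * r;  /* tile width including halo  */
  const int th = gj + 2 * r;  /* tile height including halo */
  for (int lj = 0; lj < th; ++lj) {
    const int j = wrap_index(j0 + lj - r, h);
    for (int li = 0; li < tw; ++li) {
      const int i = wrap_index(i0 + li - r, w);
      tile[(size_t)lj * tw + li] = u[(size_t)j * w + i];
    }
  }
}
```

A host-side reference like this can be used to check what a device kernel places in shared storage, since both read the same wrapped global indices.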

Summary

RiDG: Discontinuous Galerkin seismic forward modeling

General conservative-law form:

$$\frac{\partial q}{\partial t} + \nabla \cdot F = S$$

Semi-discrete form:

$$\frac{d q_h}{dt} = -\nabla_h \cdot F_h - L_h\left( n_h \cdot \left( F_h^* - F_h^- \right) \right)$$

Volume Kernel

For each element:
  For each volume node:
    Compute divergence at volume node

Surface Flux Kernel

For each element:
  For each surface node:
    Compute flux at surface node
  For each volume node:
    Calculate flux contribution at volume node
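The volume kernel loop structure can be sketched in serial C for a 1D scalar problem, where the divergence at each node reduces to applying a differentiation matrix to the nodal flux values. This is a minimal illustration and not the RiDG implementation. The names `volume_kernel`, `Dr`, and `rx` are assumptions, as is the linear flux F = a q.

```c
#include <assert.h>
#include <stddef.h>

/* rhs[e][n] = -rx[e] * sum_m Dr[n][m] * F(q[e][m]), with assumed flux F(q) = a*q.
   K elements, Np nodes per element. Dr is the Np x Np (row-major) reference
   differentiation matrix; rx[e] is d(reference coordinate)/dx on element e. */
void volume_kernel(int K, int Np, const double *Dr, const double *rx,
                   double a, const double *q, double *rhs) {
  assert(K > 0 && Np > 0);
  for (int e = 0; e < K; ++e) {       /* for each element */
    for (int n = 0; n < Np; ++n) {    /* for each volume node */
      double div = 0.0;               /* divergence at this node */
      for (int m = 0; m < Np; ++m)
        div += Dr[(size_t)n * Np + m] * (a * q[(size_t)e * Np + m]);
      rhs[(size_t)e * Np + n] = -rx[e] * div;
    }
  }
}
```

On an accelerator, the element loop would typically map to the outer (work-group) level and the node loop to the inner (work-item) level. The surface flux kernel follows the same pattern with an extra pass over surface nodes.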

[Figure: estimated GFLOPS versus polynomial order (2 to 6), three panels. GPU Platforms (axis 0 to 1000): OpenCL (Tahiti), OpenCL (Titan), CUDA (Titan). CPU Platforms (axis 0 to 180): OpenCL-AMD, OpenCL-Intel, OpenMP. CPU Platforms (Vec) (axis 0 to 180): OpenCL-AMD, OpenCL-Intel, OpenMP (g++), OpenMP (icpc).]

[Diagram: OCCA paired with hardware targets: CPU processors, NVIDIA GPUs, FPGAs, AMD GPUs, Xeon Phi.]

[Diagram: existing CUDA translation tools (PTX Assembly, GPU Ocelot, CU2CL, SWAN, PGI: CUDA-x86) and hardware targets: CPU processors, NVIDIA GPUs, FPGAs, AMD GPUs, Xeon Phi.]

OCCA-Enabled Applications

• RiDG: Discontinuous Galerkin seismic forward modeling

• ALMOND: Fully accelerated aggregation-based algebraic multigrid

[Figure: graph with nodes numbered 1 to 12 and a smaller graph with nodes numbered 1 to 3.]

• High-order Finite Elements for Elliptic Problems: matrix-free preconditioned conjugate gradient
  - ALMOND-enabled
  - Spectral overlapping additive Schwarz preconditioner

Existing Approaches to Cross-Platform CUDA

• Current approaches translate CUDA for non-NVIDIA devices

• Some translate to OpenCL, exploiting the similarity between the two languages

• GPU-Ocelot translates at the GPU assembly level (PTX, CAL, …)

OCCA: Platform and Hardware Pairing

• Unifies heterogeneous platforms by abstracting language APIs

• Takes advantage of the similarity in platform optimization techniques

• Macro-based kernel language masks the different supported languages
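To illustrate how a macro layer can mask the target language, the sketch below writes a kernel once and expands its loop macros to plain serial C. The macro names (`MY_KERNEL`, `MY_OUTER_FOR0`, `MY_INNER_FOR0`, `MY_GLOBAL_ID0`) and the `launch_dims` struct are hypothetical stand-ins, not OCCA's actual definitions. A different back end would redefine the same macros, for example mapping outer and inner loops to work-groups and work-items, without touching the kernel body.

```c
#include <assert.h>

/* Hypothetical serial back end: outer/inner loops become nested for-loops. */
#define MY_KERNEL void
#define MY_OUTER_FOR0 for (int outer0 = 0; outer0 < dims->outer; ++outer0)
#define MY_INNER_FOR0 for (int inner0 = 0; inner0 < dims->inner; ++inner0)
#define MY_GLOBAL_ID0 (outer0 * dims->inner + inner0)

/* Launch configuration: number of outer groups and inner items per group. */
typedef struct { int outer; int inner; } launch_dims;

/* Kernel written once against the macros: y = alpha*x + y. */
MY_KERNEL axpy(const launch_dims *dims, int n, double alpha,
               const double *x, double *y) {
  assert(dims->outer > 0 && dims->inner > 0);
  MY_OUTER_FOR0 {
    MY_INNER_FOR0 {
      const int i = MY_GLOBAL_ID0;
      if (i < n)  /* bounds check: the launch may cover more ids than n */
        y[i] += alpha * x[i];
    }
  }
}
```

The design choice is that the kernel body only ever sees the macro vocabulary, so the loop structure that exposes parallelism is fixed in the source while its realization is chosen at compile time.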

Acoustic Wave Equation

• Simplest wave model

$$p_{tt} - c^2 \Delta p = 0$$
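One common way to advance this equation in time is a second-order leapfrog update over three time levels, which is consistent with the three field pointers passed to the finite difference kernel. The poster does not state its discretization, so the sketch below is an assumption: a 1D periodic grid, a 3-point Laplacian, and the hypothetical names `wave_step`, `u_prev`, `u_curr`, and `u_next`.

```c
#include <assert.h>

/* Leapfrog step for p_tt = c^2 * Laplacian(p) on a 1D periodic grid:
   u_next[i] = 2*u_curr[i] - u_prev[i]
             + (c*dt/dx)^2 * (u_curr[i-1] - 2*u_curr[i] + u_curr[i+1]) */
void wave_step(int n, double c, double dt, double dx,
               const double *u_prev, const double *u_curr, double *u_next) {
  assert(n > 0 && dx > 0.0);
  const double s = (c * dt / dx) * (c * dt / dx);
  for (int i = 0; i < n; ++i) {
    const int im = (i - 1 + n) % n;  /* periodic left neighbor  */
    const int ip = (i + 1) % n;      /* periodic right neighbor */
    const double lap = u_curr[im] - 2.0 * u_curr[i] + u_curr[ip];
    u_next[i] = 2.0 * u_curr[i] - u_prev[i] + s * lap;
  }
}
```

With c*dt/dx = 1, a unit pulse at one node moves entirely to its two neighbors in a single step, and a constant field is left unchanged, which makes both cases convenient sanity checks.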

Parallelism*:

• The programmer has to expose parallelism in memory access and operations.

Performance*:

• Naively porting OpenMP to CUDA or OpenCL will likely yield low performance.

• Manufacturers devote resources to enhancing specific APIs.

Uncertainty:

• Code life cycle measured in decades.

• Architecture & API life cycles measured in Moore doubling periods.

• Example: if you coded for the IBM Cell processor API…

Portability:

• CUDA, OpenCL, OpenMP, OpenACC, Intel TBB, etc. are not code compatible.

• Not all APIs are installed on any given system.

[Figure: estimated MNodes/s versus (1D) stencil size (5, 7, 9, 11, 13), two panels. CPU Platforms (axis 0 to 700): OpenCL-AMD, OpenCL-Intel, OpenMP-icpc, OpenMP-g++. GPU Platforms (axis 0 to 13000): OpenCL (Tahiti), OpenCL (Titan), CUDA (Titan).]

Project References

1. J. Hesthaven & T. Warburton, Nodal discontinuous Galerkin methods: algorithms, analysis, and applications, vol. 54, Springer, 2008.

2. R. Gandham, K. Esler, & Y. Zhang, A GPU accelerated aggregation algebraic multigrid method, Computers & Math with Applications.

3. D. Medina, R. Gandham, & T. Warburton, gNek: A GPU accelerated spectral-element Navier-Stokes solver, (in progress).