OCCA
We gratefully acknowledge grants from Shell, the Department of Energy, and Argonne National Laboratory, together with support from AMD and the Shell Fellowship.
David Medina*, Amik St-Cyr†, & Tim Warburton*
*Department of Computational and Applied Mathematics, Rice University; †Computation & Modeling, Shell
OCCA: A Unified Approach to Multi-Threading Languages
Finite Difference OCCA Kernel

// External variables are defined as compiler directives
occaKernel void fd2d(occaKernelInfoArg,
                     occaPointer double *u1,
                     occaPointer double *u2,
                     occaPointer double *u3){
  occaOuterFor1{
    occaOuterFor0{
      occaShared double Lu[Gi + 2*r][Gj + 2*r]; // Shared storage
      occaPrivate(double, r_u2);
      occaPrivate(double, r_u3);

      occaInnerFor1{
        occaInnerFor0{
          const int Li = occaInnerId0;
          const int Lj = occaInnerId1;

          const int i = occaGlobalId0;
          const int j = occaGlobalId1;

          const int id = j*w + i; // w = nodes in x direction
                                  // h = nodes in y direction
          if( (i < w) && (j < h) ){ // Bounds check

            r_u2 = u2[id]; // Global to register memory
            r_u3 = u3[id];

            // Periodic neighbor indices, declared before first use
            const int nX1 = (i - r + w) % w;
            const int nY1 = (j - r + h) % h;

            const int nX2 = (i + Gi - sr + w) % w;
            const int nY2 = (j + Gj - sr + h) % h;

            Lu[Lj][Li] = u2[nY1*w + nX1]; // Global to shared memory

            if(Lj < 2*r)
              Lu[Lj + Gy][Li] = u2[nY2*w + nX1];
Summary
RiDG
Discontinuous Galerkin seismic forward modeling

Governing equation:
$$\frac{\partial q}{\partial t} + \nabla \cdot F = S$$

Semi-discrete form:
$$\frac{dq_h}{dt} = -\nabla_h \cdot F_h - L_h\left( n_h \cdot \left( F_h^* - F_h \right) \right)$$

Volume Kernel
For each element:
  For each volume node:
    Compute divergence at volume node

Surface Flux Kernel
For each element:
  For each surface node:
    Compute flux at surface node
  For each volume node:
    Calculate flux contribution at volume node
[Figure: Est. GFLOPS versus polynomial order (2 to 6), three panels.
GPU Platforms (axis 0 to 1000): OpenCL (Tahiti), OpenCL (Titan), CUDA (Titan).
CPU Platforms (axis 0 to 180): OpenCL-AMD, OpenCL-Intel, OpenMP.
CPU Platforms (Vec) (axis 0 to 180): OpenCL-AMD, OpenCL-Intel, OpenMP (g++), OpenMP (icpc).]
General Conservative-Law Form
[Diagrams: hardware targets (CPU processors, NVIDIA GPUs, FPGAs, AMD GPUs, Xeon Phi) shown for OCCA and for CUDA-x86. Diagram labels for existing CUDA translation paths: PTX Assembly, GPU Ocelot, CU2CL, SWAN, PGI: CUDA-x86.]
OCCA-Enabled Applications
• Discontinuous Galerkin seismic forward modeling
RiDG
• Fully accelerated aggregation-based algebraic multigrid
ALMOND
• Matrix-free preconditioned conjugate gradient
  - ALMOND-enabled
  - Spectral overlapping additive Schwarz preconditioner
High-order Finite Elements for Elliptic Problems
• Current approaches translate CUDA for non-NVIDIA devices
• Translate to OpenCL due to similarity in languages
• GPU-Ocelot translates at the GPU assembly level (PTX, CAL, …)
Existing Approaches to Cross-Platform CUDA
• Unifies heterogeneous platforms by abstracting language APIs
• Takes advantage of the similarity in platform optimization techniques
• Macro-based kernel language masks different supported languages
OCCA: Platform and Hardware Pairing
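The idea of a macro-based kernel language can be illustrated with a small C++ sketch. The macro names `SKETCH_OUTER_FOR` and `SKETCH_INNER_FOR` and the function `scale` are hypothetical and are not OCCA's actual definitions; the sketch only assumes that, for a serial or OpenMP-style CPU backend, the outer (work-group) and inner (work-item) levels can expand to ordinary for loops, while a GPU backend would presumably map the same macros to block and thread indices.

```cpp
#include <vector>

// Hypothetical macros (NOT OCCA's real definitions): on a CPU backend the
// outer and inner parallel levels expand to plain nested for loops.
#define SKETCH_OUTER_FOR(o, nOuter) for (int o = 0; o < (nOuter); ++o)
#define SKETCH_INNER_FOR(i, nInner) for (int i = 0; i < (nInner); ++i)

// Scale every entry of x by a. The loop body is written once against the
// macros, so only the macro definitions would change per backend.
void scale(std::vector<double>& x, double a, int nInner) {
  const int n = static_cast<int>(x.size());
  const int nOuter = (n + nInner - 1) / nInner; // number of blocks, rounded up
  SKETCH_OUTER_FOR(o, nOuter) {
    SKETCH_INNER_FOR(i, nInner) {
      const int id = o * nInner + i;            // global index
      if (id < n) x[id] *= a;                   // bounds check, as in the FD kernel
    }
  }
}
```

Keeping the two-level loop structure explicit in the source is what lets one kernel body follow the same optimization pattern on each platform.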
• Simplest wave model
Acoustic Wave Equation
$$p_{tt} - c^2 \Delta p = 0$$
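A serial reference for one time step of this equation may help when reading the finite difference kernel. The sketch below assumes a second-order leapfrog update, p_new = 2 p_now - p_old + (c dt / dx)^2 (5-point Laplacian of p_now), on a periodic w x h grid with equal spacing in both directions. The function name `leapfrogStep` and the stencil radius of 1 are assumptions for illustration; the poster's kernel uses a general radius r and shared-memory tiling.

```cpp
#include <vector>

// One leapfrog step for p_tt - c^2 * Laplacian(p) = 0 on a periodic grid.
// pOld and pNow hold the pressure at t - dt and t (row-major, w nodes in x,
// h nodes in y); the return value is the pressure at t + dt.
std::vector<double> leapfrogStep(const std::vector<double>& pOld,
                                 const std::vector<double>& pNow,
                                 int w, int h, double c, double dt, double dx) {
  std::vector<double> pNew(pNow.size());
  const double k = (c * dt / dx) * (c * dt / dx);
  for (int j = 0; j < h; ++j) {
    for (int i = 0; i < w; ++i) {
      // Periodic neighbors, using the same wrap-around as the kernel above
      const int iL = (i - 1 + w) % w, iR = (i + 1) % w;
      const int jD = (j - 1 + h) % h, jU = (j + 1) % h;
      const int id = j * w + i;
      const double lap = pNow[j * w + iL] + pNow[j * w + iR]
                       + pNow[jD * w + i] + pNow[jU * w + i]
                       - 4.0 * pNow[id];
      pNew[id] = 2.0 * pNow[id] - pOld[id] + k * lap;
    }
  }
  return pNew;
}
```

Each grid node is updated independently of the others within a step, which is the parallelism a threaded kernel exposes.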
• The programmer has to expose parallelism in memory access and operations.
Parallelism*:
• Naively porting OpenMP code to CUDA or OpenCL will likely yield low performance.
• Manufacturers devote resources to enhancing specific APIs.
Performance*:
• Code life cycle measured in decades.
• Architecture & API life cycles measured in Moore doubling periods.
• Example: if you coded for the IBM Cell processor API…
Uncertainty:
• CUDA, OpenCL, OpenMP, OpenACC, Intel TBB, etc., are not code compatible.
• Not all APIs are installed on any given system.
Portability:
[Figure: Est. MNodes/s versus (1D) stencil size (5, 7, 9, 11, 13), two panels.
CPU Platforms (axis 0 to 700): OpenCL-AMD, OpenCL-Intel, OpenMP-icpc, OpenMP-g++.
GPU Platforms (axis 0 to 13000): OpenCL (Tahiti), OpenCL (Titan), CUDA (Titan).]
Project References
1. J. Hesthaven & T. Warburton, Nodal discontinuous Galerkin methods: algorithms, analysis, and applications, vol. 54, Springer, 2008.
2. R. Gandham, K. Esler, & Y. Zhang, A GPU accelerated aggregation algebraic multigrid method, Computers & Math with Applications.
3. D. Medina, R. Gandham, & T. Warburton, gNek: A GPU accelerated spectral-element Navier-Stokes solver, (in progress).