opencl programming in detail - home -...

63
OpenCL Programming in Detail N-Body Algorithm Tutorial David Richie | November 2010

Upload: others

Post on 16-Mar-2020

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

OpenCL Programming in DetailN-Body Algorithm Tutorial

David Richie | November 2010

Page 2: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20102

Agenda

Brief review of OpenCL™ concepts

Use of STDCL to simplify host code programming

Code walk-through: N-body algorithm tutorial

Page 3: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20103

Hybrid CPU/GPU Architectures and OpenCL™

CPU CPUMemory

PCIe

GPU GPU

Memory

GPU GPU

Memory

Issues

– Distributed memory management

– Concurrency

– Platform/vendor portability of APIs

OpenCL provides ...

– Platform and runtime layer for managing concurrent execution of operations across multiple devices

– C language extension for programming devices such as CPUs/GPUs

– Platform/device independent API with broad industry support

Page 4: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20104

Anatomy of OpenCL™

Language Specification

– Based on ISO C99 with added extension and restrictions

Platform API

– Application routines to query system and setup OpenCL™ resources

Runtime API

– Manage kernels objects, memory objects, and executing kernels on OpenCL™ devices

Page 5: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20105

OpenCL™ Architecture – Execution Model

Kernel:

– Basic unit of executable code that runs on OpenCL™ devices

– Data-parallel or task-parallel

Host program:

– Executes on the host system

– Sends kernels to execute on OpenCL™ devices using command queue

Page 6: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20106

OpenCL™ C Language - Kernel Programming

Language based on ISO C99

– Some restrictions

Additions to language for parallelism

– Vector types

– Work-items/group functions

– Synchronization

Address Space Qualifiers

Built-in Functions

Page 7: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20107

Kernels – Expressing Data-Parallelism

Define N-dimensional computation domain

– N = 1, 2, or 3

– Each element in the domain is called a work-item

– N-D domain (global dimensions) defines the total work-items that execute in parallel

– Each work-item executes the same kernel

Example:

– Processing a 1024x1024 image, kernel would be executed 1,048,596 times over a 2D computational domain

Page 8: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20108

Kernels: Work-item and Work-group

Work-items are grouped into work-groups

– Local dimensions define the size of the work-groups

– Execute together on same compute unit

– Share local memory and synchronization

32

32

Synchronization between

work-items possible only

within work-groups

Cannot synchronize

between workgroups

Page 9: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 20109

Execution Model – Host Program

Create “Context” to manage OpenCL™ resources:

– Devices – OpenCL™ devices to execute kernels

– Program Objects - source or binary that implements kernel functions

– Kernels – the specific function to execute on the devices

– Memory Objects – memory buffers common to the host and devices supporting distributed memory management

Page 10: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201010

Execution Model – Command Queue

Manage execution of kernels

Accepts:

– Kernel execution commands

– Memory commands

– Synchronization commands

Queued in-order

Execute in-order or out-of-order

Page 11: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201011

Memory Model

Global – read and write by all work-items and work-groups

Constant – read-only by work-items; read and write by host

Local – used for data sharing; read/write by work-items in same work-group

Private – only accessible to one work-item

Host Memory

Global/Constant Memory

Local Memory

Work-item

Work-item

Private Memory

Private Memory

Workgroup

Host

Compute Device

Local Memory

Work-item

Work-item

Private Memory

Private Memory

Workgroup

Memory management is explicit

– Must move data from host to global to local and back

Page 12: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201012

Synchronization

OpenCL™ is designed to support concurrent execution of kernels and memory transfers across multiple devices

Programmer is responsible for synchronization using events and/or blocking calls

– Events can be used to prevent one operation from executing until one or more operations have finished

– Programmer can explicitly block on one or more events on the host side

Page 13: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201013

STDCL: A Simplified Interface to OpenCL™

Page 14: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201014

STDCL

Idea:

– OpenCL™ provides explicit, platform/device-independent control over execution and data movement

– In practice this can be tedious, the syntax/semantics can be verbose

– Provide simplified interface based on typical use cases in a familiar UNIX style

Here the host code will use STDCL for simplicity

– Allows focus on the concepts, which are complicated enough, without getting lost in low-level syntax

– No restrictions on the functionality provided by OpenCL

– Use of such an API is inevitable for any serious programming project

– Free, open-source, LGPLv3 license

Page 15: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201015

STDCL (1/4)

Obtaining an OpenCL™ compute layer "context"

– OpenCL: (1) query platforms, (2) select platform, (3) get devices, (4) create contexts for each device, (5) create command queues for each device

– STDCL provides default contexts stddev, stdcpu, stdgpu, ... that

are "ready to go"

#include “stdcl.h”

CONTEXT* stddev; // contains all devices

CONTEXT* stdcpu; // contains all CPU devices

CONTEXT* stdgpu; // contains all GPU devices

Link with -lstdcl

Page 16: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201016

STDCL (2/4)

Managing OpenCL™ kernels

– OpenCL: (1) manage program text, (2) create program, (3) build program, (4) create kernel

– STDCL provides clopen, clsym, clclose

– Use:

#include “stdcl.h”

void* clh = clopen( stdgpu, "nbody_kern.cl", CLLD_NOW );

cl_kernel krn = clsym( stdgpu, clh, "nbody_kern", CLLD_NOW );

...

clclose( stdgpu, clh );

Page 17: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201017

STDCL (3/4)

OpenCL™ memory management

– OpenCL: requires use of opaque memory buffers, enqueueingreadbuffer/writebuffer commands

– STDCL provides clmalloc, clfree for allocation of shareable memory

across OpenCL devices

– Use:

#include “stdcl.h”

cl_float4* pos = (cl_float4*)clmalloc( stdgpu, np*sizeof(cl_float4), 0);

...

clfree(pos);

Page 18: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201018

STDCL (4/4)

Managing execution of concurrent events

– OpenCL: requires enqueueing and managing events

– STDCL provides clmsync, clfork, clwait

– Use:

#include “stdcl.h”

clmsync( stdgpu, 0, pos, CL_MEM_DEVICE|CL_EVENT_NOWAIT );

...

clfork( stdgpu, 0, krn, &ndr, CL_EVENT_NOWAIT );

...

clmsync( stdgpu, 0, pos, CL_MEM_HOST|CL_EVENT_NOWAIT );

...

clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );

Page 19: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201019

Code Walk-Through: N-Body Algorithm Tutorial

Page 20: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201020

Code Walk-Through: N-Body Algorithm

Basic N-body algorithm

OpenCL program structure

Implementation:

– Kernel code

– Host code

Compilation

Two-GPU implementation

– Modified kernel code

– Modified host code

Page 21: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201021

rj - ri

| rj – ri |3mj

i j

fi =

| rj – ri |

ri

rj

Basic N-Body Algorithm

Models motion of N particles subject to particle-particle interaction, e.g.,

– Gravitational force

– Charged particles

Computation is O(N2)

Algorithm has two main steps:

– Calculate total force on each particle

– Update particle position/velocity over some small time-step (Newtonian dynamics)

Entire (unoptimized) algorithm can be written in C with a few dozen lines of code

Page 22: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201022

Basic N-Body Code (1/2)

// For each particle "i" ...1 for(i=0; i<n; i++) {2 ax = 0.0f;3 ay = 0.0f;4 az = 0.0f;

// Loop over all particles "j" and accumulate force on particle "i"5 for(j=0; j<n; j++) {6 dx = x[j] - x[i]; 7 dy = y[j] - y[i]; 8 dz = z[j] - z[i];9 invr = 1.0f/sqrt(dx * dx + dy * dy + dz * dz + eps);10 invr3 = invr * invr * invr;11 f = m[j]*invr3;12 ax += f * dx;13 ay += f * dy;14 az += f * dx;15 }

For each particle "i" accumulate all forces due to particle "j"

Page 23: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201023

Basic N-Body Code (2/2)

// update position and velocity of particle "i"16 x_new[i] = x[i] + dt * vx[i] + 0.5f * dt * dt * ax; 17 y_new[i] = y[i] + dt * vy[i] + 0.5f * dt * dt * ay;18 z_new[i] = z[i] + dt * vz[i] + 0.5f * dt * dt * az;19 vx[i] += dt * ax; 20 vy[i] += dt * ay;21 vz[i] += dt * az;22 }

// copy updated positions back into original arrays23 for(i=0; i<n; i++) {24 x[i] = x_new[i];25 y[i] = y_new[i];26 z[i] = z_new[i];27 }

Update positions/velocities using the acceleration (a=F/m)

Repeat

Page 24: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201024

OpenCL Program Structure

Page 25: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201025

OpenCL Program Structure

OpenCL implementation will consist of two parts:

Kernel code

– Compiled to run on the GPU, performs the actual computation

– Typically based on critical loops within larger program

– Intended to provide accelerated version of a given algorithm

Host code

– Performs no meaningful computations, still important

– Especially important for using multiple devices

– Initialization and bookkeeping tasks

– Coordinate operations on the OpenCL device(s), e.g.,

Memory management

Kernel execution

Page 26: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201026

Implementation: Kernel Code

Page 27: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201027

Kernel Code

Goal is to provide a reasonably standard implementation that is understandable

Attempt to use good practices from an OpenCL perspective, however, it is very likely not optimal for a particular architecture

Important to remember context of OpenCL kernel code:

– Kernel will be executed for every work-item (enumerated thread) within an index-space (range of enumerated threads)

– This application has a one-dimensional index-space with a number of work-items equal to the number of particles in the system

– Kernel code will be invoked once for each of the N particles

– Task for the kernel code is to update the position and velocity of one particle using Newtonian mechanics

Page 28: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201028

Implementation: Kernel Code (1/5)

1 // nbody_kern.cl

2 __kernel void nbody_kern(3 float dt1, float eps,4 __global float4* pos_old,5 __global float4* pos_new,6 __global float4* vel,7 __local float4* pblock8 ) {

Prototype for kernel

Very similar to a function prototype with few exceptions

– Must by given the qualifier __kernel

– Pointer arguments must be qualified to reflect correct address space, e.g., __global, __local, etc.

Note: kernel code is placed in separate file with the extension

“.cl” to distinguish it from ordinary C

Page 29: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201029

Implementation: Kernel Code (2/5)

9 const float4 dt = (float4)( dt1, dt1, dt1, 0.0f );

10 int gti = get_global_id(0); // relative to global index-space11 int ti = get_local_id(0); // relative to local work-group

12 int n = get_global_size(0); // global index-space13 int nt = get_local_size(0); // local work-group14 int nb = n/nt;

Size/Index determination and other bookkeeping

Built-in functions allow each work-item (thread) to self-

identify its role in the parallel execution of the kernel over the index-space

Example assuming N = 8192 particles:

– Typical values for Cypress GPU would be n=8192, nt=64, nb=128

– Therefore, 0≤gti<8192 and 0≤ti<64

Page 30: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201030

Implementation: Kernel Code (3/5)

15 float4 p = pos_old[gti];16 float4 v = vel[gti];

17 float4 a = (float4)( 0.0f, 0.0f, 0.0f, 0.0f );

// For each block ...18 for(int jb=0; jb < nb; jb++) {

// Cache ONE particle position19 pblock[ti] = pos_old[ jb * nt + ti];

// Wait for others in the work-group20 barrier(CLK_LOCAL_MEM_FENCE);

Using local memory as a cache for blocking

Work-items (threads) in work-group perform cooperative read to fill cache

– Each thread copies value from global memory to local memory, expects the other threads to do the same

– barrier used for synchronization

Page 31: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201031

Implementation: Kernel Code (4/5)

// For each cached particle position ...21 for(int j=0; j<nt; j++) {

// Accumulate force/acceleration22 float4 p2 = pblock[j];23 float4 d = p2 - p;24 float invr = rsqrt(d.x * d.x + d.y * d.y + d.z * d.z + eps);25 float f = p2.w * invr * invr * invr;26 a += f * d;27 }

// Wait for others in work-group28 barrier(CLK_LOCAL_MEM_FENCE);29 }

Perform force calculation using blocking

Inner loop is over the block size where particle positions were cached

barrier is required to prevent overwriting the cache until all

work-items (threads) are done

Page 32: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201032

Implementation: Kernel Code (5/5)

30 p += dt * v + 0.5f * dt * dt * a;31 v += dt * a;

32 pos_new[gti] = p;33 vel[gti] = v;

34 }

Position and velocity update

Note: we are not updating the original particle position array, but instead are using a double-buffering scheme

Page 33: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201033

Implementation: Host Code

Page 34: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201034

Host Code Implementation (1/8)

// nbody.c (one GPU)2 #include <stdcl.h>

3 void nbody_init( int n, cl_float4* pos, cl_float4* vel );4 void nbody_output( int n, cl_float4* pos, cl_float4* vel);

5 int main(int argc, char** argv) {

6 int step, burst;

7 int nparticle = 8192; // Must be power of two for simplicity8 int nstep = 100;9 int nburst = 20; // Must divide nstep without remainder10 int nthread = 64; // chosen for ATI Radeon HD 5870

11 float dt = 0.0001f;12 float eps = 0.0001f;

Initialization of the program

Note: including stdcl.h includes CL/cl.h automatically

Page 35: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201035

Host Code Implementation (2/8)

size_t nparticle_sz = nparticle * sizeof(cl_float4);

13 cl_float4* pos1 = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0);

14 cl_float4* pos2 = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0);

15 cl_float4* vel = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0);

Contexts and memory allocation

Notes:

– stdgpu is a CONTEXT* provided by stdcl.h that includes all devices

of type GPU with everything ready to use

No set-up of platform, devices, contexts, command queues is required

– Memory allocated with clmalloc is accessible to OpenCL devices

No need to create opaque OpenCL “buffers”

Programmer can manage memory using conventional semantics

Page 36: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201036

Host Code Implementation (3/8)

16 nbody_init( nparticle, pos1, vel );

17 void* clh = clopen( stdgpu, "nbody_kern.cl", CLLD_NOW );18 cl_kernel krn = clsym( stdgpu, clh, "nbody_kern", CLLD_NOW );

Initialize particle positions and velocities

Load and compile OpenCL kernels

Notes:

– clopen and clsym manage the OpenCL kernels

Modeled after traditional UNIX dynamic loader dlopen and dlsym

No need to manage program text, create program, build program, create kernel

Your kernel is ready to use with only two calls

Page 37: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201037

Host Code Implementation (4/8)

19 clndrange_t ndr = clndrange_init1d( 0, nparticle, nthread );

20 clarg_set( krn, 0, dt );21 clarg_set( krn, 1, eps );22 clarg_set_global( krn, 4, vel );23 clarg_set_local( krn, 5, nthread * sizeof(cl_float4) );

Set up computational domain and kernel arguments

Notes:

– The N-dimensional range of the index-space is stored in a simple struct using the format { offset, global size, work-group size }

– Why did we skip arguments 2 and 3? - these arguments will not be static, but rather will be switched with a double-buffer scheme

Page 38: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201038

Host Code Implementation (5/8)

24 clmsync( stdgpu, 0, pos1, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE);

25 clmsync( stdgpu, 0, vel, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE);

Copy particle positions and velocities to GPU

Notes:

– Device ID is "0" in this case - assuming we have one GPU

– In this case the flags indicate blocking calls

– clmsync will initiate OpenCL enqueueReadBuffer or

enqueueWriteBuffer commands as needed

Page 39: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201039

Host Code Implementation (6/8)

26 for(step=0; step<nstep; step+=nburst) {

27 for(burst=0; burst<nburst; burst+=2) {

28 clarg_set_global( krn, 2, pos1 );29 clarg_set_global( krn, 3, pos2 );30 clfork( stdgpu, 0, krn, &ndr, CL_EVENT_NOWAIT );

31 clarg_set_global( krn, 2, pos2 );32 clarg_set_global( krn, 3, pos1 );33 clfork( stdgpu, 0, krn, &ndr, CL_EVENT_NOWAIT );

34 }

Kernel execution using double-buffer scheme

Notes:

– Arguments 2 and 3 must be set just prior to kernel execution since we are switching between two arrays

– clfork will initiate an OpenCL enqueueNDRange command

– In this case clfork calls are non-blocking

Page 40: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201040

Host Code Implementation (7/8)

35 clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_EVENT_RELEASE );

36 clmsync( stdgpu, 0, pos1, CL_MEM_HOST|CL_EVENT_WAIT|CL_EVENT_RELEASE );

37 }

Synchronization and read back of data

clwait blocks until the multiple kernel executions are completed

clmsync will copy particle positions back to the host

Both calls are specific to device ID "0", i.e., the GPU

Page 41: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201041

Host Code Implementation (8/8)

36 nbody_output( nparticle, pos1, vel );

37 clclose( stdgpu, clh );

38 clfree( pos1 );39 clfree( pos2 );40 clfree( vel );

41 }

Output results and clean up resources

Page 42: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201042

Compilation

Page 43: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201043

Compilation

1 ### Makefile for N-Body program2 NAME = nbody3 OBJS = nbody_init.o nbody_output.o4 OPENCL = /usr/local/atistream5 STDCL = /usr/local/browndeer6 INCS += -I$(OPENCL)/include -I$(STDCL)/include7 LIBS += -L$(OPENCL)/lib/x86_64 -lOpenCL -lpthread -ldl \

-L$(STDCL)/lib -lstdcl8 CFLAGS += -O3

9 all: $(NAME).x

10 $(NAME).x: $(NAME).o $(OBJS)11 $(CC) $(CFLAGS) $(INCS) -o $(NAME).x $(NAME).o $(OBJS) $(LIBS)

12 .SUFFIXES:13 .SUFFIXES: .c .o

14 .c.o:15 $(CC) $(CFLAGS) $(INCS) -c $<

Makefile shows required includes and links

Modify to suit your local setup

Page 44: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201044

Using Multiple OpenCL Devices

Example using two GPUs, e.g.

Divide the work for the force calculation and particle position/velocity update across two devices

Synchronization and memory management must be handled explicitly, and with some care

0, 1, 2, ..., N/2 – 1, N/2, ..., N-1

GPU 0 GPU 1

Particle Data

Page 45: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201045

Implementation for Two GPUs: Kernel Code

Page 46: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201046

Two-Device Kernel Code (1/6)

1 // nbody_kern.cl (two devices)

2 __kernel void nbody_kern(3 float dt1, float eps,4 __global float4* pos_old,5 __global float4* pos_new,6 __global float4* vel,7 __local float4* pblock,8 __global float4* pos_remote9 ) {

Prototype for kernel

Note: an additional pointer is added to point to the second "remote" array of particles - they are need to calculate the total force on the particles being updated

Page 47: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201047

Two-Device Kernel Code (2/6)

9 const float4 dt = (float4)( dt1, dt1, dt1, 0.0f );

10 int gti = get_global_id(0);11 int ti = get_local_id(0);

12 int n = get_global_size(0);13 int nt = get_local_size(0);14 int nb = n/nt;

Size and Index determination and other bookkeeping

No change from previous kernel

Page 48: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201048

Two-Device Kernel Code (3/6)

15 float4 p = pos_old[gti];16 float4 v = vel[gti];

17 float4 a = (float4)( 0.0f, 0.0f, 0.0f, 0.0f );

// For each block ...18 for(int jb=0; jb < nb; jb++) {

// Cache ONE local particle position19 pblock[ti] = pos_old[jb * nt + ti];

// Wait for others in work-group20 barrier(CLK_LOCAL_MEM_FENCE);

Loop over blocks with caching of particle positions

No change from previous kernel accept now we distinguish local and remote particles

Page 49: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201049

Two-Device Kernel Code (4/6)

// For each cached local particle position ...21 for(int j=0; j<nt; j++) {

// Accumulate acceleration22 float4 p2 = pblock[j];23 float4 d = p2 - p;24 float invr = rsqrt( d.x * d.x + d.y * d.y + d.z * d.z + eps );25 float f = p2.w * invr * invr * invr;26 a += f * d;27 }

// Wait for others in work-group28 barrier(CLK_LOCAL_MEM_FENCE);

Perform force calculation using blocking

Again, no change from previous kernel accept distinction between local and remote particles

Page 50: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201050

Two-Device Kernel Code (5/6)

// Cache one remote particle position29 pblock[ti] = pos_remote[jb * nt + ti];

// Wait for others in the work-group30 barrier(CLK_LOCAL_MEM_FENCE);

// For each cached remote particle position31 for(int j=0; j<nt; j++) {

// Accumulate acceleration32 float4 p2 = pblock[j];33 float4 d = p2 - p;34 float invr = rsqrt(d.x * d.x + d.y * d.y + d.z * d.z + eps);35 float f = p2.w * invr * invr * invr;36 a += f * d;37 }

// Wait for others in work-group38 barrier(CLK_LOCAL_MEM_FENCE);39 }

This code repeats the above code for remote particles

Only difference is global array from which positions are read

Page 51: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201051

Two-Device Kernel Code (6/6)

40 p += dt * v + 0.5f * dt * dt * a;41 v += dt * a;

42 pos_new[gti] = p;43 vel[gti] = v;

44 }

Particle and velocity update

No change from previous kernel

Page 52: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201052

Two-GPU Implementation: Host Code

Page 53: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201053

Two-Device Host Code (1/9)

// nbody2.c (two GPUs) 2 #include <stdcl.h>

3 void nbody_init( int n, cl_float4* pos, cl_float4* vel );4 void nbody_output( int n, cl_float4* pos, cl_float4* vel);

5 int main(int argc, char** argv) {

6 int step,burst;

7 int nparticle = 8192; // Must be power of two for simplicity

9 int nstep = 100;10 int nburst = 20; // Must divide nstep without remainder11 int nthread = 64; // chosen for ATI Radeon HD 5870

12 float dt = 0.0001f;13 float eps = 0.0001f;

Initialization of program - no change

Page 54: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201054

Two-Device Host Code (2/9)

size_t nparticle_sz = nparticle * sizeof(cl_float4);

14 cl_float4* pos1 = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0);

15 cl_float4* pos1a = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0);16 cl_float4* pos1b = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0);

17 cl_float4* pos2a = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0);18 cl_float4* pos2b = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0);

19 cl_float4* vel = (cl_float4*)clmalloc( stdgpu, nparticle_sz, 0);

20 cl_float4* vela = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0);21 cl_float4* velb = (cl_float4*)clmalloc( stdgpu, nparticle_sz/2, 0);

Memory allocation

We need global storage for N particles as well as half-sized storage for partitioning particle data across two devices

The designations "a" and "b" will correspond to the two GPUs

Page 55: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201055

Two-Device Host Code (3/9)

22 nbody_init( nparticle, pos, vel);

23 memcpy( pos1a, pos, nparticle/2 * sizeof(cl_float4) );24 memcpy( pos1b, pos + nparticle/2, nparticle/2 * sizeof(cl_float4) );25 memcpy( vela, vel, nparticle/2 * sizeof(cl_float4) );26 memcpy( velb, vel + nparticle/2, nparticle/2 * sizeof(cl_float4) );

27 void* clh = clopen( stdgpu, "nbody_kern.cl", CLLD_NOW );28 cl_kernel krn = clsym( stdgpu, clh, "nbody_kern", CLLD_NOW );

Particle positions and velocities are initialized as before

The memcpy calls partition the data into half-sized arrays

to distribute the particles across the two GPUs

Load and compile OpenCL kernels - no change

Page 56: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201056

Two-Device Host Code (4/9)

29 clndrange_t ndr2 = clndrange_init1d( 0, nparticle/2, nthread );

30 clarg_set( krn, 0, dt );31 clarg_set( krn, 1, eps );32 clarg_set_local( krn, 5, nthread * sizeof(cl_float4) );

Set up computational domain and kernel arguments

Only a small change is made to account for the computational domain per device being reduced by a factor of 2 since the work is being distributed across two devices

Page 57: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201057

Two-Device Host Code (5/9)

33 clmsync( stdgpu, 0, pos1a, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );34 clmsync( stdgpu, 0, pos1b, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );35 clmsync( stdgpu, 0, vela, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );

36 clmsync( stdgpu, 1, pos1a, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );37 clmsync( stdgpu, 1, pos1b, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );38 clmsync( stdgpu, 1, velb, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE);

Copy particle positions and velocities to the two GPUs

The device ID is used to indicate which GPU is involved in the copy, either "0" or "1"

Page 58: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201058

Two-Device Host Code (6/9)

39 for(step=0; step<nstep; step+=nburst) {

40 for(burst=0; burst<nburst; burst+=2) {

41 clarg_set_global( krn, 2, pos1a );42 clarg_set_global( krn, 3, pos2a );43 clarg_set_global( krn, 4, vela );44 clarg_set_global( krn, 6, pos1b );45 clfork( stdgpu, 0, krn, &ndr2, CL_EVENT_NOWAIT ); // GPU 0

46 clarg_set_global( krn, 2,pos1b );47 clarg_set_global( krn, 3,pos2b );48 clarg_set_global( krn, 4,velb );49 clarg_set_global( krn, 6,pos1a );50 clfork( stdgpu, 1, krn, &ndr2, CL_EVENT_NOWAIT ); // GPU 1

Kernel execution on both GPUs

Note that the clfork calls are non-blocking

Page 59: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201059

Two-Device Host Code (7/9)

51 clmsync( stdgpu, 0, pos2a, CL_MEM_HOST|CL_EVENT_NOWAIT );52 clmsync( stdgpu, 1, pos2b, CL_MEM_HOST|CL_EVENT_NOWAIT );

53 clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );54 clwait( stdgpu, 1, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );

55 clmsync( stdgpu, 0, pos2b, CL_MEM_DEVICE|CL_EVENT_NOWAIT );56 clmsync( stdgpu, 1, pos2a, CL_MEM_DEVICE|CL_EVENT_NOWAIT );

CPUMemory

PCIe

GPU Memory GPU Memory

Exchange particle position arrays between GPUs

The device ID is used to indicate which GPU is involved in each call, either "0" or "1"

Note the use of clwait for synchronization

Page 60: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201060

Two-Device Host Code (8/9)57 clarg_set_global( krn, 2, pos2a );58 clarg_set_global( krn, 3, pos1a );59 clarg_set_global( krn, 4, vela );60 clarg_set_global( krn, 6, pos2b );61 clfork( stdgpu, 0, krn, &ndr2, CL_EVENT_NOWAIT );

62 clarg_set_global( krn,2, pos2b );63 clarg_set_global( krn,3, pos1b );64 clarg_set_global( krn,4, velb );65 clarg_set_global( krn,6, pos2a );66 clfork( stdgpu, 1, krn, &ndr2, CL_EVENT_NOWAIT );

67 clmsync( stdgpu, 0, pos1a, CL_MEM_HOST|CL_EVENT_NOWAIT );68 clmsync( stdgpu, 1, pos1b, CL_MEM_HOST|CL_EVENT_NOWAIT );

69 clwait( stdgpu, 0, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );70 clwait( stdgpu, 1, CL_KERNEL_EVENT|CL_MEM_EVENT|CL_EVENT_RELEASE );

71 clmsync( stdgpu, 0, pos1b, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );72 clmsync( stdgpu, 1, pos1a, CL_MEM_DEVICE|CL_EVENT_WAIT|CL_EVENT_RELEASE );

73 }

Kernel execution and exchange - everything repeated for the double-buffer scheme

The bookkeeping is left as an exercise

Page 61: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201061

Two-Device Host Code (9/9)

68 memcpy( pos1, pos1a, nparticle/2 * sizeof(cl_float4) );69 memcpy( pos1 + nparticle/2, pos1b, nparticle/2 * sizeof(cl_float4) );

70 nbody_output( nparticle, pos1, vel );

71 clclose( stdgpu, clh );

72 clfree( pos1 );73 clfree( pos1a );74 clfree( pos1b );75 clfree( pos2 );76 clfree( pos2 );77 clfree( vel ); 78 clfree( vela );79 clfree( velb );

80 }

Output results and clean-up resources

Note the additional step to copy the half-sized arrays into the original array for all particles

Page 62: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201062

Resources

OpenCL™ standard

– http://www.khronos.org/opencl

ATI Stream SDK

– http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx

– Note: tutorial code tested against version 2.1

STDCL

– Download:

http://www.browndeertechnology.com/stdcl.html#download OR

https://github.com/browndeer/coprthr/tree/stable-1.0

– Documentation:

http://www.browndeertechnology.com/docs/stdcl-manual.html

Page 63: OpenCL Programming in Detail - Home - AMDdeveloper.amd.com/wordpress/media/2012/10/OpenCLProginDetailDRichiev31.pdf7 | OpenCL Programming in Detail | November 8, 2010 Kernels –Expressing

| OpenCL Programming in Detail | November 8, 201063

Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2009 Advanced Micro Devices, Inc. All rights reserved.