
Tutorial Based Lecture Notes on OpenCL 01

For PhD level course
Parallel & Distributed Computing

Politecnico di Torino, Italy
June, 2012

Omar U. Khan




1 Overview
2 OpenCL Specification
    Platform Model
    Execution Model
    Memory Model
    Programming Model
    Memory Objects
    Memory Transfers
3 First OpenCL Program
4 The Language
    Vector Types
    Floating Point




Background

Figure : Bridging the CPU/GPU Gap


Background

Figure : Top 7 Supercomputers

[Img] http://top500.org (Retrieved 20th June, 2012)



Introduction to OpenCL

New standard from Khronos (v1.0 released December 2008)
Open specification
Cross platform/vendor (CPU, GPU, DSP)
GPU giants (NVIDIA + AMD)
OpenCL is a {Language, Vendor API, Native Library, Runtime SIMT-based Environment}




Khronos Group

Figure : Khronos Group Open Specifications

[Img] http://www.khronos.org/about




Why OpenCL

Why is OpenCL different?
There are already dozens of parallel languages. Why learn a new one?
OpenCL and CUDA
Summary of Benefits
    Portability
    Compatibility (OpenGL + OpenCL)
    Already have a GPU background? Easy to learn.
    You probably already have OpenCL-enabled device(s).




Available OpenCL Distributions

NVIDIA http://www.developer.nvidia.com/object/opencl.html
    Tesla, Fermi, GeForce, etc. (Most covered)
IBM http://www.alphaworks.ibm.com/tech/opencl
    Cell-BE, Power6-7
Intel http://whatif.intel.com
Samsung http://opencl.snu.ac.kr
    ARM, DSP, Cell-BE





Resources to Get You Started (NVIDIA)

Optimization Best Practices Guide
OpenCL Programming Guide
OpenCL Jumpstart Guide




Resources to Get You Started

gpgpu.org http://www.gpgpu.org
Khronos OpenCL Specification
Khronos OpenCL Quick Reference Card http://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf





Resources to Get You Started

~/SDK/OpenCL

Examples














OpenCL Specification

Platform Model
    Host - OpenCL Enabled Devices
Programming Model
    Data Parallel Model
    Task Parallel Model
Execution Model
    SIMD Model
    SIMT Model
Memory Model
    Memory Address Spaces
    Address Hierarchy





Platform Model

Figure : Platform Model


oclDeviceQuery (GTX 260)




Platform Model

Figure : NVIDIA GTX-260 Compute Units





OpenCL Execution Model

Helpful Nomenclature
Host: CPU side of the code (context management)
Kernel: The piece of code executed by each work-item
Thread: Smallest compute unit, executing in a warp
Warp: Hardware-enforced thread grouping (of 32 threads)
Half-Warp: Group of 16 threads
NDRange (Index space): Total number of work-items along each dimension
Work-Group: Programmer-enforced thread grouping (up to 512 threads, 1024 on Fermi)
Work-item: Thread


oclDeviceQuery (Execution Model)




Index Space Preparation

Prepare the index space (called NDRange)
An instance of the kernel executes for each coordinate (global ID)
Arrangement of work-items into work-groups

Figure : Index Space for Kernel





Index Space Identifiers

Figure : Identification Scheme for Work-Items






Figure : Identification Scheme for Work-Items (Variant)






Figure : Identification Scheme for Multi-dimensional Data





Summary

Why the need for an index space?
Answer: SIMD (Single Instruction, Multiple Data)
Efficient way of enforcing scalability.

NDRange - Global Sizes $(N_x, N_y)$
Work Group Sizes $(WG_x, WG_y)$
Identifiers
    Local ID $(L_x, L_y)$
    Work Group ID $(W_x, W_y) = \left( \frac{G_x - L_x}{WG_x}, \frac{G_y - L_y}{WG_y} \right)$
    Global ID $(G_x, G_y) = (W_x WG_x + L_x, W_y WG_y + L_y)$
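The two identities can be checked with a few lines of plain C. This is only a sketch for one dimension: the function names `global_id` and `group_id_of` are made up for illustration and are not OpenCL built-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Global ID along one dimension: G = W * WG + L. */
size_t global_id(size_t group_id, size_t group_size, size_t local_id)
{
    assert(local_id < group_size);   /* local IDs run from 0 to WG - 1 */
    return group_id * group_size + local_id;
}

/* Work-group ID along one dimension: W = (G - L) / WG. */
size_t group_id_of(size_t global, size_t group_size)
{
    assert(group_size > 0);
    size_t local = global % group_size;   /* L = G mod WG */
    return (global - local) / group_size;
}
```

For example, with a work-group size of 64, work-item 5 of work-group 2 has global ID 2 * 64 + 5 = 133, and applying the inverse formula to 133 gives work-group 2 again.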





Built-In Functions for Fetching Index Map

Figure : Index-Map Built-in Functions


Hardware Mapping of Work-Items

Figure : Hardware Mapping of Index-Map

Figure : Memory Mapping for Work-Items




SIMD & SIMT

SIMD: single instruction, multiple data streams
SIMT: single instruction, multiple threads
SIMT focuses on the execution and branching behaviour of the thread.
SIMD focuses on the selection and execution along a data path.
OpenCL programmers usually ignore the SIMT behaviour of threads.
For peak performance, it should be taken into account.





GPU Memory Model

Figure : Memory Model
Img: Björn König (http://de.wikipedia.org/wiki/OpenCL)





GPU Memory Model

Data Movement
    (host ↔ global ↔ local | private)
    (host ↔ constant → global | local | private)
    Why migrate data?

Address Space Specifiers
    (__global, __constant, __local, __private)

__local cannot be directly initialized. Usage:

    __local float x;
    x = 4.0f;

No address space specifier means __private by default





Memory Overview

Memory    Chip  Cache    Access  Scope      Life
Private   On    -        RW      1 WI       WI
Local     On    -        RW      1 WG       WG
Global    Off   Yes/No*  RW      All + CPU  Host/clRelease()
Constant  Off   Yes      RO      All + CPU  Host/clRelease()

Table : Memory overview

* On GPU Fermi architectures (and OpenCL for CPU), there is cache support





Memory Overview (GTX-260)

Figure : Architecture Overview of Compute Units and Memory onNVIDIA GTX-260


Memory Overview (GTX-260)




Programming Model

Data Parallel
    One sequence of instructions applied to multiple memory objects
    1:1 mapping between work-item and memory location(s)
    SIMD

Task Parallel
    A single instance of the kernel runs without any index space.
    Equivalent to the work done by a single work-item in a work-group.
    For parallelism:
        Use vector data types
        Enqueue multiple tasks

Hybrids also supported





Array Multiplication (Data Parallel)

for (i=0; i<N; i++) C[i] = A[i] * B[i];

Figure : Point-Wise Array Multiplication





Frequency Space Convolution (Task Parallel)

Figure : Frequency Space Convolution Operation





Example

Input:

Output:
C = [ A(:,1) + B(:,1)   A(:,2) − B(:,2)   A(:,3) ∗ B(:,3)   A(:,4) / B(:,4) ]





Example (Data Parallel)

for (j = 0 ... max(j)) {
    base = j * 4
    C[base + 0] = A[base + 0] + B[base + 0]
    C[base + 1] = A[base + 1] - B[base + 1]
    C[base + 2] = A[base + 2] * B[base + 2]
    C[base + 3] = A[base + 3] / B[base + 3]
}





Example (Task Parallel)

base = 4
for (j = 0 ... max(j)) C[j * base + 0] = A[j * base + 0] + B[j * base + 0]
for (j = 0 ... max(j)) C[j * base + 1] = A[j * base + 1] - B[j * base + 1]
for (j = 0 ... max(j)) C[j * base + 2] = A[j * base + 2] * B[j * base + 2]
for (j = 0 ... max(j)) C[j * base + 3] = A[j * base + 3] / B[j * base + 3]
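Both orderings compute the same C; only the way the work is split differs. The sequential C sketch below makes that concrete. The names `rows_first` and `columns_first` are illustrative only, and the arrays are assumed to hold `rows` rows of four floats each, stored row by row.

```c
#include <assert.h>
#include <stddef.h>

/* Data-parallel ordering: one pass per row, all four operations per pass. */
void rows_first(const float *A, const float *B, float *C, size_t rows)
{
    assert(A && B && C);
    for (size_t j = 0; j < rows; j++) {
        size_t base = j * 4;
        C[base + 0] = A[base + 0] + B[base + 0];
        C[base + 1] = A[base + 1] - B[base + 1];
        C[base + 2] = A[base + 2] * B[base + 2];
        C[base + 3] = A[base + 3] / B[base + 3];
    }
}

/* Task-parallel ordering: one pass per operation (column). */
void columns_first(const float *A, const float *B, float *C, size_t rows)
{
    assert(A && B && C);
    for (size_t j = 0; j < rows; j++) C[j * 4 + 0] = A[j * 4 + 0] + B[j * 4 + 0];
    for (size_t j = 0; j < rows; j++) C[j * 4 + 1] = A[j * 4 + 1] - B[j * 4 + 1];
    for (size_t j = 0; j < rows; j++) C[j * 4 + 2] = A[j * 4 + 2] * B[j * 4 + 2];
    for (size_t j = 0; j < rows; j++) C[j * 4 + 3] = A[j * 4 + 3] / B[j * 4 + 3];
}
```

In the data-parallel version each loop iteration could become one work-item; in the task-parallel version each loop could become one independently enqueued task.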





Example (Data Parallel)

Data Parallel (data-taskv1.c)
Y = get_global_id(1)
base = Y*4;
C[base + 0] = A[base + 0] + B[base + 0]
C[base + 1] = A[base + 1] - B[base + 1]
C[base + 2] = A[base + 2] * B[base + 2]
C[base + 3] = A[base + 3] / B[base + 3]

NDRange = 1,4
WG Size = 1,1
clEnqueueNDRangeKernel(..NDRange, WG Size..)






Task Parallel
__global float4 *inputA = (__global float4*)A; \
__global float4 *inputB = (__global float4*)B; \
__global float4 *output = (__global float4*)C; \
output[0].x = inputA[0].x + inputB[0].x; \
output[1].x = inputA[1].x + inputB[1].x; \
output[2].x = inputA[2].x + inputB[2].x; \
output[3].x = inputA[3].x + inputB[3].x; \
output[0].y = inputA[0].y - inputB[0].y; \
output[1].y = inputA[1].y - inputB[1].y; \
output[2].y = inputA[2].y - inputB[2].y; \
output[3].y = inputA[3].y - inputB[3].y; \
output[0].z = inputA[0].z * inputB[0].z; \
output[1].z = inputA[1].z * inputB[1].z; \
output[2].z = inputA[2].z * inputB[2].z; \
output[3].z = inputA[3].z * inputB[3].z; \
output[0].w = inputA[0].w / inputB[0].w; \
output[1].w = inputA[1].w / inputB[1].w; \
output[2].w = inputA[2].w / inputB[2].w; \
output[3].w = inputA[3].w / inputB[3].w; \

clEnqueueTask(kernel)



Kernel A

__kernel void mod1tpA(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].x = inputA[0].x + inputB[0].x;
    output[1].x = inputA[1].x + inputB[1].x;
    output[2].x = inputA[2].x + inputB[2].x;
    output[3].x = inputA[3].x + inputB[3].x;
}

Kernel B

__kernel void mod1tpB(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].y = inputA[0].y - inputB[0].y;
    output[1].y = inputA[1].y - inputB[1].y;
    output[2].y = inputA[2].y - inputB[2].y;
    output[3].y = inputA[3].y - inputB[3].y;
}

Kernel C

__kernel void mod1tpC(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].z = inputA[0].z * inputB[0].z;
    output[1].z = inputA[1].z * inputB[1].z;
    output[2].z = inputA[2].z * inputB[2].z;
    output[3].z = inputA[3].z * inputB[3].z;
}

Kernel D

__kernel void mod1tpD(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].w = inputA[0].w / inputB[0].w;
    output[1].w = inputA[1].w / inputB[1].w;
    output[2].w = inputA[2].w / inputB[2].w;
    output[3].w = inputA[3].w / inputB[3].w;
}

clEnqueueTask(kernelA); clEnqueueTask(kernelB);
clEnqueueTask(kernelC); clEnqueueTask(kernelD);




Programming Model

Which Model to Adopt?
Considerations:
    OpenCL supports both
    Limited vector lengths (2, 4, 8, 16)
    Task parallel is not limited to vectors
    Amount of code to write?
    A single code may use BOTH models





Memory Object Types

Buffer
    Contiguous allocation of memory (same principle as malloc)
    Scalar
    Vector

Sub-Buffer
    Associate a region within a buffer
    Distribute regions across n devices
    Automatic synchronization between parent and child buffers

Image
    Multi-dimensional structure (width-height-depth)
    Direct link to texture hardware
    Supports sampling, filtering





Bandwidth

Varies for each level of the memory hierarchy
Host ↔ device transfers via clEnqueueWriteBuffer() (host to device) and clEnqueueReadBuffer() (device to host)
./oclBandwidthTest

Device to Host vs Device to Device
    Choice is obvious

Solutions
    Batch transfers
    Latency hiding
    Use pinned memory (covered in Optimizations)


Offset Transfers

clEnqueueWriteBuffer(cmd_queue, <cl_mem>, CL_TRUE, 0, sizeof(float)*16, <*>+9, 0, NULL, NULL);

clEnqueueReadBuffer(cmd_queue, <cl_mem>, CL_TRUE, 0, sizeof(float)*16, <*>+9, 0, NULL, NULL);

Figure : Offset Transfers

clEnqueueWriteBuffer(cmd_queue, <cl_mem>, CL_TRUE, sizeof(float)*9, sizeof(float)*16, <*>, 0, NULL, NULL);

clEnqueueReadBuffer(cmd_queue, <cl_mem>, CL_TRUE, sizeof(float)*9, sizeof(float)*16, <*>, 0, NULL, NULL);
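The two variants differ in where the offset is applied: on the host pointer (counted in elements, by pointer arithmetic) or inside the device buffer (the offset argument of clEnqueueWriteBuffer and clEnqueueReadBuffer is counted in bytes). The sketch below mimics a blocking write with memcpy on ordinary host memory; `write_buffer` is a hypothetical stand-in, not an OpenCL call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a blocking buffer write: copies `size` bytes from `host`
   to `device + offset`. Both offset and size are in bytes. */
void write_buffer(unsigned char *device, size_t device_size,
                  size_t offset, size_t size, const void *host)
{
    assert(offset + size <= device_size);   /* stay inside the buffer */
    memcpy(device + offset, host, size);
}
```

Calling it with offset 0 and `host + 9` sends host elements 9..24 to the start of the buffer, whereas offset `sizeof(float)*9` with the unshifted host pointer places host elements 0..15 at buffer elements 9..24.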




Region Transfers

clEnqueueReadBufferRect(command_queue, <gpu buffer>, <blocking>,
    <gpu origin *>, <host origin *>, <region *>,
    gpu_row_pitch, gpu_slice_pitch, host_row_pitch, host_slice_pitch,
    <host buffer>, num_events_in_wait_list, event_wait_list, event);

Figure : Rectangular Region Transfers
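The role of the row pitches is easiest to see in two dimensions. The sketch below copies a rectangle between two row-major byte arrays with different row pitches; it is a host-only illustration with a hypothetical `copy_rect` helper, and it leaves out the slice pitch (third dimension) that the real call also takes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a width_bytes x rows rectangle between two row-major byte arrays
   whose consecutive rows are src_pitch and dst_pitch bytes apart. */
void copy_rect(const unsigned char *src, size_t src_pitch, size_t src_x, size_t src_y,
               unsigned char *dst, size_t dst_pitch, size_t dst_x, size_t dst_y,
               size_t width_bytes, size_t rows)
{
    assert(src_x + width_bytes <= src_pitch);   /* rectangle fits in a source row */
    assert(dst_x + width_bytes <= dst_pitch);   /* and in a destination row */
    for (size_t r = 0; r < rows; r++)
        memcpy(dst + (dst_y + r) * dst_pitch + dst_x,
               src + (src_y + r) * src_pitch + src_x,
               width_bytes);
}
```

Copying the central 2 x 2 block of a 4 x 4 array into a tightly packed 2 x 2 array uses a source pitch of 4 and a destination pitch of 2.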



Example: Squaring Array

C Code

void squareCPU(float *a, int size) {
    int x;
    for (x = 0; x < size; x++)
        a[x] *= a[x];
}

OpenCL Code (array-square.c)

__kernel void square(__global float *a) {
    int x = get_global_id(0);
    a[x] *= a[x];
}




Squaring Array: Kernel Code

OpenCL Code
__kernel void square(__global float *a) {
    int x = get_global_id(0);
    a[x] *= a[x];
}

ANSI C99-based language.
__kernel defines the entry point of the kernel
get_global_id(0) is a built-in function that returns the work-item's global index along dimension 0
__global tells the work-item in which address space its memory has been allocated
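One way to read the kernel is as the body of the CPU loop, with the loop itself supplied by the runtime. The sketch below imitates that on the host: `square_item` plays the role of the kernel body and `run_square` the role of a one-dimensional NDRange launch. Both names are illustrative, and a real device would run the work-items concurrently rather than in order.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body for one work-item; gid stands in for get_global_id(0). */
static void square_item(float *a, size_t gid)
{
    a[gid] *= a[gid];
}

/* Host-side stand-in for a 1-D NDRange launch: one call per global ID. */
void run_square(float *a, size_t global_size)
{
    assert(a != NULL);
    for (size_t gid = 0; gid < global_size; gid++)
        square_item(a, gid);
}
```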


Squaring Array: Host Code


OpenCL Objects
    cl_platform_id
    cl_device_id
    cl_context
    cl_command_queue
    cl_program
    cl_kernel
    cl_mem

Host Code (Platform & Device Calls)

cat /etc/OpenCL/vendors/nvidia.icd
libcuda.so

Devices (Collection of OpenCL Devices that can be used)

Host Code (Context Call)

Prepare Context for Execution (Context is prepared using functions from the OpenCL API)



Squaring Array: Host Context

Figure : Host - Device Access Process (Step 0)



Host Code: Command Queue



Command Queue

Coordinate execution
Place commands on the command_queue
    Kernel execution (clEnqueueTask, clEnqueueNDRangeKernel)
    Memory transfers (clEnqueueReadBuffer, clEnqueueWriteBuffer)
    Synchronization, profiling, etc.

Scheduling Methods
    In-Order: FIFO queue. All commands execute in the order they appear in the queue.
    Out-of-Order: Commands do not wait for the previous command to finish.
    All coordination of commands is done using event objects


Host Code: Program Source

Program Objects (the executable units, built from kernel source, that can be sent to the devices)



Squaring Array: Host Context Objects

Figure : Host - Device Access Process (Step 4+5)



Kernels (Collection of OpenCL functions that can be executed)



Source Code Errors?

clGetProgramBuildInfo()
char buffer[2048];
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      sizeof(buffer), buffer, NULL);
printf("Build Log: %s\n", buffer);




Squaring Array: Host Context Objects




Memory Objects (Visible to both Host and GPU device)



Squaring Array: Completed Context


Figure : Context Sharing





OpenCL Basis

C99 with some extensions and restrictions
Why not C11?
    Multi-threading support (brings mutexes and CPU-specific thread storage mechanisms)
Structure alignments enforced

struct {            struct {
    char c;             char c;
    int i;              char padding[3];
}                       int i;
                    }
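The two struct layouts can be compared directly with offsetof. The sketch below assumes a platform where int is aligned to its own size (4 bytes on most current systems); the struct and function names are illustrative only.

```c
#include <stddef.h>

/* Layout with compiler-inserted padding after c. */
struct implicit_padding { char c; int i; };

/* Same layout with the padding spelled out by hand. */
struct explicit_padding { char c; char padding[sizeof(int) - 1]; int i; };

/* Returns 1 when both layouts place i at the same offset and have the
   same total size, which holds when int is aligned to sizeof(int). */
int layouts_match(void)
{
    return offsetof(struct implicit_padding, i) == offsetof(struct explicit_padding, i)
        && sizeof(struct implicit_padding) == sizeof(struct explicit_padding);
}
```

Knowing the padding matters when the same struct is filled on the host and read inside a kernel, since both sides must agree on the offsets.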





Restricted adaptation of C99

No variable length arrays (including structures with unsized arrays)

    int myArray[]; ✗

No C99 standard headers
Built-in functions also do not require headers

    #include <stdio.h>, etc. ✗

No usage of extern, static specifiers

    extern int var1; ✗
    static int var2; ✗





Restricted Adaptation of C99

No recursion
The return type of a __kernel is always void

    __kernel void myKernel( .. params ..) ✓

All pointer kernel parameters must be either __global, __constant, or __local
Double precision is optional; it may or may not be supported

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable ✓





Scalar / Vector Data Types

Scalars (char, uchar, short, ushort, int, uint, long, ulong, float, double)
Vector Types (group of scalars)
    char2, float4, short8
    sizes (2, 4, 8, 16)

Declaration/Initialization

    float4 vectorA = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float4 vectorB = sin(vectorA);

Caution
    Large vectors increase register pressure
    Sequential execution on incompatible (non-SIMD) devices





Vector Component Access

Size ≤ 4 (.xyzw)
    float4 vecA = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float4 vecB = vecA.xyzw; // 1,2,3,4
    float4 vecC = vecA.xyxy; // 1,2,1,2

Size > 4 (.s<n>)
    float8 vecD = (float8)(vecA, 5.0f, 6.0f, 7.0f, 8.0f);
    float8 vecE = (float8)(vecA.xyzw, vecD.s4567);
    float16 vecF = (float16)(vecE, vecE.s76543210);
    float16 vecG = (float16)(vecF.sfedcba9876543210);

Odd/Even (.odd, .even)
    float8 vecH = (float8)(vecG.odd);

High/Low (.hi, .lo)
    float8 vecI = (float8)(vecG.hi);
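What the .hi/.lo and .odd/.even selectors pick can be spelled out with plain arrays. The following is a host-side sketch in C, with a vector represented as an array of n floats; `take_half` and `take_parity` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* .lo / .hi: lower or upper half of an n-component vector. */
void take_half(const float *v, size_t n, int upper, float *out)
{
    assert(n % 2 == 0);
    for (size_t k = 0; k < n / 2; k++)
        out[k] = v[(upper ? n / 2 : 0) + k];
}

/* .even / .odd: components with even or odd index (s0 is even). */
void take_parity(const float *v, size_t n, int odd, float *out)
{
    assert(n % 2 == 0);
    for (size_t k = 0; k < n / 2; k++)
        out[k] = v[2 * k + (odd ? 1 : 0)];
}
```

For an 8-component vector holding 1..8, the upper half is 5,6,7,8 and the odd components are 2,4,6,8.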





Mixing Scalars and Vectors

Scalar to Vector Casting (note: same address space)
__kernel void myKernel(__global float *dataFromCPU) {
    __global float4 *ptr = (__global float4*)dataFromCPU;
    int i;
    for (i = 0; i < {Size from CPU}/4; i++)
        ...ptr[i].s0123...
}

Loading Scalar Data into a Vector
vector vloadn(size_t offset,
              __(global|constant|local|private) scalar *mem)

array = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
int4 vec1 = vload4(0, array);   // 0, 1, 2, 3
int4 vec2 = vload4(1, array);   // 4, 5, 6, 7
int4 vec3 = vload4(1, array+2); // 6, 7, 8, 9





Mixing Scalars and Vectors

Storing Vector Data into a Scalar Array

void vstoren(vector vec, size_t offset,
             __(global|local|private) scalar *mem)

vstore8(float_vector, 0, float_array); // Store at start
vstore4(int_vector, 6, int_array);     // Store at int_array + 6*4 (offset counts whole vectors)
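In both vloadn and vstoren the offset is counted in whole vectors, so the memory touched starts at mem + offset * n. The C sketch below imitates vload4 and vstore4 on the host; `int4_t`, `load4` and `store4` are hypothetical stand-ins for the OpenCL type and built-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { int s[4]; } int4_t;   /* hypothetical stand-in for int4 */

/* Like vload4(offset, p): reads the four ints starting at p + offset * 4. */
int4_t load4(size_t offset, const int *p)
{
    int4_t v;
    assert(p != NULL);
    memcpy(v.s, p + offset * 4, sizeof v.s);
    return v;
}

/* Like vstore4(v, offset, p): writes the four ints starting at p + offset * 4. */
void store4(int4_t v, size_t offset, int *p)
{
    assert(p != NULL);
    memcpy(p + offset * 4, v.s, sizeof v.s);
}
```

With an array holding 0, 1, 2, ..., `load4(1, array + 2)` returns 6, 7, 8, 9, and `store4(v, 2, out)` writes to out[8] through out[11].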


Example (Reversing an Array)

array-reverse.c
__kernel void square(__global float *a, __global float *b) {
    int x = get_global_id(0);
    int s = get_global_size(0);
    __global float4 *in = (__global float4*)a;
    __global float4 *out = (__global float4*)b;
    out[s-x-1] = in[x].s3210;
}

Pseudo In-Place Variant
__kernel void square(__global float *a) {
    int x = get_global_id(0);
    int s = get_global_size(0);
    __global float4 *in = (__global float4*)a;
    float4 loc = in[x].s3210;   // private copy of the reversed chunk
    in[s-x-1] = loc;
}
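The out-of-place kernel reverses the array in two steps at once: each work-item reverses the four floats of its own chunk (.s3210) and writes them to the mirrored chunk position s - x - 1. A sequential C sketch of the same index arithmetic, with the hypothetical name `reverse_by_chunks` and the outer loop standing in for the work-items:

```c
#include <assert.h>
#include <stddef.h>

/* Reverse n floats (n a multiple of 4) the way the kernel does:
   "work-item" x reverses chunk x and writes it to chunk s - x - 1. */
void reverse_by_chunks(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    assert(in != out);             /* out-of-place only */
    size_t s = n / 4;              /* number of float4 chunks = global size */
    for (size_t x = 0; x < s; x++)
        for (size_t k = 0; k < 4; k++)
            out[(s - x - 1) * 4 + k] = in[x * 4 + (3 - k)];   /* .s3210 */
}
```

The sketch deliberately rejects in == out: when the same buffer is read and written, a work-item may overwrite a chunk that another work-item has not read yet, which is presumably why the second kernel is labelled a pseudo in-place variant.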




Example (Reversing an Array)

Memories
size_t t = sizeof(float)*size;
cl_mem dataIn, dataOut;
dataIn = clCreateBuffer(context, CL_MEM_READ_WRITE, t, NULL, NULL);
dataOut = clCreateBuffer(context, CL_MEM_READ_WRITE, t, NULL, NULL);

Kernel Arguments
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&dataIn);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dataOut);

NDRange Sizes

array-square.c
    globalSize[0] = size;
    localSize[0] = 16;

array-reverse.c
    globalSize[0] = size/4;
    localSize[0] = globalSize[0]/2;





Dimension Rules

NDRange $(N_x, N_y, N_z)$ can be any value (preferably $2^n$)
Work Group Sizes $(WG_x, WG_y, WG_z) \le (N_x, N_y, N_z)$

Exactly divisible
    $N_x \bmod WG_x = 0$
    $N_y \bmod WG_y = 0$
    $N_z \bmod WG_z = 0$
Range Barrier 1
    $1 \le WG_x \le 512$
    $1 \le WG_y \le 512$
    $1 \le WG_z \le 64$
    Max limit is hardware dependent
Range Barrier 2
    $WG_x \times WG_y \times WG_z \le 512$





Dimension Rules

Examples

globalSize[0] = 1024, localSize[0] = 2 ✓

globalSize[0] = 2048, localSize[0] = 1024 ✗

globalSize[0] = 1024, localSize[0] = 2
globalSize[1] = 1024, localSize[1] = 2
globalSize[2] = 32, localSize[2] = 2 ✓

globalSize[0] = 1024, localSize[0] = 256
globalSize[1] = 1024, localSize[1] = 2
globalSize[2] = 32, localSize[2] = 1 ✓

globalSize[0] = 1024, localSize[0] = 256
globalSize[1] = 1024, localSize[1] = 2
globalSize[2] = 32, localSize[2] = 2 ✗
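The rules can be turned into a small checker that reproduces the verdicts of the examples. This is a sketch in plain C with a hypothetical function name; the limits 512/512/64 and the total of 512 are the values quoted on the slides and would be queried from the device in real code.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the work-group sizes satisfy the dimension rules, else 0.
   The per-dimension limits and the total limit are device dependent;
   the constants here are the ones quoted in these notes. */
int valid_sizes(size_t dims, const size_t *global, const size_t *local)
{
    static const size_t max_per_dim[3] = { 512, 512, 64 };
    size_t total = 1;
    assert(dims >= 1 && dims <= 3);
    for (size_t d = 0; d < dims; d++) {
        if (local[d] < 1 || local[d] > max_per_dim[d]) return 0;          /* range barrier 1 */
        if (local[d] > global[d] || global[d] % local[d] != 0) return 0;  /* exact divisibility */
        total *= local[d];
    }
    return total <= 512;                                                  /* range barrier 2 */
}
```

The last example fails only on the product rule: 256 * 2 * 2 = 1024 work-items per group, even though each dimension is within its own limit.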





Vector Relational Operators

Relational comparisons between vectors return a vectorized TRUE/FALSE (-1/0)

array-compare.c
__kernel void compare(__global int *a) {
    int x = get_global_id(0);
    __global int4 *in = (__global int4*)a;
    in[x] = (float4)(1.0f,1.0f,1.0f,1.0f) ==
            (float4)(0.0f,1.0f,2.0f,3.0f);
}

Output
0 -1 0 0 0 -1 0 0
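The -1/0 convention means every bit of a true component is set. A host-side C sketch of the component-wise comparison, with vectors as plain arrays and the hypothetical name `vec_equal`:

```c
#include <assert.h>
#include <stddef.h>

/* Component-wise a == b: -1 where the components are equal, 0 where not. */
void vec_equal(const float *a, const float *b, int *result, size_t n)
{
    assert(a && b && result);
    for (size_t k = 0; k < n; k++)
        result[k] = (a[k] == b[k]) ? -1 : 0;
}
```

Applied to (1,1,1,1) and (0,1,2,3) it yields 0 -1 0 0, the pattern printed once per work-item in the output above.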





Scalar Relational Operators

Relational comparisons between scalars return TRUE/FALSE (1/0)


    int x = get_global_id(0);
    a[4*x+0] = 1.0f == 0.0f;
    a[4*x+1] = 1.0f == 1.0f;
    a[4*x+2] = 1.0f == 2.0f;
    a[4*x+3] = 1.0f == 3.0f;
}

Output
0 1 0 0 0 1 0 0





Vectors in If and While Structures


    int x = get_global_id(0);
    __global int4 *in = (__global int4*)a;
    float4 b = (float4)(1.0f,1.0f,1.0f,1.0f);
    float4 c = (float4)(0.0f,1.0f,2.0f,3.0f);
    if (b < c) in[x] = 2;
    else in[x] = 3;
}

Conditions will not be visited

Solution: any/all
    if (any(b < c)) in[x] = 2;
    else in[x] = 3;
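any() and all() reduce a vector comparison result to the single scalar that if and while need. A component counts as true when its most significant bit is set, which for the -1/0 results of a comparison means it is negative. The C sketch below uses plain int arrays and the hypothetical names `any_true` and `all_true`.

```c
#include <assert.h>
#include <stddef.h>

/* Like any(): 1 if at least one component is true (negative, MSB set). */
int any_true(const int *v, size_t n)
{
    assert(v != NULL);
    for (size_t k = 0; k < n; k++)
        if (v[k] < 0) return 1;
    return 0;
}

/* Like all(): 1 only if every component is true. */
int all_true(const int *v, size_t n)
{
    assert(v != NULL);
    for (size_t k = 0; k < n; k++)
        if (v[k] >= 0) return 0;
    return 1;
}
```

For b < c with b = (1,1,1,1) and c = (0,1,2,3) the comparison result is (0, 0, -1, -1), so any() is true and all() is false.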





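Vectors in Ternary Operators


    int x = get_global_id(0);
    __global int4 *in = (__global int4*)a;
    float4 b = (float4)(1.0f,1.0f,1.0f,1.0f);
    float4 c = (float4)(0.0f,1.0f,2.0f,3.0f);
    in[x] = b < c ? 1 : 2;
}

Output
2 2 1 1

Scalar Syntax: <relation> ? TRUE : FALSE (the relation is true when non-zero)

Vector Syntax: <relation> ? TRUE : FALSE, applied per component (a component is true when its most significant bit is set, i.e. -1). The equivalent built-in takes its arguments in the reverse order: select(FALSE, TRUE, <relation>)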
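The output 2 2 1 1 can be reproduced by applying the selection one component at a time. The following C sketch uses plain int arrays; `vec_select` is a hypothetical helper, and it assumes the OpenCL rule that a vector component is treated as true when its most significant bit is set.

```c
#include <assert.h>
#include <stddef.h>

/* Component-wise cond ? if_true : if_false. A component counts as true
   when its most significant bit is set (negative as a signed int). */
void vec_select(const int *cond, int if_true, int if_false, int *out, size_t n)
{
    assert(cond && out);
    for (size_t k = 0; k < n; k++)
        out[k] = (cond[k] < 0) ? if_true : if_false;
}
```

Here b < c evaluates to (0, 0, -1, -1), so the first two components take the value 2 and the last two take the value 1.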





Floating Point Computation Background

Data Types and Range
IEEE-754 and OpenCL
    Supported: float
    Optional: double & half

Optional?

IEEE 754 Number Modes
    1 Normal numbers
    2 Denormal numbers
    3 Infinite numbers
    4 NaNs

IEEE 754 Rounding Modes
    Round to nearest even
    Round towards +∞
    Round towards −∞
    Round towards 0





Float

Performance optimized for floats
32-bit (1 sign, 8 exponent, 23 fraction)
Normal Range ≈ $1.18 \times 10^{-38} \ldots 3.4 \times 10^{38}$

Number Modes Supported
    Normal
    Infinite
    NaN

Why not Denormal? Takes more cycles.
Rounding Modes Supported
    Round to Nearest Even





Double

64-bit (1 sign, 11 exponent, 52 fraction)
Range ≈ $1 \times 10^{-323.3} \ldots 1 \times 10^{308.3}$ (the lower bound includes denormals)

Optional. To check

    cl_uint putHere;   // 0 if double is not supported
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                    sizeof(putHere), &putHere, NULL);

To enable

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable





Double

Double type supports: Normal, Denormal, Infinite, NaN
Why Denormal in double and not in float?

    (x − y)/z
    For performance, disable support using -cl-denorms-are-zero in clBuildProgram

Rounding Modes Supported: All

FMA Support

    double a, b, c;
    a*b+c
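The expression (x − y)/z shows why denormals matter: when x and y are close together and near the bottom of the normal range, their difference is a denormal, and flushing it to zero would change the quotient from a perfectly ordinary number to 0. The host-side C sketch below assumes the compiler keeps IEEE-754 denormals for double, which is the default without fast-math style options; the function names are illustrative.

```c
#include <assert.h>
#include <float.h>

/* 1 if v is a denormal: non-zero but smaller in magnitude than DBL_MIN. */
int is_denormal(double v)
{
    double m = v < 0 ? -v : v;
    return m != 0.0 && m < DBL_MIN;
}

/* The expression from the notes: (x - y) / z. */
double scaled_difference(double x, double y, double z)
{
    assert(z != 0.0);
    return (x - y) / z;
}
```

With x = 3.0 * DBL_MIN, y = 2.5 * DBL_MIN and z = DBL_MIN, the difference is a denormal and the quotient is 0.5; with denormals flushed to zero the same expression would evaluate to 0.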





Half

16-bit (1 sign, 5 exponent, 10 fraction)
Normal Range ≈ $6.1 \times 10^{-5} \ldots 65504$

Optional. To check: CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF

To enable: #pragma OPENCL EXTENSION cl_khr_fp16 : enable

Number Modes supported (Normal, Infinite, NaN)
Rounding Modes supported: Either
    ±∞, OR
    Nearest Even

FMA Support: No





Default Modes

Operation Type                  Rounding Mode
Floating Point Arithmetic       RTE
Builtin functions (e.g., math)  RTE
cast float→int                  RTZ
cast int→float                  RTE

Table : Default OpenCL Rounding Modes





Summary

Parameter     Float  Double  Half
Normal        ✓      ✓       ✓
Denormalized  ✗      ✓       ✗
Infinity/NaN  ✓      ✓       ✓
Round Even    ✓      ✓       A
Round ±∞      ✗      ✓
Round Zero    ✗      ✓       ✗
FMA           ✗      ✓       ✗

Table : IEEE-754 and OpenCL Compliancy




