OverviewOpenCL Specification
First OpenCL ProgramThe Language
Tutorial Based Lecture Notes on OpenCL 01
For PhD level course: Parallel & Distributed Computing
Politecnico di Torino, Italy
June, 2012
Omar U. Khan
1 Overview
2 OpenCL Specification
    Platform Model
    Execution Model
    Memory Model
    Programming Model
    Memory Objects
    Memory Transfers
3 First OpenCL Program
4 The Language
    Vector Types
    Floating Point
Background
Figure : Bridging the CPU/GPU Gap
Background
Figure : Top 7 Supercomputers
[Img] http://top500.org (Retrieved 20th June, 2012)
Introduction to OpenCL
New standard from Khronos (v1.0 released December 2008)
Open specification
Cross platform/vendor (CPU, GPU, DSP)
GPU giants (NVIDIA + AMD)
OpenCL is a {Language, Vendor API, Native Library, Runtime SIMT-based Environment}
Khronos Group
Figure : Khronos Group Open Specifications
[Img] http://www.khronos.org/about
Why OpenCL
Why is OpenCL different?
There are already dozens of parallel languages. Why learn a new one?
OpenCL and CUDA
Summary of Benefits
    Portability
    Compatibility (OpenGL + OpenCL)
    Already have a GPU background? Easy to learn.
    You probably already have OpenCL-enabled device(s).
Available OpenCL Distributions
NVIDIA: http://www.developer.nvidia.com/object/opencl.html
    Tesla, Fermi, GeForce, etc. (Most covered)
IBM: http://www.alphaworks.ibm.com/tech/opencl
    Cell-BE, Power6-7
Intel: http://whatif.intel.com
Samsung: http://opencl.snu.ac.kr
    ARM, DSP, Cell-BE
Resources to Get You Started (NVIDIA)
Optimization Best Practices Guide
OpenCL Programming Guide
OpenCL Jumpstart Guide
Resources to Get You Started
gpgpu.org
Khronos OpenCL Specification
Khronos OpenCL Quick Reference Card
Resources to Get You Started
~/SDK/OpenCL
Examples
OpenCL Specification
Platform Model
    Host - OpenCL Enabled Devices
Programming Model
    Data Parallel Model
    Task Parallel Model
Execution Model
    SIMD Model
    SIMT Model
Memory Model
    Memory Address Spaces
    Address Hierarchy
Platform Model
Figure : Platform Model
oclDeviceQuery (GTX 260)
Platform Model
Figure : NVIDIA GTX-260 Compute Units
OpenCL Execution Model
Helpful Nomenclature
    Host: CPU side of the code (context management)
    Kernel: The same piece of code, executed by each work-item
    Thread: Smallest unit of execution; threads execute in warps
    Warp: Hardware-enforced thread grouping (of 32 threads)
    Half-Warp: Group of 16 threads
    NDRange (Index space): Total number of work-items along each dimension
    Work-Group: Programmer-enforced thread grouping (up to 512 threads; 1024 on Fermi)
    Work-item: Thread
oclDeviceQuery (Execution Model)
Index Space Preparation
Prepare the index space (called NDRange)
An instance of the kernel executes for each coordinate (global ID)
Work-items are arranged into work-groups
Figure : Index Space for Kernel
Index Space Identifiers
Figure : Identification Scheme for Work-Items
Index Space Identifiers
Figure : Identification Scheme for Work-Items (Variant)
Index Space Identifiers
Figure : Identification Scheme for Multi-dimensional Data
Summary
Why the need for an index space?
    Answer: SIMD (Single Instruction, Multiple Data)
    Efficient way of enforcing scalability.
NDRange - Global Sizes $(N_x, N_y)$
Work Group Sizes $(WG_x, WG_y)$
Identifiers
    Local ID $(L_x, L_y)$
    Work Group ID $(W_x, W_y) = \left(\frac{G_x - L_x}{WG_x}, \frac{G_y - L_y}{WG_y}\right)$
    Global ID $(G_x, G_y) = (W_x WG_x + L_x,\; W_y WG_y + L_y)$
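The identifier relations above can be checked with a few lines of host-side C. This is only a sketch for one dimension: the helper names are hypothetical and are not part of the OpenCL API (inside a kernel the same values come from the index built-ins).

```c
#include <stddef.h>

/* Hypothetical helpers (not OpenCL API): the one-dimensional
   identifier relations from the summary, written as plain C. */

/* G = W * WG + L */
size_t global_id_from(size_t group_id, size_t group_size, size_t local_id)
{
    return group_id * group_size + local_id;
}

/* W = (G - L) / WG */
size_t group_id_from(size_t global_id, size_t group_size, size_t local_id)
{
    return (global_id - local_id) / group_size;
}

/* L is the position of the work-item inside its work-group. */
size_t local_id_from(size_t global_id, size_t group_size)
{
    return global_id % group_size;
}
```

For example, with a work-group size of 4, the work-item with local ID 1 in work-group 2 has global ID 2 * 4 + 1 = 9, and the inverse relations recover 2 and 1 from 9.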
Built-In Functions for Fetching Index Map
Figure : Index-Map Built-in Functions
Hardware Mapping of Work-Items
Figure : Hardware Mapping of Index-Map
Figure : Memory Mapping for Work-Items
SIMD & SIMT
SIMD: single instruction, multiple data streams
SIMT: single instruction, multiple threads
SIMT focuses on the execution and branching behaviour of the thread.
SIMD focuses on the selection and execution along a data path.
OpenCL programmers usually ignore the SIMT behaviour of threads.
For peak performance, however, it should be taken into account.
GPU Memory Model
Figure : Memory Model
Img: Björn König (http://de.wikipedia.org/wiki/OpenCL)
GPU Memory Model
Data Movement
    host ↔ global ↔ local | private
    host ↔ constant → global | local | private
    Why migrate data?
Address Space Specifiers (__global, __constant, __local, __private)
__local variables cannot be initialized in their declaration. Usage:
    __local float x;
    x = 4.0f;
No address space specifier means __private by default
Memory Overview
Memory     On/Off chip   Cached     Access   Scope                  Lifetime
Private    On            -          RW       1 work-item            Work-item
Local      On            -          RW       1 work-group           Work-group
Global     Off           Yes/No*    RW       All work-items + CPU   Host / clReleaseMemObject()
Constant   Off           Yes        RO       All work-items + CPU   Host / clReleaseMemObject()
Table : Memory overview
* On GPU Fermi architectures (and OpenCL for CPU), there is cache support
Memory Overview (GTX-260)
Figure : Architecture Overview of Compute Units and Memory on NVIDIA GTX-260
Programming Model
Data Parallel
    One sequence of instructions applied to multiple memory objects
    1:1 mapping between work-item and memory location(s)
    SIMD
Task Parallel
    A single instance of a kernel runs without any index space.
    Equivalent to the work done by a single work-item in a work-group.
    For parallelism
        Use vector data types
        Enqueue multiple tasks
Hybrids also supported
Array Multiplication (Data Parallel)
for (i=0; i<N; i++) C[i] = A[i] * B[i];
Figure : Point-Wise Array Multiplication
Frequency Space Convolution (Task Parallel)
Figure : Frequency Space Convolution Operation
Example
Input:
Output:
$C = \begin{bmatrix} A(:,1) + B(:,1) & A(:,2) - B(:,2) & A(:,3) * B(:,3) & A(:,4) / B(:,4) \end{bmatrix}$
Example (Data Parallel)
for (j = 0 ... max(j)) {
    base = j * 4
    C[base + 0] = A[base + 0] + B[base + 0]
    C[base + 1] = A[base + 1] - B[base + 1]
    C[base + 2] = A[base + 2] * B[base + 2]
    C[base + 3] = A[base + 3] / B[base + 3]
}
Example (Task Parallel)
base = 4
for (j = 0 ... max(j)) C[j * base + 0] = A[j * base + 0] + B[j * base + 0]
for (j = 0 ... max(j)) C[j * base + 1] = A[j * base + 1] - B[j * base + 1]
for (j = 0 ... max(j)) C[j * base + 2] = A[j * base + 2] * B[j * base + 2]
for (j = 0 ... max(j)) C[j * base + 3] = A[j * base + 3] / B[j * base + 3]
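The two loop orderings in this example can be written as ordinary sequential C to show that they compute the same result and differ only in how the work is split. This is a sketch under assumptions: the function names are hypothetical, the arrays are row-major with four columns, and nothing here runs on a device.

```c
#include <stddef.h>

/* Data-parallel ordering: one iteration per row, and each iteration
   applies all four operations. In OpenCL each iteration would be
   one work-item. */
void rows_first(const float *A, const float *B, float *C, size_t rows)
{
    for (size_t j = 0; j < rows; j++) {
        size_t base = j * 4;
        C[base + 0] = A[base + 0] + B[base + 0];
        C[base + 1] = A[base + 1] - B[base + 1];
        C[base + 2] = A[base + 2] * B[base + 2];
        C[base + 3] = A[base + 3] / B[base + 3];
    }
}

/* Task-parallel ordering: one independent loop per operation.
   In OpenCL each loop could become a separately enqueued task. */
void ops_first(const float *A, const float *B, float *C, size_t rows)
{
    const size_t base = 4;
    for (size_t j = 0; j < rows; j++) C[j * base + 0] = A[j * base + 0] + B[j * base + 0];
    for (size_t j = 0; j < rows; j++) C[j * base + 1] = A[j * base + 1] - B[j * base + 1];
    for (size_t j = 0; j < rows; j++) C[j * base + 2] = A[j * base + 2] * B[j * base + 2];
    for (size_t j = 0; j < rows; j++) C[j * base + 3] = A[j * base + 3] / B[j * base + 3];
}
```

Both functions fill C identically; the choice between them only decides which unit of work (a row or an operation) is handed to the device as an independent piece.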
Example (Data Parallel)
Data Parallel (data-taskv1.c)
    Y = get_global_id(1);
    base = Y * 4;
    C[base + 0] = A[base + 0] + B[base + 0];
    C[base + 1] = A[base + 1] - B[base + 1];
    C[base + 2] = A[base + 2] * B[base + 2];
    C[base + 3] = A[base + 3] / B[base + 3];
NDRange = 1,4
WG Size = 1,1
clEnqueueNDRangeKernel(.., NDRange, WG Size, ..)
Example (Task Parallel)
Task Parallel
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].x = inputA[0].x + inputB[0].x;
    output[1].x = inputA[1].x + inputB[1].x;
    output[2].x = inputA[2].x + inputB[2].x;
    output[3].x = inputA[3].x + inputB[3].x;
    output[0].y = inputA[0].y - inputB[0].y;
    output[1].y = inputA[1].y - inputB[1].y;
    output[2].y = inputA[2].y - inputB[2].y;
    output[3].y = inputA[3].y - inputB[3].y;
    output[0].z = inputA[0].z * inputB[0].z;
    output[1].z = inputA[1].z * inputB[1].z;
    output[2].z = inputA[2].z * inputB[2].z;
    output[3].z = inputA[3].z * inputB[3].z;
    output[0].w = inputA[0].w / inputB[0].w;
    output[1].w = inputA[1].w / inputB[1].w;
    output[2].w = inputA[2].w / inputB[2].w;
    output[3].w = inputA[3].w / inputB[3].w;
clEnqueueTask(kernel)
Example (Task Parallel)
Kernel A
    __kernel void mod1tpA(...) {
        __global float4 *inputA = (__global float4*)A;
        __global float4 *inputB = (__global float4*)B;
        __global float4 *output = (__global float4*)C;
        output[0].x = inputA[0].x + inputB[0].x;
        output[1].x = inputA[1].x + inputB[1].x;
        output[2].x = inputA[2].x + inputB[2].x;
        output[3].x = inputA[3].x + inputB[3].x;
    }
Kernel B
    __kernel void mod1tpB(...) {
        __global float4 *inputA = (__global float4*)A;
        __global float4 *inputB = (__global float4*)B;
        __global float4 *output = (__global float4*)C;
        output[0].y = inputA[0].y - inputB[0].y;
        output[1].y = inputA[1].y - inputB[1].y;
        output[2].y = inputA[2].y - inputB[2].y;
        output[3].y = inputA[3].y - inputB[3].y;
    }
Kernel C
    __kernel void mod1tpC(...) {
        __global float4 *inputA = (__global float4*)A;
        __global float4 *inputB = (__global float4*)B;
        __global float4 *output = (__global float4*)C;
        output[0].z = inputA[0].z * inputB[0].z;
        output[1].z = inputA[1].z * inputB[1].z;
        output[2].z = inputA[2].z * inputB[2].z;
        output[3].z = inputA[3].z * inputB[3].z;
    }
Kernel D
    __kernel void mod1tpD(...) {
        __global float4 *inputA = (__global float4*)A;
        __global float4 *inputB = (__global float4*)B;
        __global float4 *output = (__global float4*)C;
        output[0].w = inputA[0].w / inputB[0].w;
        output[1].w = inputA[1].w / inputB[1].w;
        output[2].w = inputA[2].w / inputB[2].w;
        output[3].w = inputA[3].w / inputB[3].w;
    }
clEnqueueTask(kernelA); clEnqueueTask(kernelB);
clEnqueueTask(kernelC); clEnqueueTask(kernelD);
Programming Model
Which Model to Adopt?
Considerations:
    OpenCL supports both
    Limited vector lengths (2, 4, 8, 16)
    Task parallel is not limited to vectors
    Amount of code to write?
    A single code may use BOTH models
Memory Object Types
Buffer
    Contiguous allocation of memory (same principle as malloc)
    Scalar
    Vector
Sub-Buffer
    Associates a region within a buffer
    Distribute regions across n devices
    Automatic synchronization between parent and child buffers
Image
    Multi-dimensional structure (width-height-depth)
    Direct link to texture hardware
    Supports sampling, filtering
Bandwidth
Varies for each level of the memory hierarchy
Host to Device transfers via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
    ./oclBandwidthTest
Device to Host vs Device to Device
    Choice is obvious: device-to-device bandwidth is much higher
Solutions
    Batch transfers
    Latency hiding
    Use pinned memory (covered in Optimizations)
Offset Transfers
Host-side offset (the transfer starts 9 floats into the host array):
    clEnqueueWriteBuffer(cmd_queue, <cl_mem>, CL_TRUE, 0, sizeof(float)*16, <*>+9, 0, NULL, NULL);
    clEnqueueReadBuffer(cmd_queue, <cl_mem>, CL_TRUE, 0, sizeof(float)*16, <*>+9, 0, NULL, NULL);
Figure : Offset Transfers
Device-side offset (the buffer offset argument is in bytes, so 9 floats is sizeof(float)*9):
    clEnqueueWriteBuffer(cmd_queue, <cl_mem>, CL_TRUE, sizeof(float)*9, sizeof(float)*16, <*>, 0, NULL, NULL);
    clEnqueueReadBuffer(cmd_queue, <cl_mem>, CL_TRUE, sizeof(float)*9, sizeof(float)*16, <*>, 0, NULL, NULL);
Region Transfers
clEnqueueReadBufferRect(command_queue, <gpu buffer>, <blocking>,
    <gpu origin *>, <host origin *>, <region *>,
    gpu_row_pitch, gpu_slice_pitch,
    host_row_pitch, host_slice_pitch,
    <host buffer>, num_event_wait_list,
    event_wait_list, event);
Figure : Rectangular Region Transfers
Example: Squaring Array
C Code
    void squareCPU(float *a, int size) {
        int x;
        for (x = 0; x < size; x++)
            a[x] *= a[x];
    }
OpenCL Code (array-square.c)
    __kernel void square(__global float *a) {
        int x = get_global_id(0);
        a[x] *= a[x];
    }
Squaring Array: Kernel Code
OpenCL Code
    __kernel void square(__global float *a) {
        int x = get_global_id(0);
        a[x] *= a[x];
    }
ANSI C99-based language.
__kernel marks the entry point of the kernel.
get_global_id(0) is a built-in function that returns the work-item's global index in dimension 0.
__global declares that the pointer refers to global memory.
Squaring Array: Host Code
OpenCL Objects
    cl_platform_id
    cl_device_id
    cl_context
    cl_command_queue
    cl_program
    cl_kernel
    cl_mem
Host Code (Platform & Device Calls)
cat /etc/OpenCL/vendors/nvidia.icd
libcuda.so
Devices (Collection of OpenCL Devices that can be used)
Host Code (Context Call)
Prepare Context for Execution (Context is prepared using functions from the OpenCL API)
Squaring Array: Host Context
Figure : Host - Device Access Process (Step 0)
Figure : Host - Device Access Process (Step 3)
Host Code: Command Queue
Command Queue
Coordinate execution
Place commands on the command_queue
    Kernel execution (clEnqueueTask, clEnqueueNDRangeKernel)
    Memory transfers (clEnqueueReadBuffer, clEnqueueWriteBuffer)
    Synchronization, profiling, etc.
Scheduling methods
    In-Order: FIFO queue. All commands execute in the order they appear in the queue.
    Out-of-Order: Commands do not wait for the previous command to finish.
    All coordination of commands is done using event objects
Host Code: Program Source
Program Objects (The executable units (from source) inthe kernel that can be sent to the devices)
Squaring Array: Host Context Objects
Figure : Host - Device Access Process (Step 4+5)
Squaring Array: Host Code
Kernels (Collection of OpenCL functions that can be executed)
Source Code Errors?
clGetProgramBuildInfo()
    char buffer[2048];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          sizeof(buffer), buffer, NULL);
    printf("Build Log: %s\n", buffer);
Squaring Array: Host Context Objects
Figure : Host - Device Access Process (Step 7)
Squaring Array: Host Code
Memory Objects (Visible to both Host and GPU device)
Squaring Array: Completed Context
Figure : Host - Device Access Process (Step 7)
Figure : Context Sharing
Squaring Array: Host Code
OpenCL Basis
C99 with some extensions and restrictions
Why not C11?
    Multi-threading support (brings mutexes and CPU-specific thread storage mechanisms)
Structure alignments enforced
    struct {            struct {
        char c;             char c;
        int i;              char padding[3];
    }                       int i;
                        }
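The effect of the enforced alignment in the two structures above can be observed in host C as well. The sketch below assumes a platform where int is 4 bytes wide and 4-byte aligned; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* The compiler inserts padding after c so that i is aligned. */
struct Implicit { char c; int i; };

/* The same layout with the padding written out by hand
   (assumes a 4-byte, 4-byte-aligned int). */
struct Explicit { char c; char padding[3]; int i; };

size_t offset_of_i_implicit(void) { return offsetof(struct Implicit, i); }
size_t offset_of_i_explicit(void) { return offsetof(struct Explicit, i); }
size_t size_implicit(void)        { return sizeof(struct Implicit); }
size_t size_explicit(void)        { return sizeof(struct Explicit); }
```

Under the stated assumption both structures place i at the same offset and have the same size, which is why a host and a device that agree on these rules can exchange such structures byte for byte.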
Restricted adaptation of C99
No variable length arrays (including structures with unsized arrays)
    int myArray[];                      ✗
No C99 standard headers
Built-in functions also do not require headers
    #include <stdio.h>, etc.            ✗
No usage of extern, static specifiers
    extern int var1;                    ✗
    static int var2;                    ✗
Restricted Adaptation of C99
No recursion
Return type of a __kernel is always void
    __kernel void myKernel( .. params ..)               ✓
All pointer kernel parameters must be either __global, __constant, or __local
Double precision is optional. It may or may not be supported
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable       ✓
Scalar / Vector Data Types
Scalars (char, uchar, short, ushort, int, uint, long, ulong, float, double)
Vector Types (group of scalars)
    char2, float4, short8
    sizes (2, 4, 8, 16)
Declaration/Initialization
    float4 vectorA = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float4 vectorB = sin(vectorA);
Caution
    Large vectors increase register pressure
    Sequential execution on incompatible (non-SIMD) devices
Vector Component Access
Size ≤ 4 (.xyzw)
    float4 vecA = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float4 vecB = vecA.xyzw;  // 1,2,3,4
    float4 vecC = vecA.xyxy;  // 1,2,1,2
Size > 4 (.s<n>)
    float8 vecD = (float8)(vecA, 5.0f, 6.0f, 7.0f, 8.0f);
    float8 vecE = (float8)(vecA.xyzw, vecD.s4567);
    float16 vecF = (float16)(vecE, vecE.s76543210);
    float16 vecG = (float16)(vecF.sfedcba9876543210);
Odd/Even (.odd, .even)
    float8 vecH = (float8)(vecG.odd);
High/Low (.hi, .lo)
    float8 vecI = (float8)(vecG.hi);
Mixing Scalars and Vectors
Scalar to Vector Casting (Note same address space)
    __kernel void myKernel(__global float *dataFromCPU) {
        __global float4 *ptr = (__global float4*)dataFromCPU;
        int i;
        for (i = 0; i < {Size from CPU}/4; i++)
            ...ptr[i].s0123...
    }
Loading Scalar Data into a Vector
    vector vloadn(size_t offset,
                  __(global|constant|local|private) scalar *mem)

    array = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
    int4 vec1 = vload4(0, array);    // 0, 1, 2, 3
    int4 vec2 = vload4(1, array);    // 4, 5, 6, 7
    int4 vec3 = vload4(1, array+2);  // 6, 7, 8, 9
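The addressing rule behind the three vload4 results above (the offset counts whole vectors, so the load starts at mem + offset * 4) can be emulated in plain C. This is a host-side sketch only: int4_host and vload4_host are hypothetical stand-ins, not OpenCL types or built-ins.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for the OpenCL int4 type. */
typedef struct { int s[4]; } int4_host;

/* Emulates the addressing of vload4: reads four consecutive
   elements starting at mem + offset * 4. */
int4_host vload4_host(size_t offset, const int *mem)
{
    int4_host v;
    for (size_t k = 0; k < 4; k++)
        v.s[k] = mem[offset * 4 + k];
    return v;
}
```

With the array from the slide, vload4_host(1, array + 2) starts at element 2 + 1 * 4 = 6, which gives the components 6, 7, 8, 9.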
Mixing Scalars and Vectors
Storing Vector Data into a Scalar Array
    void vstoren(vector vec, size_t offset,
                 __(global|local|private) scalar *mem)

    vstore8(float_vector, 0, float_array);  // Store at start
    vstore4(int_vector, 6, int_array);      // Store at int_array + 6*4 (offset counts whole vectors)
Example (Reversing an Array)
array-reverse.c
    __kernel void square(__global float *a, __global float *b) {
        int x = get_global_id(0);
        int s = get_global_size(0);
        __global float4 *in = (__global float4*)a;
        __global float4 *out = (__global float4*)b;
        out[s-x-1] = in[x].s3210;
    }
Pseudo In-Place Variant
    __kernel void square(__global float *a) {
        int x = get_global_id(0);
        int s = get_global_size(0);
        __global float4 *in = (__global float4*)a;
        float4 loc = in[x].s3210;   // private copy (a value, not a pointer)
        in[s-x-1] = loc;            // not synchronized with the work-item that reads in[s-x-1]
    }
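The out-of-place reversal can be traced on the host with a sequential loop in which each iteration plays the role of one work-item. This is a sketch with a hypothetical function name; it assumes the element count n is a multiple of 4 and that the two arrays do not overlap.

```c
#include <stddef.h>

/* Sequential emulation of the out-of-place kernel: "work-item" x
   reads the x-th group of four floats, reverses its components
   (the .s3210 swizzle), and writes it to group s - x - 1. */
void reverse_by_float4(const float *a, float *b, size_t n)
{
    size_t s = n / 4;               /* number of float4 groups */
    for (size_t x = 0; x < s; x++) {
        const float *in  = a + 4 * x;
        float       *out = b + 4 * (s - x - 1);
        out[0] = in[3];
        out[1] = in[2];
        out[2] = in[1];
        out[3] = in[0];
    }
}
```

Reversing the components inside each group and the order of the groups together reverses the whole array, which is why the kernel needs only size/4 work-items.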
Example (Reversing an Array)
Memories
    size_t t = sizeof(float)*size;
    cl_mem dataIn, dataOut;
    dataIn = clCreateBuffer(context, CL_MEM_READ_WRITE, t, NULL, NULL);
    dataOut = clCreateBuffer(context, CL_MEM_READ_WRITE, t, NULL, NULL);
Kernel Arguments
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&dataIn);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dataOut);
NDRange Sizes
    array-square.c
        globalSize[0] = size;
        localSize[0] = 16;
    array-reverse.c
        globalSize[0] = size/4;
        localSize[0] = globalSize[0]/2;
Dimension Rules
NDRange $(N_x, N_y, N_z)$ can be any value (preferably $2^n$)
Work Group Sizes $(WG_x, WG_y, WG_z) \le (N_x, N_y, N_z)$
Exactly Divisible
    $N_x \bmod WG_x = 0$
    $N_y \bmod WG_y = 0$
    $N_z \bmod WG_z = 0$
Range Barrier 1
    $1 \le WG_x \le 512$
    $1 \le WG_y \le 512$
    $1 \le WG_z \le 64$
    Max limit is hardware dependent
Range Barrier 2
    $WG_x \times WG_y \times WG_z \le 512$
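The three rules above (exact divisibility, the per-dimension limit, and the limit on the total work-group size) can be collected into one host-side check. This is a sketch: the function name is hypothetical, and the limits are passed in as parameters because they are hardware dependent; on a real device they would be queried rather than hard-coded.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the work-group sizes satisfy
   the dimension rules for the given NDRange, 0 otherwise. */
int valid_sizes(unsigned dims, const size_t *global, const size_t *local,
                const size_t *max_item_sizes, size_t max_group_size)
{
    size_t total = 1;
    for (unsigned d = 0; d < dims; d++) {
        /* Range barrier 1: per-dimension limit. */
        if (local[d] < 1 || local[d] > max_item_sizes[d])
            return 0;
        /* The global size must be exactly divisible. */
        if (global[d] % local[d] != 0)
            return 0;
        total *= local[d];
    }
    /* Range barrier 2: total work-items per work-group. */
    return total <= max_group_size;
}
```

With the limits quoted on this slide (512, 512, 64 per dimension and 512 in total), a local size of (256, 2, 1) passes, while (256, 2, 2) fails because the product is 1024.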
Dimension Rules
Examples
    globalSize[0] = 1024, localSize[0] = 2          ✓

    globalSize[0] = 2048, localSize[0] = 1024       ✗

    globalSize[0] = 1024, localSize[0] = 2
    globalSize[1] = 1024, localSize[1] = 2
    globalSize[2] = 32,   localSize[2] = 2          ✓

    globalSize[0] = 1024, localSize[0] = 256
    globalSize[1] = 1024, localSize[1] = 2
    globalSize[2] = 32,   localSize[2] = 1          ✓

    globalSize[0] = 1024, localSize[0] = 256
    globalSize[1] = 1024, localSize[1] = 2
    globalSize[2] = 32,   localSize[2] = 2          ✗
Vector Relational Operators
Relational comparisons between vectors return a vectorized TRUE/FALSE (-1/0)
array-compare.c
    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        __global int4 *in = (__global int4*)a;
        in[x] = (float4)(1.0f, 1.0f, 1.0f, 1.0f) ==
                (float4)(0.0f, 1.0f, 2.0f, 3.0f);
    }
Output
    0 -1 0 0 0 -1 0 0
Scalar Relational Operators
Relational comparisons between scalars return TRUE/FALSE (1/0)
array-compare.c
    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        a[4*x+0] = 1.0f == 0.0f;
        a[4*x+1] = 1.0f == 1.0f;
        a[4*x+2] = 1.0f == 2.0f;
        a[4*x+3] = 1.0f == 3.0f;
    }
Output
    0 1 0 0 0 1 0 0
Vectors in If and While Structures
array-compare.c
    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        __global int4 *in = (__global int4*)a;
        float4 b = (float4)(1.0f, 1.0f, 1.0f, 1.0f);
        float4 c = (float4)(0.0f, 1.0f, 2.0f, 3.0f);
        if (b < c) in[x] = 2;
        else in[x] = 3;
    }
A vector comparison yields a vector, not a scalar, so the branches cannot be selected this way
Solution: any/all
    if (any(b < c)) in[x] = 2;
    else in[x] = 3;
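What any and all do with the comparison result can be emulated on the host. The sketch below uses hypothetical stand-in types and function names; it follows the convention from the relational-operator slides that a true component is -1 (all bits set) and a false component is 0.

```c
/* Hypothetical host-side stand-ins for float4 and int4. */
typedef struct { float s[4]; } float4_host;
typedef struct { int s[4]; } int4_host;

/* Component-wise b < c: -1 where true, 0 where false. */
int4_host less_than_host(float4_host b, float4_host c)
{
    int4_host m;
    for (int k = 0; k < 4; k++)
        m.s[k] = (b.s[k] < c.s[k]) ? -1 : 0;
    return m;
}

/* 1 if the most significant bit of any component is set. */
int any_host(int4_host m)
{
    for (int k = 0; k < 4; k++)
        if (m.s[k] < 0) return 1;
    return 0;
}

/* 1 if the most significant bit of every component is set. */
int all_host(int4_host m)
{
    for (int k = 0; k < 4; k++)
        if (m.s[k] >= 0) return 0;
    return 1;
}
```

For b = (1, 1, 1, 1) and c = (0, 1, 2, 3) the mask is (0, 0, -1, -1), so any is true and all is false; the any() version of the kernel therefore takes the first branch.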
Vectors in Ternary Operators
array-compare.c
    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        __global int4 *in = (__global int4*)a;
        float4 b = (float4)(1.0f, 1.0f, 1.0f, 1.0f);
        float4 c = (float4)(0.0f, 1.0f, 2.0f, 3.0f);
        in[x] = b < c ? 1 : 2;
    }
Output
    2 2 1 1
Scalar syntax: <relation> ? TRUE : FALSE
Vector syntax: <relation> ? TRUE : FALSE, applied per component; equivalent to select(FALSE, TRUE, <relation>), which takes the FALSE value first
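The per-component selection that produces the output above can be emulated on the host. This is a sketch with hypothetical names; the argument order mirrors the built-in select(a, b, c), where b is taken for components whose condition has its most significant bit set and a otherwise.

```c
/* Hypothetical host-side stand-in for int4. */
typedef struct { int s[4]; } int4_host;

/* Per-component selection in the argument order of select():
   if_true where the condition component is negative (MSB set),
   if_false otherwise. */
int4_host select_host(int4_host if_false, int4_host if_true, int4_host cond)
{
    int4_host r;
    for (int k = 0; k < 4; k++)
        r.s[k] = (cond.s[k] < 0) ? if_true.s[k] : if_false.s[k];
    return r;
}
```

With the condition (0, 0, -1, -1) from b < c, a true value of 1 and a false value of 2, the result is (2, 2, 1, 1), matching the kernel output.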
Floating Point Computation Background
Data Types and Range
IEEE-754 and OpenCL
    Supported: float
    Optional: double & half
    Optional?
IEEE 754 Number Modes
    1 Normal numbers
    2 Denormal numbers
    3 Infinite numbers
    4 NaNs
IEEE 754 Rounding Modes
    Round to nearest even
    Round towards +∞
    Round towards −∞
    Round towards 0
Float
Performance optimized for floats
32-bit (1 sign, 8 exponent, 23 fraction)
Normal Range ≈ $1.18 \times 10^{-38} \ldots 3.4 \times 10^{38}$
Number Modes Supported
    Normal
    Infinite
    NaN
Why not Denormal? Takes more cycles.
Rounding Modes Supported
    Round to Nearest Even
Double
64-bit (1 sign, 11 exponent, 52 fraction)
Normal Range ≈ $2.2 \times 10^{-308} \ldots 1.8 \times 10^{308}$ (denormals extend down to ≈ $4.9 \times 10^{-324}$)
Optional. To check
    cl_uint putHere;   // 0 means double is not supported
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                    sizeof(putHere), &putHere, NULL);
To enable
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
Double
Double type supports: Normal, Denormal, Infinite, NaN
Why Denormal in double and not in float?
    $(x - y) / z$
    For performance, disable support using -cl-denorms-are-zero in clBuildProgram
Rounding Modes Supported: All
FMA Support
    double a, b, c;
    a*b+c
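One way to see why denormals matter for an expression such as (x − y)/z is to take two distinct normal doubles whose difference falls below the smallest normal value. The host-side C sketch below uses hypothetical function names and assumes IEEE-754 doubles with denormal support enabled (the usual default for host compilers); it is an illustration, not device code.

```c
#include <float.h>

/* Difference of two distinct normal doubles that is itself smaller
   than the smallest normal double (DBL_MIN). The volatile qualifiers
   keep the subtraction from being folded away. */
double tiny_difference(void)
{
    volatile double x = 1.5 * DBL_MIN;
    volatile double y = DBL_MIN;
    return x - y;   /* 0.5 * DBL_MIN, representable only as a denormal */
}

/* 1 if v is a positive value below the normal range. */
int is_positive_denormal(double v)
{
    return v > 0.0 && v < DBL_MIN;
}
```

With denormals the difference is small but nonzero, so x != y implies x − y != 0. If denormals were flushed to zero, the same subtraction would yield 0 and the subsequent division would silently lose the result.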
Half
16-bit (1 sign, 5 exponent, 10 fraction)
Normal Range ≈ $6.1 \times 10^{-5} \ldots 65504$
Optional. To check:
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF
To enable:
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
Number Modes supported (Normal, Infinite, NaN)
Rounding Modes supported: Either
    ±∞, OR
    Nearest Even
FMA Support: No
Default Modes
Operation Type                    Rounding Mode
Floating-point arithmetic         RTE
Built-in functions (e.g., math)   RTE
cast float→int                    RTZ
cast int→float                    RTE
Table : Default OpenCL Rounding Modes
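Host C behaves analogously for the two cast rows of the table, which makes the difference between RTZ and RTE easy to observe. The sketch below uses hypothetical function names and assumes IEEE-754 single precision with the default round-to-nearest-even mode on the host; it illustrates the rounding modes and is not OpenCL device code.

```c
/* float -> int: the fractional part is discarded (round toward zero). */
int float_to_int(float f)
{
    return (int)f;
}

/* int -> float: values that do not fit in the 24-bit significand are
   rounded; under the default mode, ties go to the even neighbour. */
float int_to_float(int i)
{
    return (float)i;
}
```

For instance, 2.7f and -2.7f convert to 2 and -2 (toward zero), while 16777217 lies exactly halfway between the representable floats 16777216 and 16777218 and is rounded to the one with the even significand.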
Summary
Parameter       Float   Double   Half
Normal          ✓       ✓        ✓
Denormalized    ✗       ✓        ✗
Infinity/NaN    ✓       ✓        ✓
Round Even      ✓       ✓        either this
Round ±∞        ✗       ✓        or this
Round Zero      ✗       ✓        ✗
FMA             ✗       ✓        ✗
Table : IEEE-754 and OpenCL Compliancy