Tutorial Based Lecture Notes on OpenCL 01

For PhD level course
Parallel & Distributed Computing

Politecnico di Torino, Italy
June, 2012

Omar U. Khan


1 Overview
2 OpenCL Specification
    Platform Model
    Execution Model
    Memory Model
    Programming Model
    Memory Objects
    Memory Transfers
3 First OpenCL Program
4 The Language
    Vector Types
    Floating Point


Background

Figure : Bridging the CPU/GPU Gap


Background

Figure : Top 7 Supercomputers

[Img] http://top500.org (Retrieved 20th June, 2012)


Introduction to OpenCL

New standard from Khronos (v1.0 released December 2008)
Open Specification
Cross Platform/Vendor (CPU, GPU, DSP)
GPU Giants (NVIDIA + AMD)
OpenCL is a {Language, Vendor API, Native Library, Runtime SIMT based Environment}


Khronos Group

Figure : Khronos Group Open Specifications

[Img] http://www.khronos.org/about


Why OpenCL

Why is OpenCL different?
There are already dozens of parallel languages. Why learn a new one?
OpenCL and CUDA
Summary of Benefits
    Portability
    Compatibility (OpenGL + OpenCL)
    Already have a GPU background? Easy to learn.
    You probably already have OpenCL enabled device(s).


Available OpenCL Distributions

NVIDIA http://www.developer.nvidia.com/object/opencl.html
    Tesla, Fermi, GeForce, etc. (Most covered)
IBM http://www.alphaworks.ibm.com/tech/opencl
    Cell-BE, Power6-7
Intel http://whatif.intel.com
Samsung http://opencl.snu.ac.kr
    ARM, DSP, Cell-BE


Resources to Get You Started (NVIDIA)

Optimization Best Practices Guide
OpenCL Programming Guide
OpenCL Jumpstart Guide


Resources to Get You Started

gpgpu.org
Khronos OpenCL Specification
Khronos OpenCL Quick Reference Card


Resources to Get You Started

~/SDK/OpenCL

Examples


OpenCL Specification

Platform Model
    Host - OpenCL Enabled Devices
Programming Model
    Data Parallel Model
    Task Parallel Model
Execution Model
    SIMD Model
    SIMT Model
Memory Model
    Memory Address Spaces
    Address Hierarchy


Platform Model

Figure : Platform Model


oclDeviceQuery (GTX 260)


Platform Model

Figure : NVIDIA GTX-260 Compute Units


OpenCL Execution Model

Helpful Nomenclature
    Host: CPU side of the code (context management)
    Kernel: The piece of code executed by each work-item
    Thread: Smallest compute unit, executing in a warp
    Warp: Hardware enforced thread grouping (of 32 threads)
    Half-Warp: Group of 16 threads
    NDRange (Index space): Total number of work-items along a certain direction
    Work-Group: Programmer enforced thread grouping (up to 512 threads, 1024 on Fermi)
    Work-item: Thread


oclDeviceQuery (Execution Model)


Index Space Preparation

Prepare Index Space (called NDRange)
An instance of the kernel executes for each co-ordinate (global ID)
Arrangement of work-items into work-groups.

Figure : Index Space for Kernel
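
When arranging work-items into work-groups, OpenCL 1.x expects the global size in each dimension to be a multiple of the work-group size. A common way to satisfy this is to pad the global size upward and let the kernel skip the extra work-items. The following plain C sketch shows only the host-side arithmetic; the function names `round_up_global_size` and `num_work_groups` are hypothetical helpers, not part of the OpenCL API.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): smallest multiple of wg_size
 * that is >= n. wg_size must be non-zero. */
size_t round_up_global_size(size_t n, size_t wg_size)
{
    size_t remainder = n % wg_size;
    return remainder == 0 ? n : n + (wg_size - remainder);
}

/* Hypothetical helper: number of work-groups along one dimension,
 * assuming global_size is already a multiple of wg_size. */
size_t num_work_groups(size_t global_size, size_t wg_size)
{
    return global_size / wg_size;
}
```

For 1000 data elements and a work-group size of 64, the padded global size is 1024, giving 16 work-groups. A kernel launched this way would typically compare its global ID against the real element count before touching memory.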


Index Space Identifiers

Figure : Identification Scheme for Work-Items


Index Space Identifiers

Figure : Identification Scheme for Work-Items (Variant)


Index Space Identifiers

Figure : Identification Scheme for Multi-dimensional Data


Summary

Why the need for index space?
Answer: SIMD (Single Instruction, Multiple Data)
Efficient way of enforcing scalability.

NDRange - Global Sizes $(N_x, N_y)$
Work Group Sizes $(WG_x, WG_y)$
Identifiers
    Local ID $(L_x, L_y)$
    Work Group ID $(W_x, W_y) = \left(\frac{G_x - L_x}{WG_x}, \frac{G_y - L_y}{WG_y}\right)$
    Global ID $(G_x, G_y) = (W_x WG_x + L_x, W_y WG_y + L_y)$
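
The identifier relations above can be checked with a small host-side C sketch for a single dimension. The struct and function names (`work_item_ids`, `ids_from_global`, `global_from_ids`) are illustrative assumptions; inside a real kernel these values come from the built-in work-item functions rather than from user code.

```c
#include <stddef.h>

/* Identifiers of one work-item along a single dimension (illustrative type). */
typedef struct {
    size_t global_id; /* G: position in the whole NDRange */
    size_t group_id;  /* W: which work-group the item belongs to */
    size_t local_id;  /* L: position inside that work-group */
} work_item_ids;

/* Split a global ID into (group ID, local ID) for work-group size wg. */
work_item_ids ids_from_global(size_t global_id, size_t wg)
{
    work_item_ids ids;
    ids.global_id = global_id;
    ids.local_id = global_id % wg;                  /* L = G mod WG    */
    ids.group_id = (global_id - ids.local_id) / wg; /* W = (G - L)/WG  */
    return ids;
}

/* Rebuild the global ID from group and local IDs: G = W * WG + L. */
size_t global_from_ids(size_t group_id, size_t local_id, size_t wg)
{
    return group_id * wg + local_id;
}
```

With a work-group size of 4, global ID 13 maps to work-group 3 and local ID 1, and recombining the two gives 13 again. The same arithmetic applies independently to each dimension of the NDRange.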


Built-In Functions for Fetching Index Map

Figure : Index-Map Built-in Functions


Hardware Mapping of Work-Items

Figure : Hardware Mapping of Index-Map

Figure : Memory Mapping for Work-Items


SIMD & SIMT

SIMD: single instruction, multiple data streams
SIMT: single instruction, multiple threads
SIMT focuses on the execution and branching behaviour of the THREAD
SIMD focuses on the selection and execution along a data-path.
OpenCL programmers usually ignore the SIMT behaviour of threads.
For peak performance, it should be taken into consideration.
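
Why branching matters for peak performance can be illustrated with a deliberately simplified model: when the threads of one warp disagree on an if/else, the hardware runs both sides one after the other with some lanes masked off. The C sketch below only counts those passes. `WARP_SIZE` and `branch_passes` are hypothetical names, and real hardware scheduling is more involved than this model assumes.

```c
#include <stddef.h>

#define WARP_SIZE 32 /* warp width assumed from the nomenclature slide */

/* Simplified model (an assumption, not a hardware specification):
 * count how many passes a warp needs for one if/else. One pass if all
 * lanes take the same side, two if both sides are taken by some lane. */
int branch_passes(const int take_if[WARP_SIZE])
{
    int any_if = 0, any_else = 0;
    for (size_t lane = 0; lane < WARP_SIZE; lane++) {
        if (take_if[lane])
            any_if = 1;
        else
            any_else = 1;
    }
    return any_if + any_else;
}
```

Under this model a warp whose lanes all take the same side needs one pass, while a warp that alternates lane by lane needs two, which is the cost that branch-aware kernels try to avoid.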


GPU Memory Model

Figure : Memory Model
Img: Björn König (http://de.wikipedia.org/wiki/OpenCL)


GPU Memory Model

Data Movement
    (host ↔ global ↔ local | private)
    (host ↔ constant → global | local | private)
    Why migrate data?
Address Space Specifiers
    (__global, __constant, __local, __private)
__local cannot be directly initialized. Usage:

    __local float x;
    x = 4.0f;

No address specifier means __private by default


Memory Overview

Memory     Chip   Cache     Access   Scope       Life
Private    On     -         RW       1 WI        WI
Local      On     -         RW       1 WG        WG
Global     Off    Yes/No*   RW       All + CPU   Host/clRelease()
Constant   Off    Yes       RO       All + CPU   Host/clRelease()

Table : Memory overview

* On GPU Fermi architectures (and OpenCL for CPU), there is cache support


Memory Overview (GTX-260)

Figure : Architecture Overview of Compute Units and Memory on NVIDIA GTX-260


Memory Overview (GTX-260)


Programming Model

Data Parallel
    One sequence of instructions applied to multiple memory objects
    1:1 mapping between work-item and memory location(s)
    SIMD
Task Parallel
    Single instance of kernel runs without any index space.
    Equivalent to work done by a single work-item in a work-group.
    For parallelism
        Use Vector Data Types
        Enqueue Multiple Tasks
Hybrids also supported


Array Multiplication (Data Parallel)

for (i=0; i<N; i++) C[i] = A[i] * B[i];

Figure : Point-Wise Array Multiplication
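
One way to read the loop above in data-parallel terms: the loop body becomes the work done by a single work-item, and the loop counter becomes that work-item's global ID. The sketch below emulates this on the host in plain C. `multiply_work_item` and `multiply_ndrange` are hypothetical names, and the sequential loop merely stands in for what a device would run concurrently.

```c
#include <stddef.h>

/* The loop body, written as a function of the index one work-item owns. */
void multiply_work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] * b[gid];
}

/* Host-side stand-in for a 1-D NDRange of n work-items (assumption:
 * a real device would execute these instances in parallel). */
void multiply_ndrange(size_t n, const float *a, const float *b, float *c)
{
    for (size_t gid = 0; gid < n; gid++)
        multiply_work_item(gid, a, b, c);
}
```

Because each call touches only index `gid`, the iterations are independent, which is what makes the 1:1 mapping between work-items and memory locations possible.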


Frequency Space Convolution (Task Parallel)

Figure : Frequency Space Convolution Operation


Example

Input:

Output:
$C = \begin{bmatrix} A(:,1) + B(:,1) & A(:,2) - B(:,2) & A(:,3) * B(:,3) & A(:,4) / B(:,4) \end{bmatrix}$


Example (Data Parallel)

for (j = 0 ... max(j)) {
    base = j * 4
    C[base + 0] = A[base + 0] + B[base + 0]
    C[base + 1] = A[base + 1] - B[base + 1]
    C[base + 2] = A[base + 2] * B[base + 2]
    C[base + 3] = A[base + 3] / B[base + 3]
}


Example (Task Parallel)

base = 4
for (j = 0 ... max(j)) C[j * base + 0] = A[j * base + 0] + B[j * base + 0]
for (j = 0 ... max(j)) C[j * base + 1] = A[j * base + 1] - B[j * base + 1]
for (j = 0 ... max(j)) C[j * base + 2] = A[j * base + 2] * B[j * base + 2]
for (j = 0 ... max(j)) C[j * base + 3] = A[j * base + 3] / B[j * base + 3]
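
The two decompositions compute the same result and differ only in which loop is outermost. The plain C sketch below makes that concrete, assuming A, B and C are stored row by row with four values per row. All function names are hypothetical, and both versions run sequentially here purely for comparison.

```c
#include <stddef.h>

/* Apply operation op (0: +, 1: -, 2: *, 3: /) to one element pair. */
static float apply_op(size_t op, float x, float y)
{
    switch (op) {
    case 0:  return x + y;
    case 1:  return x - y;
    case 2:  return x * y;
    default: return x / y;
    }
}

/* Data-parallel order: outer loop over rows, all four operations per row. */
void compute_by_row(size_t rows, const float *a, const float *b, float *c)
{
    for (size_t j = 0; j < rows; j++) {
        size_t base = j * 4;
        for (size_t op = 0; op < 4; op++)
            c[base + op] = apply_op(op, a[base + op], b[base + op]);
    }
}

/* Task-parallel order: one independent loop (task) per operation. */
void compute_by_task(size_t rows, const float *a, const float *b, float *c)
{
    const size_t base = 4;
    for (size_t op = 0; op < 4; op++)
        for (size_t j = 0; j < rows; j++)
            c[j * base + op] = apply_op(op, a[j * base + op], b[j * base + op]);
}
```

In the first form each row could be handed to its own work-item; in the second, each of the four operation loops could be enqueued as a separate task.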


Example (Data Parallel)

Data Parallel (data-taskv1.c)

    Y = get_global_id(1)
    base = Y*4;
    C[base + 0] = A[base + 0] + B[base + 0]
    C[base + 1] = A[base + 1] - B[base + 1]
    C[base + 2] = A[base + 2] * B[base + 2]
    C[base + 3] = A[base + 3] / B[base + 3]

NDRange=1,4
WG Size=1,1
clEnqueueNDRangeKernel(..NDRange,WG Size..)


Example (Task Parallel)

Task Parallel

    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].x = inputA[0].x + inputB[0].x;
    output[1].x = inputA[1].x + inputB[1].x;
    output[2].x = inputA[2].x + inputB[2].x;
    output[3].x = inputA[3].x + inputB[3].x;
    output[0].y = inputA[0].y - inputB[0].y;
    output[1].y = inputA[1].y - inputB[1].y;
    output[2].y = inputA[2].y - inputB[2].y;
    output[3].y = inputA[3].y - inputB[3].y;
    output[0].z = inputA[0].z * inputB[0].z;
    output[1].z = inputA[1].z * inputB[1].z;
    output[2].z = inputA[2].z * inputB[2].z;
    output[3].z = inputA[3].z * inputB[3].z;
    output[0].w = inputA[0].w / inputB[0].w;
    output[1].w = inputA[1].w / inputB[1].w;
    output[2].w = inputA[2].w / inputB[2].w;
    output[3].w = inputA[3].w / inputB[3].w;

clEnqueueTask(kernel)


Example (Task Parallel)

Kernel A

__kernel void mod1tpA(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].x = inputA[0].x + inputB[0].x;
    output[1].x = inputA[1].x + inputB[1].x;
    output[2].x = inputA[2].x + inputB[2].x;
    output[3].x = inputA[3].x + inputB[3].x;
}

Kernel B

__kernel void mod1tpB(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].y = inputA[0].y - inputB[0].y;
    output[1].y = inputA[1].y - inputB[1].y;
    output[2].y = inputA[2].y - inputB[2].y;
    output[3].y = inputA[3].y - inputB[3].y;
}


Example (Task Parallel)

Kernel C

__kernel void mod1tpC(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].z = inputA[0].z * inputB[0].z;
    output[1].z = inputA[1].z * inputB[1].z;
    output[2].z = inputA[2].z * inputB[2].z;
    output[3].z = inputA[3].z * inputB[3].z;
}

Kernel D

__kernel void mod1tpD(...) {
    __global float4 *inputA = (__global float4*)A;
    __global float4 *inputB = (__global float4*)B;
    __global float4 *output = (__global float4*)C;
    output[0].w = inputA[0].w / inputB[0].w;
    output[1].w = inputA[1].w / inputB[1].w;
    output[2].w = inputA[2].w / inputB[2].w;
    output[3].w = inputA[3].w / inputB[3].w;
}

clEnqueueTask(kernelA); clEnqueueTask(kernelB);
clEnqueueTask(kernelC); clEnqueueTask(kernelD);


Programming Model

Which Model to Adopt?
Considerations:
    OpenCL supports both
    Limited Vector Lengths (2,4,8,16)
    Task Parallel not just limited to Vectors
    Amount of code to write?
    Single code may have BOTH models


Memory Object Types

Buffer
    Contiguous allocation of memory (same principle as malloc)
    Scalar
    Vector
Sub-Buffer
    Associate region within buffer
    Distribute regions across n devices
    Automatic synchronization between parent-child buffer
Image
    Multi-dimensional structure (width-height-depth)
    Direct link to texture hardware
    Supports sampling, filtering


Bandwidth

Varies for each level of memory hierarchy
Host to Device transfers via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
./oclBandwidthTest
    Device to Host vs Device to Device
    Choice is Obvious
Solutions
    Batch Transfers
    Latency Hiding
    Use Pinned Memory (Covered in Optimizations)


Offset Transfers

clEnqueueWriteBuffer(cmd_queue, <cl_mem>, CL_TRUE, 0, sizeof(float)*16, <*>+9, 0, NULL, NULL);

clEnqueueReadBuffer(cmd_queue, <cl_mem>, CL_TRUE, 0, sizeof(float)*16, <*>+9, 0, NULL, NULL);

Figure : Offset Transfers

clEnqueueWriteBuffer(cmd_queue, <cl_mem>, CL_TRUE, sizeof(float)*9, sizeof(float)*16, <*>, 0, NULL, NULL);

clEnqueueReadBuffer(cmd_queue, <cl_mem>, CL_TRUE, sizeof(float)*9, sizeof(float)*16, <*>, 0, NULL, NULL);
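
The two variants differ in where the offset is applied: on the host pointer (counted in elements, through pointer arithmetic) or inside the device buffer (the offset parameter of clEnqueueWriteBuffer and clEnqueueReadBuffer is counted in bytes). The following plain C sketch imitates a write with `memcpy` so the difference can be checked without a device. `emulate_write_buffer` is a hypothetical stand-in, not an OpenCL function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for a buffer write: copy size bytes
 * from host_ptr into the "device" allocation, starting offset bytes
 * from its beginning. Both offset and size are in bytes. */
void emulate_write_buffer(void *device_mem, size_t offset, size_t size,
                          const void *host_ptr)
{
    memcpy((unsigned char *)device_mem + offset, host_ptr, size);
}
```

Calling `emulate_write_buffer(dev, 0, sizeof(float)*16, host + 9)` copies host elements 9..24 to the start of the device array, whereas `emulate_write_buffer(dev, sizeof(float)*9, sizeof(float)*16, host)` copies host elements 0..15 into device elements 9..24.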


Region Transfers

clEnqueueReadBufferRect(command_queue, <gpu buffer>, <blocking>,
                        <gpu *>, <host *>, <region>,
                        gpu_row_pitch, gpu_slice_pitch,
                        host_row_pitch, host_slice_pitch,
                        <host buffer>, num_event_wait_list,
                        event_wait_list, event);

Figure : Rectangular Region Transfers
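
How origins and row pitches select a rectangle can be sketched in plain C for the 2-D case. This is an emulation under simplifying assumptions (no slice pitch, no event handling), and `emulate_read_rect_2d` is a hypothetical name. As in the real call, x origins, the region width and the pitches are counted in bytes, while y origins and the region height are counted in rows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 2-D emulation of a rectangular read: copy a
 * width_bytes x rows rectangle from src to dst. Each side has its own
 * origin (x in bytes, y in rows) and row pitch (bytes per full row). */
void emulate_read_rect_2d(const unsigned char *src, size_t src_x, size_t src_y,
                          size_t src_row_pitch,
                          unsigned char *dst, size_t dst_x, size_t dst_y,
                          size_t dst_row_pitch,
                          size_t width_bytes, size_t rows)
{
    for (size_t r = 0; r < rows; r++) {
        const unsigned char *from = src + (src_y + r) * src_row_pitch + src_x;
        unsigned char *to = dst + (dst_y + r) * dst_row_pitch + dst_x;
        memcpy(to, from, width_bytes); /* one contiguous row segment */
    }
}
```

Reading the central 2x2 block of a 4x4 byte image (source origin (1, 1), source pitch 4) into a tightly packed 2x2 destination (pitch 2) visits source offsets 5, 6, 9 and 10.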


Example: Squaring Array

C Code

void squareCPU(float *a, int size) {
    int x;
    for (x = 0; x < size; x++)
        a[x] *= a[x];
}

OpenCL Code (array-square.c)

__kernel void square(__global float *a) {
    int x = get_global_id(0);
    a[x] *= a[x];
}


Squaring Array: Kernel Code

OpenCL Code

__kernel void square(__global float *a) {
    int x = get_global_id(0);
    a[x] *= a[x];
}

ANSI C99 based language.
__kernel defines the entry point of the kernel
get_global_id(0) is a built-in function that returns the global index identifier
__global tells the work-item in which address space its memory has been assigned
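
To see what the runtime does with this kernel, the launch can be imitated in plain C: the kernel body runs once per global ID, and the array length is no longer a kernel parameter because it is implied by the size of the index space. Everything in this sketch is an assumption for illustration. `emu_get_global_id`, `current_global_id` and `launch_square` are hypothetical, and the address-space qualifiers are simply dropped.

```c
#include <stddef.h>

/* Set by the emulated runtime before each kernel instance runs. */
static size_t current_global_id;

/* Hypothetical stand-in for get_global_id(); only dimension 0 is modelled. */
static size_t emu_get_global_id(unsigned dim)
{
    return dim == 0 ? current_global_id : 0;
}

/* Kernel body from the slide, without __kernel and __global. */
static void square(float *a)
{
    size_t x = emu_get_global_id(0);
    a[x] *= a[x];
}

/* Emulated 1-D launch: one kernel instance per global ID, run in sequence
 * here although a device would run them concurrently. */
void launch_square(float *a, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; gid++) {
        current_global_id = gid;
        square(a);
    }
}
```

Calling `launch_square(a, 4)` on an array holding 1, 2, 3, -4 leaves 1, 4, 9, 16 in place, matching what `squareCPU` computes with its explicit loop.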


Squaring Array: Host Code


OpenCL Objects
    cl_platform_id
    cl_device_id
    cl_context
    cl_command_queue
    cl_program
    cl_kernel
    cl_mem


Host Code (Platform & Device Calls)

cat /etc/OpenCL/vendors/nvidia.icd
libcuda.so

Devices (Collection of OpenCL Devices that can be used)


Host Code (Context Call)

Prepare Context for Execution (Context is prepared using functions from the OpenCL API)


Squaring Array: Host Context

Figure : Host - Device Access Process (Step 0)

Figure : Host - Device Access Process (Step 3)


Host Code: Command Queue


Command Queue

Coordinate Execution
Place commands on the command_queue
    Kernel execution (clEnqueueTask, clEnqueueNDRangeKernel)
    Memory transfers (clEnqueueReadBuffer, clEnqueueWriteBuffer)
    Synchronization, Profiling, etc
Scheduling Methods
    In-Order: FIFO queue. All commands execute in the order they appear in the queue.
    Out-of-Order: Commands do not wait for the previous command to execute.
    All co-ordination of commands done using event objects


Host Code: Program Source

Program Objects (The executable units (from source) in the kernel that can be sent to the devices)


Squaring Array: Host Context Objects

Figure : Host - Device Access Process (Step 4+5)


Squaring Array: Host Code

Kernels (collection of OpenCL functions that can be executed)

Source Code Errors?

clGetProgramBuildInfo()

    char buffer[2048];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          sizeof(buffer), buffer, NULL);
    printf("Build Log: %s\n", buffer);

Squaring Array: Host Context Objects

Figure : Host - Device Access Process (Step 7)

Squaring Array: Host Code

Memory Objects (Visible to both Host and GPU device)

Squaring Array: Completed Context

Figure : Host - Device Access Process (Step 7)

Figure : Context Sharing

Squaring Array: Host Code

OpenCL Basis

C99 with some extensions and restrictions
Why not C11?
    Multi-threading support (brings mutexes and CPU-specific thread storage mechanisms)
Structure alignments enforced (declared layout on the left, layout with the implied padding on the right)

    struct {                struct {
        char c;                 char c;
        int i;                  char padding[3];
    }                           int i;
                            }
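The padding shown above can be observed on the host with a few lines of plain C. This is only a sketch (C11, not OpenCL C); the names `declared`, `padding_before_i` and `i_is_aligned` are made up for illustration, and the exact number of padding bytes depends on the platform (3 is typical where `int` is 4 bytes).

```c
#include <stddef.h>

/* The struct as declared on the slide (hypothetical name). */
struct declared {
    char c;
    int  i;
};

/* Bytes the compiler inserts between c and i so that i is aligned. */
size_t padding_before_i(void)
{
    return offsetof(struct declared, i) - sizeof(char);
}

/* Returns 1 when i starts on a multiple of its own alignment. */
int i_is_aligned(void)
{
    return offsetof(struct declared, i) % _Alignof(int) == 0;
}
```

Printing `padding_before_i()` and `sizeof(struct declared)` on the host is a quick way to check that a struct shared with a kernel has the layout both sides expect.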

Restricted adaptation of C99

No variable length arrays (including structures with unsized arrays)

    int myArray[]; ✗

No C99 standard headers
Built-in functions also do not require headers

    #include <stdio.h>, etc. ✗

No usage of extern, static specifiers

    extern int var1; ✗
    static int var2; ✗

Restricted Adaptation of C99

No recursion
Return type from __kernel will always be void

    __kernel void myKernel( .. params ..) ✓

All pointer kernel parameters must be either __global, __constant, or __local
Double precision is optional. It may or may not be supported

    #pragma OPENCL EXTENSION cl_khr_fp64: enable ✓
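Since recursion is not available, a recursive helper has to be rewritten as a loop before it can be called from a kernel. A minimal sketch in plain C (the function name `factorial_iter` is illustrative and not part of the course code):

```c
/* Recursive form, not allowed in OpenCL C:
 *     unsigned int factorial(unsigned int n)
 *     { return n <= 1 ? 1 : n * factorial(n - 1); }
 *
 * Iterative form: same result, no call stack needed. */
unsigned int factorial_iter(unsigned int n)
{
    unsigned int result = 1;
    for (unsigned int k = 2; k <= n; k++)
        result *= k;
    return result;
}
```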

Scalar / Vector Data Types

Scalars (char, uchar, short, ushort, int, uint, long, ulong, float, double)
Vector Types (group of scalars)
    char2, float4, short8
    sizes (2, 4, 8, 16); OpenCL 1.1 also adds 3

Declaration/Initialization

    float4 vectorA = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float4 vectorB = sin(vectorA);

Caution
    Large vectors increase register pressure
    Sequential execution on incompatible (non-SIMD) devices

Vector Component Access

Size ≤ 4 (.xyzw)

    float4 vecA = (float4)(1.0f,2.0f,3.0f,4.0f);
    float4 vecB = vecA.xyzw; // 1,2,3,4
    float4 vecC = vecA.xyxy; // 1,2,1,2

Size > 4 (.s<n>)

    float8 vecD = (float8)(vecA,5.0f,6.0f,7.0f,8.0f);
    float8 vecE = (float8)(vecA.xyzw,vecD.s4567);
    float16 vecF = (float16)(vecE,vecE.s76543210);
    float16 vecG = (float16)(vecF.sfedcba9876543210);

Odd/Even (.odd, .even)

    float8 vecH = (float8)(vecG.odd);

High/Low (.hi, .lo)

    float8 vecI = (float8)(vecG.hi);
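Which components the `.lo`, `.hi`, `.even` and `.odd` suffixes pick can be written down as simple index arithmetic. The sketch below is plain C, not OpenCL C, and covers the vector sizes 2, 4, 8 and 16; the four function names are invented for this illustration.

```c
#include <stddef.h>

/* For an n-component vector, each function returns the source index
 * that ends up in component i of the selected half. */
size_t lo_index(size_t n, size_t i)  { (void)n; return i; }   /* first half     */
size_t hi_index(size_t n, size_t i)  { return n / 2 + i; }    /* second half    */
size_t even_index(size_t i)          { return 2 * i; }        /* s0, s2, s4 ... */
size_t odd_index(size_t i)           { return 2 * i + 1; }    /* s1, s3, s5 ... */
```

For the float16 `vecG`, `vecG.hi` therefore holds components s8 to sf, and `vecG.odd` holds s1, s3, up to sf.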

Mixing Scalars and Vectors

Scalar to Vector Casting (Note same address space)

    __kernel void myKernel(__global float *dataFromCPU) {
        __global float4 *ptr = (__global float4*)dataFromCPU;
        int i;
        for (i = 0; i < {Size from CPU}/4; i++)
            ...ptr[i].s0123...
    }

Loading Scalar data to a Vector

    vector vloadn(size_t offset,
                  __(global|constant|local|private) scalar *mem)

    array = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
    int4 vec1 = vload4(0, array);   // 0, 1, 2, 3
    int4 vec2 = vload4(1, array);   // 4, 5, 6, 7
    int4 vec3 = vload4(1, array+2); // 6, 7, 8, 9
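The addressing rule behind `vload4` (the offset counts whole vectors, so the read starts at `mem + offset * 4`) can be emulated in plain C. This is a sketch only: `int4_emul`, `demo_array` and `vload4_emul` are hypothetical stand-ins, not OpenCL names.

```c
#include <stddef.h>

/* Stand-in for OpenCL's int4 (hypothetical type). */
typedef struct { int x, y, z, w; } int4_emul;

/* The array used on the slide. */
static const int demo_array[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

/* Emulates vload4(offset, p): reads 4 ints starting at p + offset * 4. */
int4_emul vload4_emul(size_t offset, const int *p)
{
    const int *base = p + offset * 4;
    int4_emul v = { base[0], base[1], base[2], base[3] };
    return v;
}
```

With these definitions, `vload4_emul(1, demo_array + 2)` starts at element 2 + 4 = 6 and yields 6, 7, 8, 9, matching the third load on the slide.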

Mixing Scalars and Vectors

Storing Vector Data into a Scalar Array

    void vstoren(vector vec, size_t offset,
                 __(global|local|private) scalar *mem)

    vstore8(float_vector, 0, float_array); // Store at start
    vstore4(int_vector, 6, int_array);     // Store at int_array + 6*4 (the offset counts whole vectors)
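As with the load functions, the offset passed to `vstoren` is measured in vectors of n elements, so the write covers `mem[offset * n]` up to `mem[offset * n + n - 1]`. A small plain-C sketch of that arithmetic (both function names are illustrative):

```c
#include <stddef.h>

/* First scalar index written by vstoren(vec, offset, mem). */
size_t vstore_first_index(size_t n, size_t offset)
{
    return offset * n;
}

/* Last scalar index written (inclusive). */
size_t vstore_last_index(size_t n, size_t offset)
{
    return offset * n + n - 1;
}
```

For `vstore4` with offset 6 this gives elements 24 to 27 of the destination array, so the array must hold at least 28 elements.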

Example (Reversing an Array)

array-reverse.c

    __kernel void square(__global float *a, __global float *b) {
        int x = get_global_id(0);
        int s = get_global_size(0);
        __global float4 *in = (__global float4*)a;
        __global float4 *out = (__global float4*)b;
        out[s-x-1] = in[x].s3210;
    }

Pseudo In-Place Variant

    __kernel void square(__global float *a) {
        int x = get_global_id(0);
        int s = get_global_size(0);
        __global float4 *in = (__global float4*)a;
        float4 loc = in[x].s3210;   // private copy (a value, not a pointer)
        // Caution: work-item s-x-1 must have read in[s-x-1] before this
        // write lands; without synchronization this is a data race.
        in[s-x-1] = loc;
    }
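The indexing of the two-buffer kernel can be checked on the host with a sequential emulation in plain C. This is a sketch under the assumption that the array length is a multiple of 4; `reverse_by_float4` is an illustrative name, and each loop iteration plays the role of one work-item.

```c
#include <stddef.h>

/* Emulates the two-buffer reverse kernel; n must be a multiple of 4. */
void reverse_by_float4(const float *a, float *b, size_t n)
{
    size_t s = n / 4;                            /* get_global_size(0) */
    for (size_t x = 0; x < s; x++) {             /* get_global_id(0)   */
        const float *in  = a + 4 * x;            /* in[x]              */
        float       *out = b + 4 * (s - x - 1);  /* out[s-x-1]         */
        for (size_t k = 0; k < 4; k++)
            out[k] = in[3 - k];                  /* the .s3210 swizzle */
    }
}
```

Reversing the order of the float4 blocks and the order of the four components inside each block together reverses the whole array.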

Example (Reversing an Array)

Memories

    size_t t = sizeof(float)*size;
    cl_mem dataIn, dataOut;
    dataIn = clCreateBuffer(context, CL_MEM_READ_WRITE, t, NULL, NULL);
    dataOut = clCreateBuffer(context, CL_MEM_READ_WRITE, t, NULL, NULL);

Kernel Arguments

    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&dataIn);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dataOut);

NDRange Sizes

    array-square.c
        globalSize[0] = size;
        localSize[0] = 16;

    array-reverse.c
        globalSize[0] = size/4;
        localSize[0] = globalSize[0]/2;

Dimension Rules

NDRange $(N_x, N_y, N_z)$ can be any value (preferably $2^n$)
Work Group Sizes $(WG_x, WG_y, WG_z) \le (N_x, N_y, N_z)$

    Exactly Divisible
        $N_x \bmod WG_x = 0$
        $N_y \bmod WG_y = 0$
        $N_z \bmod WG_z = 0$
    Range Barrier 1
        $1 \le WG_x \le 512$
        $1 \le WG_y \le 512$
        $1 \le WG_z \le 64$
        Max limit is hardware dependent
    Range Barrier 2
        $WG_x \times WG_y \times WG_z \le 512$

Dimension Rules

Examples

    globalSize[0] = 1024, localSize[0] = 2       ✓

    globalSize[0] = 2048, localSize[0] = 1024    ✗ (1024 > 512)

    globalSize[0] = 1024, localSize[0] = 2
    globalSize[1] = 1024, localSize[1] = 2
    globalSize[2] = 32,   localSize[2] = 2       ✓

    globalSize[0] = 1024, localSize[0] = 256
    globalSize[1] = 1024, localSize[1] = 2
    globalSize[2] = 32,   localSize[2] = 1       ✓ (256 × 2 × 1 = 512)

    globalSize[0] = 1024, localSize[0] = 256
    globalSize[1] = 1024, localSize[1] = 2
    globalSize[2] = 32,   localSize[2] = 2       ✗ (256 × 2 × 2 = 1024 > 512)
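The dimension rules can be turned into a small host-side check. The sketch below is plain C and hard-codes the limits quoted in these notes (512, 512, 64 per dimension and 512 in total) as an assumption; on real hardware they would be queried with clGetDeviceInfo (CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE). The names `dim_ok` and `work_group_valid` are illustrative.

```c
#include <stddef.h>

/* Limits quoted in the notes; real values are device-specific. */
#define MAX_WG_X     512
#define MAX_WG_Y     512
#define MAX_WG_Z     64
#define MAX_WG_TOTAL 512

/* One dimension: size within range, not above N, and dividing N exactly. */
static int dim_ok(size_t n, size_t wg, size_t max_wg)
{
    return wg >= 1 && wg <= max_wg && wg <= n && n % wg == 0;
}

/* Returns 1 when the work-group sizes satisfy all rules, else 0.
 * Unused dimensions are passed as N = 1, WG = 1. */
int work_group_valid(size_t nx, size_t ny, size_t nz,
                     size_t wgx, size_t wgy, size_t wgz)
{
    return dim_ok(nx, wgx, MAX_WG_X)
        && dim_ok(ny, wgy, MAX_WG_Y)
        && dim_ok(nz, wgz, MAX_WG_Z)
        && wgx * wgy * wgz <= MAX_WG_TOTAL;
}
```

Running such a check before clEnqueueNDRangeKernel gives a clearer diagnostic than the error code returned by a rejected launch.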

Vector Relational Operators

Relational comparisons between vectors return a vectorized TRUE/FALSE (-1/0)

array-compare.c

    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        __global int4 *in = (__global int4*)a;
        in[x] = (float4)(1.0f,1.0f,1.0f,1.0f) ==
                (float4)(0.0f,1.0f,2.0f,3.0f);
    }

Output

    0 -1 0 0 0 -1 0 0

Scalar Relational Operators

Relational comparisons between scalars return TRUE/FALSE (1/0)

array-compare.c

    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        a[4*x+0] = 1.0f == 0.0f;
        a[4*x+1] = 1.0f == 1.0f;
        a[4*x+2] = 1.0f == 2.0f;
        a[4*x+3] = 1.0f == 3.0f;
    }

Output

    0 1 0 0 0 1 0 0
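The difference between the two outputs (1 for a true scalar comparison, -1 for a true vector component) can be mimicked in plain C. This is a sketch of the semantics only; `scalar_eq` and `vector_eq_component` are made-up names.

```c
/* Scalar comparison, in C as in OpenCL C: TRUE is 1. */
int scalar_eq(float a, float b)
{
    return a == b;
}

/* One component of an OpenCL C vector comparison:
 * TRUE is -1 (all bits set), FALSE is 0. */
int vector_eq_component(float a, float b)
{
    return (a == b) ? -1 : 0;
}
```

Code that tests a comparison result with `== 1` therefore behaves differently once its operands are changed from scalars to vectors.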

Vectors in If and While Structures

array-compare.c

    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        __global int4 *in = (__global int4*)a;
        float4 b = (float4)(1.0f,1.0f,1.0f,1.0f);
        float4 c = (float4)(0.0f,1.0f,2.0f,3.0f);
        if (b < c) in[x] = 2;
        else in[x] = 3;
    }

A vector comparison is not a single TRUE/FALSE value, so the conditions will not be visited as intended

Solution: any/all

    if (any(b < c)) in[x] = 2;
    else in[x] = 3;
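What `any` and `all` compute can be sketched in plain C: a component counts as TRUE when its most significant bit is set, which for a signed int means the value is negative. The helper names `msb_set`, `any4` and `all4` are illustrative, and the four components are passed as separate arguments to keep the sketch free of vector types.

```c
/* A component is TRUE when its most significant bit is set. */
static int msb_set(int v)
{
    return v < 0;
}

/* Emulates any(int4): 1 if at least one component is TRUE. */
int any4(int x, int y, int z, int w)
{
    return msb_set(x) || msb_set(y) || msb_set(z) || msb_set(w);
}

/* Emulates all(int4): 1 only if every component is TRUE. */
int all4(int x, int y, int z, int w)
{
    return msb_set(x) && msb_set(y) && msb_set(z) && msb_set(w);
}
```

For the vectors above, `b < c` is (0, 0, -1, -1), so `any` gives 1 and `all` gives 0.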

Vectors in Ternary Operators

array-compare.c

    __kernel void compare(__global int *a) {
        int x = get_global_id(0);
        __global int4 *in = (__global int4*)a;
        float4 b = (float4)(1.0f,1.0f,1.0f,1.0f);
        float4 c = (float4)(0.0f,1.0f,2.0f,3.0f);
        in[x] = b < c ? 1 : 2;
    }

Output

    2 2 1 1

Scalar Syntax: <relation> ? TRUE : FALSE

Vector Syntax: <relation> ? TRUE : FALSE, chosen per component; equivalent to select(FALSE, TRUE, <relation>), where the operand order is reversed
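The output 2 2 1 1 can be reproduced component by component with a plain-C emulation of `select(a, b, c)`, which returns `b` where the most significant bit of `c` is set and `a` otherwise. This is a sketch of the semantics, with `select_component` and `ternary_component` as illustrative names.

```c
/* One component of select(a, b, c): b if the MSB of c is set, else a. */
int select_component(int a, int b, int c)
{
    return (c < 0) ? b : a;
}

/* One component of the kernel expression (b < c ? 1 : 2). */
int ternary_component(float b, float c)
{
    int relation = (b < c) ? -1 : 0;   /* vector comparison: -1 or 0 */
    return select_component(2, 1, relation);
}
```

With b = 1 in every component and c = (0, 1, 2, 3), only the last two comparisons are true, which gives 2, 2, 1, 1.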

Floating Point Computation Background

Data Types and Range
IEEE-754 and OpenCL
    Supported: float
    Optional: double & half
        Optional?

IEEE 754 Number Modes
    1 Normal numbers
    2 Denormal numbers
    3 Infinite numbers
    4 NaNs

IEEE 754 Rounding Modes
    Round to nearest Even
    Round towards +∞
    Round towards −∞
    Round towards 0

Float

Performance optimized for floats
32-bit (1 sign, 8 exponent, 23 fraction)
Normal Range ≈ $1.18 \times 10^{-38} \ldots 3.4 \times 10^{38}$
Number Modes Supported
    Normal
    Infinite
    NaN
    Why not Denormal? Takes more cycles.
Rounding Modes Supported
    Round to Nearest Even
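The quoted range and field widths can be cross-checked on the host against the constants in `<float.h>`, assuming the host `float` is the same IEEE-754 32-bit format. The three wrapper functions are illustrative names only.

```c
#include <float.h>

/* Smallest positive normal float, 2^-126 (about 1.18e-38). */
float smallest_normal_float(void)
{
    return FLT_MIN;
}

/* Largest finite float (about 3.4e38). */
float largest_float(void)
{
    return FLT_MAX;
}

/* Significand bits including the implicit leading 1: 23 stored + 1. */
int float_significand_bits(void)
{
    return FLT_MANT_DIG;
}
```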

Double

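64-bit (1 sign, 11 exponent, 52 fraction)
Range ≈ $1 \times 10^{-323.3} \ldots 1 \times 10^{308.3}$ (the lower bound is the smallest denormal; the smallest normal is ≈ $2.2 \times 10^{-308}$)

Optional. To check (a result of 0 means double is not supported)

    cl_uint putHere;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE,
                    sizeof(putHere), &putHere, NULL);

To enable

    #pragma OPENCL EXTENSION cl_khr_fp64: enable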

Double

Double type supports: Normal, Denormal, Infinite, NaN
Why Denormal in Double and not in float?
    (x − y) / z
    For performance, disable support using -cl-denorms-are-zero in clBuildProgram
Rounding Modes Supported: All
FMA Support

    double a, b, c;
    a*b+c
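One common reading of the `(x − y)` hint is that denormals keep the difference of two distinct, nearby values from collapsing to zero. The plain-C sketch below illustrates this on the host, assuming IEEE-754 doubles with denormals enabled (the default on most hosts, but not with flush-to-zero or fast-math settings); `tiny_difference` is an illustrative name.

```c
#include <float.h>

/* Difference of two distinct normal doubles that is itself too small
 * to be a normal number. With denormal support the result is a tiny
 * non-zero value; with flush-to-zero it would be 0. */
double tiny_difference(void)
{
    volatile double x = 1.5 * DBL_MIN;  /* a normal number            */
    volatile double y = DBL_MIN;        /* the smallest normal number */
    return x - y;                       /* 0.5 * DBL_MIN, a denormal  */
}
```

Building a kernel with -cl-denorms-are-zero gives up exactly this guarantee in exchange for speed.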

Half

16-bit (1 sign, 5 exponent, 10 fraction)
Normal Range ≈ $6.1 \times 10^{-5} \ldots 65504$
Optional. To check: CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF
To enable: #pragma OPENCL EXTENSION cl_khr_fp16: enable
Number Modes supported (Normal, Infinite, NaN)
Rounding Modes supported: Either
    ±∞, OR
    Nearest Even
FMA Support: No

Default Modes

Operation Type                    Rounding Mode
Floating Point Arithmetic         RTE
Builtin functions (e.g., math)    RTE
cast float→int                    RTZ
cast int→float                    RTE

Table : Default OpenCL Rounding Modes
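The two cast rows can be observed with ordinary C casts on the host, which behave the same way on IEEE-754 platforms running in the default rounding mode (an assumption of this sketch). The function names are illustrative.

```c
/* float -> int: the fractional part is dropped (round toward zero, RTZ). */
int float_to_int(float v)
{
    return (int)v;
}

/* int -> float: values that do not fit in 24 significand bits are
 * rounded to the nearest float, ties going to the even one (RTE). */
float int_to_float(int v)
{
    return (float)v;
}
```

For example, 2.7f and -2.7f become 2 and -2, while 16777217 (2^24 + 1) lies exactly between two floats and becomes 16777216.0f.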

Summary

Parameter       Float    Double    Half
Normal          ✓        ✓         ✓
Denormalized    ✗        ✓         ✗
Infinity/NaN    ✓        ✓         ✓
Round Even      ✓        ✓         either*
Round ±∞        ✗        ✓         either*
Round Zero      ✗        ✓         ✗
FMA             ✗        ✓         ✗

* Half supports either Round ±∞ or Round Even

Table : IEEE-754 and OpenCL Compliance
