Introduction to OpenCL
Cliff Woolley, NVIDIA, Developer Technology Group
© Copyright Khronos Group, 2010
It’s a Heterogeneous World
• A modern platform includes:
– One or more CPUs
– One or more GPUs
– Optional accelerators (e.g., DSPs)
[Figure: heterogeneous platform block diagram: CPUs, GPU, DRAM, GMCH (graphics memory control hub), ICH (input/output control hub)]
GPGPU Revolutionizes Computing
Latency Processor + Throughput Processor
Low Latency or High Throughput?
• CPU: Optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution
• GPU: Optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation
OpenCL™ – Open Computing Language
• Open, royalty-free standard C-language extension
• For parallel programming of heterogeneous systems using GPUs, CPUs, CBE, DSPs, and other processors, including embedded mobile devices
• Initially proposed by Apple, which put OpenCL in OS X Snow Leopard and is active in the working group. The working group includes NVIDIA, Intel, AMD, IBM, …
• Managed by the Khronos Group (the same group that manages the OpenGL standard)
– Note: The OpenCL working group chair is NVIDIA VP Neil Trevett, who is also President of the Khronos Group
OpenCL is a trademark of Apple Inc., used under license to the Khronos Group Inc.
Welcome to the OpenCL Tutorial!
• OpenCL Platform Model
• OpenCL Execution Model
• Mapping the Execution Model onto the Platform Model
• Introduction to OpenCL Programming
• Additional Information and Resources
Design Goals of OpenCL
• Use all computational resources in the system
– CPUs, GPUs, and other processors as peers
• Efficient parallel programming model
– Based on C99
– Data- and task-parallel computational model
– Abstract the specifics of underlying hardware
– Specify accuracy of floating-point computations
• Desktop and Handheld Profiles
OPENCL PLATFORM MODEL
GPU Architecture: Two Main Components
• Global memory
– Analogous to RAM in a CPU server
– Accessible by both GPU and CPU
– Currently up to 6 GB
– Bandwidth currently up to 150 GB/s for Quadro and Tesla products
– ECC on/off option for Quadro and Tesla products
• Streaming Multiprocessors (SMs)
– Perform the actual computations
– Each SM has its own control units, registers, execution pipelines, and caches
[Figure: Fermi GPU chip layout: host interface, GigaThread engine, multiple DRAM interfaces, L2 cache, and the array of SMs]
GPU Architecture – Fermi: Streaming Multiprocessor (SM)
• 32 CUDA Cores per SM
– 32 fp32 ops/clock
– 16 fp64 ops/clock
– 32 int32 ops/clock
• 2 warp schedulers
– Up to 1536 threads concurrently
• 4 special-function units
• 64 KB shared memory + L1 cache
• 32K 32-bit registers
[Figure: Fermi SM: instruction cache, two warp schedulers with dispatch units, register file, 32 cores, 16 load/store units, 4 special-function units, interconnect network, 64 KB configurable shared memory / L1 cache, uniform cache]
GPU Architecture – Fermi: CUDA Core
• Floating-point & integer unit
– IEEE 754-2008 floating-point standard
– Fused multiply-add (FMA) instruction for both single and double precision
• Logic unit
• Move, compare unit
• Branch unit
[Figure: CUDA core: dispatch port, operand collector, FP unit, INT unit, result queue]
OpenCL Platform Model
[Figure: OpenCL platform model: a Host connected to one or more Compute Devices; each Compute Device is divided into Compute Units, and each Compute Unit into Processing Elements, the computational resources]
OpenCL Platform Model on the CUDA Compute Architecture
[Figure: the same platform model mapped onto CUDA hardware]
• Host = CPU
• Compute Device = CUDA-enabled GPU
• Compute Unit = CUDA Streaming Multiprocessor
• Processing Element = CUDA Streaming Processor
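This mapping can also be observed from the host side by querying how many compute units a device exposes. A minimal sketch, assuming a cl_device_id named device has already been obtained via clGetDeviceIDs (shown later in the API highlights); all host-side sketches in this tutorial assume #include <CL/cl.h>:

cl_uint num_cus;
/* On a CUDA-enabled GPU, CL_DEVICE_MAX_COMPUTE_UNITS reports the
   number of Streaming Multiprocessors. */
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(cl_uint), &num_cus, NULL);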
Anatomy of an OpenCL Application
• Host Code
– Written in C/C++
– Executes on the host
• Device Code
– Written in OpenCL C
– Executes on the compute devices
• Host code sends commands to the devices:
– to transfer data between host memory and device memories
– to execute device code
Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCIe bus)
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory (over the PCIe bus)
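In host code, these three steps correspond to buffer writes, a kernel launch, and buffer reads. A minimal sketch, assuming a context, queue, kernel, and host arrays h_in/h_out of n floats already exist (these names are illustrative assumptions; error handling omitted):

cl_int err;
cl_mem d_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
cl_mem d_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

/* 1. Copy input data from CPU memory to GPU memory */
clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, n * sizeof(float), h_in, 0, NULL, NULL);

/* 2. Load GPU program and execute */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
size_t global = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

/* 3. Copy results from GPU memory back to CPU memory */
clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), h_out, 0, NULL, NULL);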
Anatomy of an OpenCL Application
• Serial code executes in a Host (CPU) thread
• Parallel code executes in many Device (GPU) threads across multiple processing elements
[Figure: an OpenCL application alternating between serial code on the Host (CPU) and parallel code on the Device (GPU)]
OpenCL Language & API Highlights
• Platform Layer API (called from host)
– Abstraction layer for diverse computational resources
– Query, select, and initialize compute devices
– Create compute contexts and work-queues
• Runtime API (called from host)
– Launch compute kernels
– Set kernel execution configuration
– Manage scheduling, compute, and memory resources
• OpenCL Language
– Write compute kernels that run on a compute device
– C-based cross-platform programming interface
– Subset of ISO C99 with language extensions
– Includes a rich set of built-in functions, in addition to standard C operators
– Can be compiled JIT/online or offline
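Putting the platform-layer and compilation steps together, a minimal host-side sketch (error checks omitted; src is assumed to be a C string holding OpenCL C source, such as the dp_mul kernel shown in the next section):

#include <CL/cl.h>

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);                           /* query platforms */

cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); /* select a GPU */

cl_int err;
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

/* JIT/online compilation of device code */
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "dp_mul", &err);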
OPENCL EXECUTION MODEL
Decompose task into work-items
• Define an N-dimensional computation domain
• Execute a kernel at each point in the computation domain
Traditional loop as a function in C:

void trad_mul(int n,
              const float *a,
              const float *b,
              float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
OpenCL C kernel:

__kernel void dp_mul(__global const float *a,
                     __global const float *b,
                     __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}   // execute over n "work-items"
Kernel Execution Configuration
• Host program launches the kernel in an index space called an NDRange
– An NDRange ("N-Dimensional Range") is a multitude of kernel instances arranged into 1, 2, or 3 dimensions
• A single kernel instance in the index space is called a Work-Item
– Each Work-Item executes the same compute kernel (on different data)
– Work-Items have unique global IDs from the index space
• Work-Items are further grouped into Work-Groups
– Work-Groups have a unique Work-Group ID
– Work-Items have a unique local ID within a Work-Group
• Analogous to a C loop that calls a function many times, except that all iterations are called simultaneously and executed in parallel (see the kernel sketch below)
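A hypothetical kernel (not from the slides) that makes the relationship between these three IDs concrete:

__kernel void show_ids(__global int *out)
{
    int gid   = get_global_id(0);  // unique ID across the whole NDRange
    int group = get_group_id(0);   // which Work-Group this Work-Item belongs to
    int lid   = get_local_id(0);   // local ID within the Work-Group

    // The global ID is recoverable from the group and local IDs:
    out[gid] = group * (int)get_local_size(0) + lid;   // equals gid (with no global offset)
}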
An N-dimensional domain of work-items
• Define the "best" N-dimensional index space for your algorithm
• Kernels are executed across a global domain of work-items
• Work-items are grouped into local work-groups
– Global dimensions: 1024 x 1024 (whole problem space)
– Local dimensions: 32 x 32 (a work-group, which executes together; see the host-side sketch below)
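On the host, those dimensions are passed to the kernel launch. A sketch, assuming the queue and kernel objects from earlier:

size_t global[2] = { 1024, 1024 };  /* whole problem space */
size_t local[2]  = { 32, 32 };      /* one work-group of 32 x 32 work-items */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);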
OpenCL Execution Model
• The application runs on a Host, which submits work to the Devices
• Work-Item: the basic unit of work on an OpenCL device
• Kernel: the code for a work-item (basically a C function)
• Program: a collection of kernels and other functions (analogous to a dynamic library)
OpenCL Execution Model
• Context: the environment within which work-items execute; includes devices, their memories, and command queues
• Command Queue: a queue used by the Host application to submit work to a Device (e.g., kernel execution instances)
– Work is queued in order, one queue per device
– Work can be executed in order or out of order
[Figure: a Context containing two Devices, each with its own command Queue]
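The in-order vs. out-of-order choice is made when the queue is created. A sketch, assuming the context and device objects from earlier and a cl_int err:

/* Default: commands execute in the order they are enqueued */
cl_command_queue q_in_order = clCreateCommandQueue(context, device, 0, &err);

/* Optionally let the device reorder independent commands */
cl_command_queue q_out_of_order =
    clCreateCommandQueue(context, device,
                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);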
MAPPING THE EXECUTION MODEL ONTO THE PLATFORM MODEL
Kernel Execution on Platform Model
• Each kernel execution instance (a CUDA kernel grid) is executed on a compute device (a CUDA-enabled GPU)
• Each work-group (a CUDA thread block) is executed on a compute unit (a CUDA Streaming Multiprocessor)
– Several concurrent work-groups can reside on one compute unit, depending on each work-group's memory requirements and the compute unit's memory resources
• Each work-item (a CUDA thread) is executed by a compute element (a CUDA core)
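Because residency depends on per-work-group resources, the host can query a safe upper bound on work-group size for a particular kernel/device pair before choosing local dimensions. A sketch, assuming the kernel and device objects from earlier:

size_t max_wg_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(size_t), &max_wg_size, NULL);
/* The local dimensions passed to clEnqueueNDRangeKernel must not
   exceed this many work-items in total. */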
OpenCL Memory Model
• Private Memory
– Per work-item
• Local Memory
– Shared within a work-group
• Global/Constant Memory
– Visible to all work-groups
• Host Memory
– On the CPU
• Memory management is explicit
– You must move data from host -> global -> local … and back
[Figure: OpenCL memory hierarchy: each work-item has its own private memory; each work-group shares local memory; all work-groups on a compute device access global/constant memory; host memory resides on the CPU]
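A hypothetical kernel (not from the slides) that touches all three device-side spaces: each work-item stages a value from global memory through a private variable into local memory, and one work-item per group writes a per-group sum back to global memory:

__kernel void group_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float x = in[gid];             // private: one copy per work-item
    scratch[lid] = x;              // local: shared within the work-group
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until all work-items have written

    if (lid == 0) {                // one work-item reduces the group's data
        float sum = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;   // global: visible to all groups and the host
    }
}

On the host, the __local buffer is sized (not backed by a pointer) with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL).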