introduction to opencl - uniroma1.ittwiki.di.uniroma1.it/pub/psmc/webhome/lecture17_intro_to... ·...

Introduction to OpenCL

Cliff Woolley, NVIDIADeveloper Technology Group

© Copyright Khronos Group, 2010

It’s a Heterogeneous World

� A modern platform includes:– One or more CPUs

– One or more GPUs

– Optional accelerators (e.g., DSPs)

GMCH = graphics memory control hubICH = Input/output control hub

GMCHGPU

ICH

CPU

DRAM

CPU

GPUCPU

GPGPU Revolutionizes ComputingLatency Processor + Throughput processor

Low Latency or High Throughput?

CPUOptimized for low-latency access to cached data setsControl logic for out-of-order and speculative execution

GPUOptimized for data-parallel, throughput computationArchitecture tolerant of memory latencyMore transistors dedicated to computation

OpenCLTM – Open Computing Language

Open, royalty-free standard C-language extension

For parallel programming of heterogeneous systems using GPUs, CPUs, CBE, DSP’s and other processors including embedded mobile devices

Initially proposed by Apple, who put OpenCL in OSX Snow Leopard and is

© NVIDIA Corporation 2009

Initially proposed by Apple, who put OpenCL in OSX Snow Leopard and is active in the working group. Working group includes NVIDIA, Intel, AMD, IBM…

Managed by Khronos Group

(same group that manages the OpenGL std)

Note: The OpenCL working group chair is NVIDIA VP Neil Trevett, who is also President of Khronos Group

OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.

Welcome to the OpenCL Tutorial!

� OpenCL Platform Model

� OpenCL Execution Model

� Mapping the Execution Model onto the Platform Model

� Introduction to OpenCL Programming

� Additional Information and Resources

OpenCL is a trademark of Apple, Inc.


Design Goals of OpenCL

� Use all computational resources in the system— CPUs, GPUs and other processors as peers

� Efficient parallel programming model— Based on C99

— Data- and task- parallel computational model

— Abstract the specifics of underlying hardware

— Specify accuracy of floating-point computations

� Desktop and Handheld Profiles

OPENCL PLATFORM MODEL

GPU Architecture:Two Main Components

Global memoryAnalogous to RAM in a CPU serverAccessible by both GPU and CPUCurrently up to 6 GBBandwidth currently up to 150 GB/s for Quadro and Tesla productsECC on/off option for Quadro and Tesla products

Streaming Multiprocessors (SMs)Perform the actual computationsEach SM has its own:

Control units, registers, execution pipelines, caches

DRAM

I/F

Gig

a Th

read

HOST

I/F

DRAM

I/F

DRAM I/F

DRAM I/F

DRAM I/F

DRAM I/F

L2

GPU Architecture – Fermi:Streaming Multiprocessor (SM)

32 CUDA Cores per SM32 fp32 ops/clock16 fp64 ops/clock32 int32 ops/clock

2 warp schedulersUp to 1536 threads concurrently

4 special-function units64KB shared mem + L1 cache32K 32-bit registers

Register File

Scheduler

Dispatch

Scheduler

Dispatch

Load/Store Units x 16Special Func Units x 4

Interconnect Network

64K ConfigurableCache/Shared Mem

Uniform Cache

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Instruction Cache

GPU Architecture – Fermi:CUDA Core

Floating point & Integer unitIEEE 754-2008 floating-point standardFused multiply-add (FMA) instruction for both single and double precision

Logic unitMove, compare unitBranch unit

Register File

Scheduler

Dispatch

Scheduler

Dispatch

Load/Store Units x 16Special Func Units x 4

Interconnect Network

64K ConfigurableCache/Shared Mem

Uniform Cache

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Core

Instruction Cache

CUDA CoreDispatch Port

Operand Collector

Result Queue

FP Unit INT Unit

OpenCL Platform Model

Host

Compute UnitCompute Device

………………………………

Processing Element

Computational Resources

OpenCL Platform Modelon CUDA Compute Architecture

Host

Compute UnitCompute Device

………………………………

Processing Element

CPU

CUDA-Enabled GPU

CUDA Streaming

Multiprocessor

CUDAStreaming Processor

Anatomy of an OpenCL Application

Compute Devices

OpenCL Application

Host Code• Written in C/C++• Executes on the host

Device Code• Written in OpenCL C• Executes on the device

Host

………………………………

Host code sends commands to the Devices:… to transfer data between host memory and device memories… to execute device code

Processing Flow

1. Copy input data from CPU memory to GPU memory

PCIe Bus

Processing Flow


2. Load GPU program and execute,caching data on chip for performance

PCIe Bus

Processing Flow


2. Load GPU program and execute,caching data on chip for performance

3. Copy results from GPU memory to CPU memory

PCIe Bus

Anatomy of an OpenCL Application� Serial code executes in a Host (CPU) thread

� Parallel code executes in many Device (GPU) threadsacross multiple processing elements

OpenCL Application

Serial code

Serial code

Parallel code

Parallel code

Device = GPU

…

Host = CPU

Device = GPU

...

Host = CPU

OpenCL Language & API Highlights

Platform Layer API (called from host)Abstraction layer for diverse computational resources

Query, select and initialize compute devices

Create compute contexts and work-queues

Runtime API (called from host) Launch compute kernels


Launch compute kernels

Set kernel execution configuration

Manage scheduling, compute, and memory resources

OpenCL LanguageWrite compute kernels that run on a compute device

C-based cross-platform programming interface

Subset of ISO C99 with language extensions

Includes rich set of built-in functions, in addition to standard C operators

Can be compiled JIT/Online or offline

OPENCL EXECUTION MODEL


Decompose task into work-items

� Define N-dimensional computation domain

� Execute a kernel at each point in computation domain

voidtrad_mul(int n,

const float *a, const float *b, float *c)

{int i;for (i=0; i<n; i++)c[i] = a[i] * b[i];

}

Traditional loop as a function in C

__kernel voiddp_mul(__global const float *a,

__global const float *b, __global float *c)

{int id = get_global_id(0);

c[id] = a[id] * b[id];} // execute over n “work items”

OpenCL C kernel

Kernel Execution Configuration

Host program launches kernel in index space called NDRangeNDRange (“N-Dimensional Range”) is a multitude of kernel instances

arranged into 1, 2 or 3 dimensions

A single kernel instance in the index space is called a Work ItemEach Work Item executes same compute kernel (on different data)

Work Items have unique global IDs from the index space


Work-items are further grouped into Work GroupsWork Groups have a unique Work Group ID

Work Items have a unique Local ID within a Work Group

~ Analagous to a C loop that calls a function many timesExcept all iterations are called simultaneously & executed in parallel


An N-dimension domain of work-items

Define the “best” N-dimensioned index space for your algorithm• Kernels are executed across a

global domain of work-items

• Work-items are grouped into local work-groups

– Global Dimensions: 1024 x 1024 (whole problem space)

– Local Dimensions: 32 x 32(work-group … executes together)

1024

1024


OpenCL Execution Model

The application runs on a Host which submits work to the Devices

� Work-item: the basic unit of work on an OpenCL device

� Kernel: the code for a work-item (basically a C function)

� Program: Collection of kernels and other functions (analogous to a dynamic library)


OpenCL Execution Model

• Context: The environment within which work-items execute; includes devices and their memories and command queues

• Command Queue: A queue used by the Host application to submit work to a Device (e.g., kernel execution instances)

– Work is queued in-order, one queue per device

– Work can be executed in-order or out-of-order

Context

Device Device

The application runs on a Host which submits work to the Devices

Queue Queue

MAPPING THE EXECUTION MODELONTO THE PLATFORM MODEL

Kernel Execution on Platform Model

• Each kernel is executed on a compute device

………

Compute device(CUDA-enabled GPU)

Work-Item(CUDA thread) • Each work-item is executed

by a compute element

Compute element(CUDA core)

Work-Group(CUDA thread block)

• Each work-group is executed on a compute unit

• Several concurrent work-groups can reside on one compute unit depending on work-group’s memory requirements and compute unit’s memory resources

…

Compute unit(CUDA Streaming Multiprocessor)

Kernel execution instance(CUDA kernel grid)

...


OpenCL Memory Model

• Private Memory–Per work-item

• Local Memory–Shared within a workgroup

• Global/Constant Memory–Visible to all workgroups

• Host Memory–On the CPU

Memory management is ExplicitYou must move data from host -> global -> local … and back

Workgroup

Work-Item

Compute Device

Work-Item

Workgroup

Host

Private Memory

Private Memory

Local MemoryLocal Memory

Global/Constant Memory

Host Memory

Work-ItemWork-Item

Private Memory

Private Memory

introduction to opencl - uniroma1.ittwiki.di.uniroma1.it/pub/psmc/webhome/lecture17_intro_to... ·...

Documents