introduction to opencl - uniroma1.ittwiki.di.uniroma1.it/pub/psmc/webhome/lecture17_intro_to... ·...
TRANSCRIPT
![Page 1: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/1.jpg)
Introduction to OpenCL
Cliff Woolley, NVIDIADeveloper Technology Group
![Page 2: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/2.jpg)
© Copyright Khronos Group, 2010
It’s a Heterogeneous World
� A modern platform includes:– One or more CPUs
– One or more GPUs
– Optional accelerators (e.g., DSPs)
GMCH = graphics memory control hubICH = Input/output control hub
GMCHGPU
ICH
CPU
DRAM
CPU
![Page 3: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/3.jpg)
GPUCPU
GPGPU Revolutionizes ComputingLatency Processor + Throughput processor
![Page 4: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/4.jpg)
Low Latency or High Throughput?
CPUOptimized for low-latency access to cached data setsControl logic for out-of-order and speculative execution
GPUOptimized for data-parallel, throughput computationArchitecture tolerant of memory latencyMore transistors dedicated to computation
![Page 5: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/5.jpg)
OpenCLTM – Open Computing Language
Open, royalty-free standard C-language extension
For parallel programming of heterogeneous systems using GPUs, CPUs, CBE, DSP’s and other processors including embedded mobile devices
Initially proposed by Apple, who put OpenCL in OSX Snow Leopard and is
© NVIDIA Corporation 2009
Initially proposed by Apple, who put OpenCL in OSX Snow Leopard and is active in the working group. Working group includes NVIDIA, Intel, AMD, IBM…
Managed by Khronos Group
(same group that manages the OpenGL std)
Note: The OpenCL working group chair is NVIDIA VP Neil Trevett, who is also President of Khronos Group
OpenCL is trademark of Apple Inc. used under license to the Khronos Group Inc.
![Page 6: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/6.jpg)
Welcome to the OpenCL Tutorial!
� OpenCL Platform Model
� OpenCL Execution Model
� Mapping the Execution Model onto the Platform Model
� Introduction to OpenCL Programming
� Additional Information and Resources
OpenCL is a trademark of Apple, Inc.
![Page 7: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/7.jpg)
© Copyright Khronos Group, 2010
Design Goals of OpenCL
� Use all computational resources in the system— CPUs, GPUs and other processors as peers
� Efficient parallel programming model— Based on C99
— Data- and task- parallel computational model
— Abstract the specifics of underlying hardware
— Specify accuracy of floating-point computations
� Desktop and Handheld Profiles
![Page 8: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/8.jpg)
OPENCL PLATFORM MODEL
![Page 9: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/9.jpg)
GPU Architecture:Two Main Components
Global memoryAnalogous to RAM in a CPU serverAccessible by both GPU and CPUCurrently up to 6 GBBandwidth currently up to 150 GB/s for Quadro and Tesla productsECC on/off option for Quadro and Tesla products
Streaming Multiprocessors (SMs)Perform the actual computationsEach SM has its own:
Control units, registers, execution pipelines, caches
DRAM
I/F
Gig
a Th
read
HOST
I/F
DRAM
I/F
DRAM I/F
DRAM I/F
DRAM I/F
DRAM I/F
L2
![Page 10: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/10.jpg)
GPU Architecture – Fermi:Streaming Multiprocessor (SM)
32 CUDA Cores per SM32 fp32 ops/clock16 fp64 ops/clock32 int32 ops/clock
2 warp schedulersUp to 1536 threads concurrently
4 special-function units64KB shared mem + L1 cache32K 32-bit registers
Register File
Scheduler
Dispatch
Scheduler
Dispatch
Load/Store Units x 16Special Func Units x 4
Interconnect Network
64K ConfigurableCache/Shared Mem
Uniform Cache
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Instruction Cache
![Page 11: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/11.jpg)
GPU Architecture – Fermi:CUDA Core
Floating point & Integer unitIEEE 754-2008 floating-point standardFused multiply-add (FMA) instruction for both single and double precision
Logic unitMove, compare unitBranch unit
Register File
Scheduler
Dispatch
Scheduler
Dispatch
Load/Store Units x 16Special Func Units x 4
Interconnect Network
64K ConfigurableCache/Shared Mem
Uniform Cache
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Instruction Cache
CUDA CoreDispatch Port
Operand Collector
Result Queue
FP Unit INT Unit
![Page 12: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/12.jpg)
OpenCL Platform Model
Host
Compute UnitCompute Device
………………………………
Processing Element
Computational Resources
![Page 13: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/13.jpg)
OpenCL Platform Modelon CUDA Compute Architecture
Host
Compute UnitCompute Device
………………………………
Processing Element
CPU
CUDA-Enabled GPU
CUDA Streaming
Multiprocessor
CUDAStreaming Processor
![Page 14: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/14.jpg)
Anatomy of an OpenCL Application
Compute Devices
OpenCL Application
Host Code• Written in C/C++• Executes on the host
Device Code• Written in OpenCL C• Executes on the device
Host
………………………………
Host code sends commands to the Devices:… to transfer data between host memory and device memories… to execute device code
![Page 15: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/15.jpg)
Processing Flow
1. Copy input data from CPU memory to GPU memory
PCIe Bus
![Page 16: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/16.jpg)
Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,caching data on chip for performance
PCIe Bus
![Page 17: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/17.jpg)
Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,caching data on chip for performance
3. Copy results from GPU memory to CPU memory
PCIe Bus
![Page 18: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/18.jpg)
Anatomy of an OpenCL Application� Serial code executes in a Host (CPU) thread
� Parallel code executes in many Device (GPU) threadsacross multiple processing elements
OpenCL Application
Serial code
Serial code
Parallel code
Parallel code
Device = GPU
…
Host = CPU
Device = GPU
...
Host = CPU
![Page 19: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/19.jpg)
OpenCL Language & API Highlights
Platform Layer API (called from host)Abstraction layer for diverse computational resources
Query, select and initialize compute devices
Create compute contexts and work-queues
Runtime API (called from host) Launch compute kernels
© NVIDIA Corporation 2009
Launch compute kernels
Set kernel execution configuration
Manage scheduling, compute, and memory resources
OpenCL LanguageWrite compute kernels that run on a compute device
C-based cross-platform programming interface
Subset of ISO C99 with language extensions
Includes rich set of built-in functions, in addition to standard C operators
Can be compiled JIT/Online or offline
![Page 20: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/20.jpg)
OPENCL EXECUTION MODEL
![Page 21: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/21.jpg)
© Copyright Khronos Group, 2010
Decompose task into work-items
� Define N-dimensional computation domain
� Execute a kernel at each point in computation domain
voidtrad_mul(int n,
const float *a, const float *b, float *c)
{int i;for (i=0; i<n; i++)c[i] = a[i] * b[i];
}
Traditional loop as a function in C
__kernel voiddp_mul(__global const float *a,
__global const float *b, __global float *c)
{int id = get_global_id(0);
c[id] = a[id] * b[id];} // execute over n “work items”
OpenCL C kernel
![Page 22: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/22.jpg)
Kernel Execution Configuration
Host program launches kernel in index space called NDRangeNDRange (“N-Dimensional Range”) is a multitude of kernel instances
arranged into 1, 2 or 3 dimensions
A single kernel instance in the index space is called a Work ItemEach Work Item executes same compute kernel (on different data)
Work Items have unique global IDs from the index space
© NVIDIA Corporation 2009
Work-items are further grouped into Work GroupsWork Groups have a unique Work Group ID
Work Items have a unique Local ID within a Work Group
~ Analagous to a C loop that calls a function many timesExcept all iterations are called simultaneously & executed in parallel
![Page 23: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/23.jpg)
© Copyright Khronos Group, 2010
An N-dimension domain of work-items
Define the “best” N-dimensioned index space for your algorithm• Kernels are executed across a
global domain of work-items
• Work-items are grouped into local work-groups
– Global Dimensions: 1024 x 1024 (whole problem space)
– Local Dimensions: 32 x 32(work-group … executes together)
1024
1024
![Page 24: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/24.jpg)
© Copyright Khronos Group, 2010
OpenCL Execution Model
The application runs on a Host which submits work to the Devices
� Work-item: the basic unit of work on an OpenCL device
� Kernel: the code for a work-item (basically a C function)
� Program: Collection of kernels and other functions (analogous to a dynamic library)
![Page 25: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/25.jpg)
© Copyright Khronos Group, 2010
OpenCL Execution Model
• Context: The environment within which work-items execute; includes devices and their memories and command queues
• Command Queue: A queue used by the Host application to submit work to a Device (e.g., kernel execution instances)
– Work is queued in-order, one queue per device
– Work can be executed in-order or out-of-order
Context
Device Device
The application runs on a Host which submits work to the Devices
Queue Queue
![Page 26: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/26.jpg)
MAPPING THE EXECUTION MODELONTO THE PLATFORM MODEL
![Page 27: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/27.jpg)
Kernel Execution on Platform Model
• Each kernel is executed on a compute device
………
Compute device(CUDA-enabled GPU)
Work-Item(CUDA thread) • Each work-item is executed
by a compute element
Compute element(CUDA core)
Work-Group(CUDA thread block)
• Each work-group is executed on a compute unit
• Several concurrent work-groups can reside on one compute unit depending on work-group’s memory requirements and compute unit’s memory resources
…
Compute unit(CUDA Streaming Multiprocessor)
Kernel execution instance(CUDA kernel grid)
...
![Page 28: Introduction to OpenCL - uniroma1.ittwiki.di.uniroma1.it/pub/PSMC/WebHome/Lecture17_intro_to... · 2015. 11. 24. · GPU Architecture –Fermi: Streaming Multiprocessor (SM) 32 CUDA](https://reader036.vdocuments.mx/reader036/viewer/2022062402/5fc2e1581490c1451c035674/html5/thumbnails/28.jpg)
© Copyright Khronos Group, 2010
OpenCL Memory Model
• Private Memory–Per work-item
• Local Memory–Shared within a workgroup
• Global/Constant Memory–Visible to all workgroups
• Host Memory–On the CPU
Memory management is ExplicitYou must move data from host -> global -> local … and back
Workgroup
Work-Item
Compute Device
Work-Item
Workgroup
Host
Private Memory
Private Memory
Local MemoryLocal Memory
Global/Constant Memory
Host Memory
Work-ItemWork-Item
Private Memory
Private Memory