© Copyright Khronos Group, 2013 - Page 1
OpenCL on Intel® Iris™ Graphics
SIGGRAPH 2013 July 2013
Presenter: Adam Lake
Content: Ben Ashbaugh, Arnon Peleg, others
© Copyright Khronos Group, 2013 - Page 2
Agenda
•Intel® Iris™ Graphics Architecture Overview
•Tools to help get the most from OpenCL on Intel CPU
and GPU
•Additional Resources
2
© Copyright Khronos Group, 2013 - Page 3
Intel® Iris™ Graphics Architecture Overview
3
Rin
g B
us
/ LL
C /
Mem
ory
CommandStreamer
(CS)
Vertex Fetch(VF)
Vertex Shader
(VS)
Hull Shader(HS)
Tessellator
Domain Shader
(DS)
Geometry Shader
(GS)
Stream Out(SOL)
Clip/Setup
Thread D
ispatch
Video Front End
(VFE)
Video QualityEngine
Multi-FormatCODEC
Blitter Display
Slice 0
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 1
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 0
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Slice 1
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 3
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 2
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
© Copyright Khronos Group, 2013 - Page 4
Intel® Iris™ Graphics Architecture Overview
4
Rin
g B
us
/ LL
C /
Mem
ory
CommandStreamer
(CS)
Vertex Fetch(VF)
Vertex Shader
(VS)
Hull Shader(HS)
Tessellator
Domain Shader
(DS)
Geometry Shader
(GS)
Stream Out(SOL)
Clip/Setup
Thread D
ispatch
Video Front End
(VFE)
Video QualityEngine
Multi-FormatCODEC
Blitter Display
Slice 0
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 1
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 0
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Slice 1
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 3
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 2
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Global Assets • Command Streamer • Thread Dispatch
© Copyright Khronos Group, 2013 - Page 5
Intel® Iris™ Graphics Architecture Overview
5
Rin
g B
us
/ LL
C /
Mem
ory
CommandStreamer
(CS)
Vertex Fetch(VF)
Vertex Shader
(VS)
Hull Shader(HS)
Tessellator
Domain Shader
(DS)
Geometry Shader
(GS)
Stream Out(SOL)
Clip/Setup
Thread D
ispatch
Video Front End
(VFE)
Video QualityEngine
Multi-FormatCODEC
Blitter Display
Slice 0
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 1
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 0
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Slice 1
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 3
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 2
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Sub Slice • Execution Units • Samplers and Data Port • Instruction and Texture Caches
© Copyright Khronos Group, 2013 - Page 6
Intel® Iris™ Graphics Architecture Overview
6
Rin
g B
us
/ LL
C /
Mem
ory
CommandStreamer
(CS)
Vertex Fetch(VF)
Vertex Shader
(VS)
Hull Shader(HS)
Tessellator
Domain Shader
(DS)
Geometry Shader
(GS)
Stream Out(SOL)
Clip/Setup
Thread D
ispatch
Video Front End
(VFE)
Video QualityEngine
Multi-FormatCODEC
Blitter Display
Slice 0
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 1
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 0
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Slice 1
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 3
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e 2
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Slice Common • L3 Cache • Shared Local Memory • Barriers
© Copyright Khronos Group, 2013 - Page 7
Intel® Iris™ Graphics Architecture Building Blocks
• OpenCL* Kernels run on an Execution Unit (EU)
• Each EU is a Multi-Threaded SIMD Processor
• Up to 7 threads per EU
- 128 x 8 x 32-bit registers per thread
• Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled)
- “SIMD8”, “SIMD16”, “SIMD32”
- SIMD8 More Registers
- SIMD16 and SIMD32 Better Efficiency
7
EUThread 0
Thread 2
Thread 4
Thread 6
Thread 1`
Thread 3
Thread 5
© Copyright Khronos Group, 2013 - Page 8
Intel® Iris™ Graphics Architecture Building Blocks
• OpenCL* Work Groups run on a
Sub Slice
- 10 EUs per Sub Slice
- Texture Sampler (Images)
- Data Port (Buffers)
- Instruction and Texture Caches
8
OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!
Sub
Slic
e
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
EU EU EU EU EU
EUEUEUEUEU
© Copyright Khronos Group, 2013 - Page 9
Slice
L3$ Pixel OpsRasterizer /
DepthRender$Depth$
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
Sub
Slic
e
EU EU EU EU EU
EUEUEUEUEU
EU EU EU EU EU
EUEUEUEUEU
Intel® Iris™ Graphics Architecture Building Blocks
• Two Sub Slices make a Slice
• Shared Resources: “Slice Common”
- L3 Cache + Shared Local Memory
- Barriers
• Intel® Iris™ Graphics has Two Slices
• How many work items in flight at once?
- 2 slides each with 2 subslices = 4 Sub Slices
- 4 subslices x 10 EUs/subslice = 40 EUs
- Up to 40 EUs x 7 threads/EU = 280 EU threads
- Up to 8960 OpenCL* work items in flight!
9
© Copyright Khronos Group, 2013 - Page 10
EU
Sampler L1
L3
Sampler L2
LLC DRAM
images
buffers
256KB/slice 2-8MB/package
(shared w/ CPU)
EDRAM (non-inclusive victim cache)
128MB/package
(Intel® Iris™ Pro 5200)
Intel® Iris™ Graphics Cache Hierarchy
10
© Copyright Khronos Group, 2013 - Page 11
Tools and Additional Resources
© Copyright Khronos Group, 2013 - Page 12
Occupancy – Intel® VTune™ Amplifier XE 2013
12
© Copyright Khronos Group, 2013 - Page 13
Additional Resources • Intel® SDK for OpenCL* Applications 2013
- Offline kernel builder, Debugger for CPU in MSVC
• Intel® OpenCL* Optimization Guide
• Intel® VTune™ Amplifier XE 2013
• Intel® Graphics Performance Analyzers - OpenCL Tuning
• Intel Linux Graphics hardware Bspec: https://01.org/linuxgraphics/
• SIGGRAPH 2013: - Faster, Better Pixels on the Go and in the Cloud with OpenCL* on Intel® Architecture
- Arnon Peleg - Optimizing OpenCL* Applications for Intel® Iris™ Graphics
- Ben Ashbaugh - Backup of this deck contains more from Ben Ashbaugh’s excellent SIGGRAPH 2012 talk on
optimizing for Intel® Iris™ Graphics!
13
© Copyright Khronos Group, 2013 - Page 14
Questions
© Copyright Khronos Group, 2013 - Page 15
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE
FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE
INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND
THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF,
DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR
NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice.
All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current
characterized errata are available on request.
Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third
parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole
risk of the user.
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product
roadmaps.
Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when
combined with other products. For more information go to
http://www.Intel.com/performance
Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, and Ultrabook are trademarks of Intel Corporation in the United States and other countries
Legal
© Copyright Khronos Group, 2013 - Page 16
backup
© Copyright Khronos Group, 2013 - Page 17
Scheduling EUs for maximum occupancy
© Copyright Khronos Group, 2013 - Page 18
Occupancy
•Goal: Use All Execution Unit Resources
•This is harder than it sounds! Many factors to
consider… - Launch Enough Work
- More EU threads means better latency coverage to keep
an EU active - One thread sufficient to prevent an EU from going idle - Too few EU threads can result in an EU being stalled
18
© Copyright Khronos Group, 2013 - Page 19
Occupancy
•Goal: Use All Execution Unit Resources - Don’t Waste SIMD Lanes
- Use an optimal Local Work Size
- Good: Query for compiled SIMD size:
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
- Occasionally Helpful: Compile for a specific local work size
(8, 16, or 32):
__attribute__((reqd_work_group_size(X, Y, Z)))
- Best: Let the driver pick (Local Work Size == NULL)
Ideal for kernels with no barriers or shared local memory
19
© Copyright Khronos Group, 2013 - Page 20
Occupancy (continued…)
•Barriers - 16 barriers per sub slice
- Can be a limiting factor for very small local work groups
•Shared Local Memory - 64KB shared local memory per sub slice
- Can be a limiting factor for kernels that use lots of shared
local memory
20
© Copyright Khronos Group, 2013 - Page 21
Optimizing OpenCL* Applications
for Intel® Iris™ Graphics Ben Ashbaugh (Intel)
© Copyright Khronos Group, 2013 - Page 22
Agenda
•Understanding Occupancy - How Intel® Iris™ Graphics executes OpenCL* Kernels
•-
-
•-
•
22
© Copyright Khronos Group, 2013 - Page 23
How Intel® Iris™ Graphics Runs OpenCL*
© Copyright Khronos Group, 2013 - Page 24
How Intel® Iris™ Graphics Runs OpenCL*
24
1. Divide Into Work Groups
© Copyright Khronos Group, 2013 - Page 25
How Intel® Iris™ Graphics Runs OpenCL*
25
2. Divide Each Work Group Into EU Threads
© Copyright Khronos Group, 2013 - Page 26
Sub
Slic
e
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
EU EU EU EU EU
EUEUEUEUEU
How Intel® Iris™ Graphics Runs OpenCL*
26
3. Launch EU Threads for the Work Group Onto a Sub Slice - Repeat for each Work Group - Must have enough room in the Sub Slice for all EU threads for the Work Group - Not enough room in any Sub Slice EU threads must wait
© Copyright Khronos Group, 2013 - Page 27
How Intel® Iris™ Graphics Runs OpenCL*
27
• 1. Divide Into Work Groups
© Copyright Khronos Group, 2013 - Page 28
How Intel® Iris™ Graphics Runs OpenCL*
28
•2. Divide Each Work Group Into EU Threads
© Copyright Khronos Group, 2013 - Page 29
Sub
Slic
e
L1IC$
3D Sampler
Media Sampler
Data Port
Tex$
EU EU EU EU EU
EUEUEUEUEU
How Intel® Iris™ Graphics Runs OpenCL*
29
•2. Launch EU Threads for the Work Group Onto a Sub Slice - Repeat for each Work Group - Must have enough room in the Sub Slice for all EU threads for the Work Group - Not enough room in any Sub Slice EU threads must wait
© Copyright Khronos Group, 2013 - Page 30
Occupancy
•Goal: Use All Machine Resources
•This is harder than it sounds! Many factors to
consider…
1. Launch Enough Work - One thread sufficient to prevent an EU from going idle - Too few EU threads can result in an EU being stalled - More EU threads better latency coverage keeps an EU
active
30
© Copyright Khronos Group, 2013 - Page 31
Occupancy – Intel® VTune™ Amplifier XE 2013
31
© Copyright Khronos Group, 2013 - Page 32
Agenda
•-
•Memory Matters - Host to Device
-
•-
32
© Copyright Khronos Group, 2013 - Page 33
Optimizing Host to Device Transfers
33
•Host (CPU) and Device (GPU) share the same physical
memory
•For OpenCL* buffers: - No transfer needed (zero copy)!
- Allocate system memory aligned to a cache line (64 bytes)
- Create buffer with system memory pointer and
CL_MEM_USE_HOST_PTR
- Use clEnqueueMapBuffer() to access data
•For OpenCL* images: - Currently tiled in device memory transfer required
© Copyright Khronos Group, 2013 - Page 34
Operating on Buffers as Images
34
•Intel® Iris™ Graphics supports
cl_khr_image2d_from_buffer
- New OpenCL* 1.2 Extension
- Treat data as a buffer for some kernels, as an image for others
- Some restrictions for zero copy: buffer size, image pitch
0x123 0x456 0x789
Buffer Image
© Copyright Khronos Group, 2013 - Page 35
Interop with Graphics and Media APIs / SDKs
35
•Intel® Iris™ Graphics supports many Graphics and
Media interop extensions: - cl_khr_dx9_media_sharing (includes DXVA for Intel® Media SDK)
- cl_khr_d3d10_sharing
- cl_khr_d3d11_sharing
- cl_khr_gl_sharing
- cl_khr_gl_depth_images
- cl_khr_gl_event
- cl_khr_gl_msaa_sharing
Use Graphics API / SDK assets in OpenCL* with no copies!
© Copyright Khronos Group, 2013 - Page 36
Agenda
•-
•Memory Matters -
- Device Access
•-
36
© Copyright Khronos Group, 2013 - Page 37
EU
Sampler L1
L3
Sampler L2
LLC DRAM
images
buffers
256KB/slice 2-8MB/package
(shared w/ CPU)
EDRAM (non-inclusive victim cache)
128MB/package
(Intel® Iris™ Pro 5200)
Intel® Iris™ Graphics Cache Hierarchy
37
© Copyright Khronos Group, 2013 - Page 38
__global and __constant Memory • Global Memory Accesses go through the L3 Cache
• L3 Cache Line is 64 bytes
• EU thread accesses to the same L3 Cache Line are
collapsed
- Order of data within cache line does not matter
- Bandwidth determined by number of cache lines
accessed
- Maximum Bandwidth: 64 bytes / clock / sub slice
• Good: Load at least 32-bits of data at a time,
starting from a 32-bit aligned address
• Best: Load 4 x 32-bits of data at a time, starting
from a cache line aligned address
- Loading more than 4 x 32-bits of data is not beneficial
38
© Copyright Khronos Group, 2013 - Page 39
Global and Constant Memory Access Examples
1. x = data[ get_global_id(0) ]
- One cache line, full bandwidth
2. x = data[ n – get_global_id(0) ]
- Reverse order, full bandwidth
3. x = data[ get_global_id(0) + 1 ]
- Offset, two cache lines, half bandwidth
4. x = data[ get_global_id(0) * 2 ]
- Strided, half bandwidth
5. x = data[ get_global_id(0) * 16 ]
- Very strided, worst-case
39
Cache Line n
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache Line n + 1
Global ID:
Cache Line n - 1
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Cache Line n
Global ID:
Cache Line n
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache Line n + 1
Global ID:
Cache Line n
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cache Line n + 1
Global ID:
Cache Line n
0 1 2 ...
Cache Line n + 1 Cache Line n + 2 ...
Global ID:
© Copyright Khronos Group, 2013 - Page 40
__local Memory Accesses • Local Memory Accesses also go through the
L3 Cache!
• Key Difference: Local Memory is Banked
- Banked at a DWORD granularity, 16 banks
- Bandwidth determined by number of bank
conflicts
- Maximum Bandwidth: Still 64 bytes / clock /
sub slice
• Supports more access patterns with full
bandwidth than Global Memory
- No bank conflicts full bandwidth
- Reading from the same address in a bank
full bandwidth
40
© Copyright Khronos Group, 2013 - Page 41
Local Memory Access Examples
1. x = data[ get_global_id(0) + 1 ]
- Unique banks, full bandwidth
2. x = data[ get_global_id(0) & ~1 ]
- Same address read, full bandwidth
3. x = data[ get_global_id(0) * 2 ]
- Strided, half bandwidth
4. x = data[ get_global_id(0) * 16 ]
- Very strided, worst-case
5. x = data[ get_global_id(0) * 17 ]
- Full bandwidth!
41
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1
Bank:2 3 4 5 6 7 8 9 10 11 12 13 14 15
Global ID:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1
Bank:
2 3 4 5 6 7 8 9 10 11 12 13 14 15
Global ID:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1
Bank:
2 3 4 5 6 7 8 9 10 11 12 13 14 15
Global ID:
0 1 2 3 4 5 6
0
...
0 0 0 0
Bank:
0 0 0
Global ID:
... ... ... ... ... ... ...
7
0 1 2 3 4 5 6
0
...
1 2 3 4
Bank:
5 6
Global ID:
... ... ... ... ... ...
© Copyright Khronos Group, 2013 - Page 42
__private Memory
• Compiler can usually allocate Private
Memory in the Register File
- Even if Private Memory is dynamically indexed
- Good Performance
• Fallback: Private Memory allocated in
Global Memory
- Accesses are very strided
- Bad Performance
__private int
a[100]
…
…
EU
Th
read
n-1
EU
Th
read
n
EU
Th
read
n+
1
Work Item 0
Work Item 1
…
Work Item n
__private int
b[100]
__private int
c[200]
Work Item 0
Work Item 1
…
Work Item n
Work Item 0
Work Item 1
…
Work Item n
© Copyright Khronos Group, 2013 - Page 43
Agenda
•-
•-
-
•Compute Characteristics - Maximizing GFlops
43
© Copyright Khronos Group, 2013 - Page 44
ISA
•“SIMT” ISA with Predication and Branching
•“Divergent” code executes both branches Reduced
SIMD Efficiency
44
this();
if ( x )
that();
else
another();
finish();
SIMD lane
time
Example: “x” sometimes true
SIMD lane tim
e
Example: “x” never true
© Copyright Khronos Group, 2013 - Page 45
Compute GFlops
45
•EUs have 2 x 4-wide vector ALUs
•Second ALU has limitations: - Subset of instructions: add, mov, mad, mul, cmp
- Instruction must come from another EU thread
- Only float operands!
•Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL +
ADD ) x Clock Rate
For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
Add Intel® Core™ Host Processor >1TFlop!
© Copyright Khronos Group, 2013 - Page 46
Maximizing Compute Performance
46
•Use mad() / fma(): Either explicitly with built-ins, or
via -cl-mad-enable
•Use floats wherever possible to maximize co-issue
•Avoid long and size_t data types - Prefer float over int, if possible
- Using short data types may improve performance
•Trade accuracy for speed: “native” built-ins, -cl-fast-
relaxed-math - Often good enough for graphics
© Copyright Khronos Group, 2013 - Page 47
Agenda
•-
•-
-
•-
Summary / Questions
47
© Copyright Khronos Group, 2013 - Page 48
Summary
•Maximize Occupancy - Choose a Good Local Work Size - Or, Let the Driver Choose (Local Work Size == NULL)
•Avoid Host-to-Device Transfers - Create Buffers with CL_MEM_USE_HOST_PTR
•Access Device Memory Efficiently - Minimize Cache Lines for __global Memory - Minimize Bank Conflicts for __local Memory
•Maximize Compute - Avoid Divergent Branches - Use mad / fma and float Data When Possible
48
© Copyright Khronos Group, 2013 - Page 49
Questions / Acknowledgements
•This presentation would not have been possible
without material and review comments from many
people – Thank you!
•Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom
Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek,
Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun
Krisch, Berna Adalier
49