Download - OpenCL on Intel Iris™ Graphics - Khronos Group · •Intel® SDK for OpenCL ... Better Pixels on the Go and in the Cloud with OpenCL* on Intel ... risk of the user. Intel product

© Copyright Khronos Group, 2013 -

OpenCL on Intel® Iris™ Graphics

SIGGRAPH 2013 July 2013

Presenter: Adam Lake

Content: Ben Ashbaugh, Arnon Peleg, others


Agenda

•Intel® Iris™ Graphics Architecture Overview

•Tools to help get the most from OpenCL on Intel CPU

and GPU

•Additional Resources

2


Intel® Iris™ Graphics Architecture Overview

3

Rin

g B

us

/ LL

C /

Mem

ory

CommandStreamer

(CS)

Vertex Fetch(VF)

Vertex Shader

(VS)

Hull Shader(HS)

Tessellator

Domain Shader

(DS)

Geometry Shader

(GS)

Stream Out(SOL)

Clip/Setup

Thread D

ispatch

Video Front End

(VFE)

Video QualityEngine

Multi-FormatCODEC

Blitter Display

Slice 0

L3$ Pixel OpsRasterizer /

DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 1

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 0

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Slice 1


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 3

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 2

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU



4

Rin

g B

us

/ LL

C /

Mem

ory

CommandStreamer

(CS)

Vertex Fetch(VF)

Vertex Shader

(VS)

Hull Shader(HS)

Tessellator

Domain Shader

(DS)

Geometry Shader

(GS)

Stream Out(SOL)

Clip/Setup

Thread D

ispatch

Video Front End

(VFE)

Video QualityEngine

Multi-FormatCODEC

Blitter Display

Slice 0


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 1

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 0

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Slice 1


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 3

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 2

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Global Assets • Command Streamer • Thread Dispatch



5

Rin

g B

us

/ LL

C /

Mem

ory

CommandStreamer

(CS)

Vertex Fetch(VF)

Vertex Shader

(VS)

Hull Shader(HS)

Tessellator

Domain Shader

(DS)

Geometry Shader

(GS)

Stream Out(SOL)

Clip/Setup

Thread D

ispatch

Video Front End

(VFE)

Video QualityEngine

Multi-FormatCODEC

Blitter Display

Slice 0


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 1

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 0

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Slice 1


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 3

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 2

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Sub Slice • Execution Units • Samplers and Data Port • Instruction and Texture Caches



6

Rin

g B

us

/ LL

C /

Mem

ory

CommandStreamer

(CS)

Vertex Fetch(VF)

Vertex Shader

(VS)

Hull Shader(HS)

Tessellator

Domain Shader

(DS)

Geometry Shader

(GS)

Stream Out(SOL)

Clip/Setup

Thread D

ispatch

Video Front End

(VFE)

Video QualityEngine

Multi-FormatCODEC

Blitter Display

Slice 0


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 1

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 0

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Slice 1


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 3

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e 2

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU

Slice Common • L3 Cache • Shared Local Memory • Barriers


Intel® Iris™ Graphics Architecture Building Blocks

• OpenCL* Kernels run on an Execution Unit (EU)

• Each EU is a Multi-Threaded SIMD Processor

• Up to 7 threads per EU

- 128 x 8 x 32-bit registers per thread

• Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled)

- “SIMD8”, “SIMD16”, “SIMD32”

- SIMD8 More Registers

- SIMD16 and SIMD32 Better Efficiency

7

EUThread 0

Thread 2

Thread 4

Thread 6

Thread 1`

Thread 3

Thread 5



• OpenCL* Work Groups run on a

Sub Slice

- 10 EUs per Sub Slice

- Texture Sampler (Images)

- Data Port (Buffers)

- Instruction and Texture Caches

8

OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!

Sub

Slic

e

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

EU EU EU EU EU

EUEUEUEUEU


Slice


DepthRender$Depth$

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

Sub

Slic

e

EU EU EU EU EU

EUEUEUEUEU

EU EU EU EU EU

EUEUEUEUEU


• Two Sub Slices make a Slice

• Shared Resources: “Slice Common”

- L3 Cache + Shared Local Memory

- Barriers

• Intel® Iris™ Graphics has Two Slices

• How many work items in flight at once?

- 2 slides each with 2 subslices = 4 Sub Slices

- 4 subslices x 10 EUs/subslice = 40 EUs

- Up to 40 EUs x 7 threads/EU = 280 EU threads

- Up to 8960 OpenCL* work items in flight!

9


EU

Sampler L1

L3

Sampler L2

LLC DRAM

images

buffers

256KB/slice 2-8MB/package

(shared w/ CPU)

EDRAM (non-inclusive victim cache)

128MB/package

(Intel® Iris™ Pro 5200)

Intel® Iris™ Graphics Cache Hierarchy

10


Tools and Additional Resources


Occupancy – Intel® VTune™ Amplifier XE 2013

12


Additional Resources • Intel® SDK for OpenCL* Applications 2013

- Offline kernel builder, Debugger for CPU in MSVC

• Intel® OpenCL* Optimization Guide

• Intel® VTune™ Amplifier XE 2013

• Intel® Graphics Performance Analyzers - OpenCL Tuning

• Intel Linux Graphics hardware Bspec: https://01.org/linuxgraphics/

• SIGGRAPH 2013: - Faster, Better Pixels on the Go and in the Cloud with OpenCL* on Intel® Architecture

- Arnon Peleg - Optimizing OpenCL* Applications for Intel® Iris™ Graphics

- Ben Ashbaugh - Backup of this deck contains more from Ben Ashbaugh’s excellent SIGGRAPH 2012 talk on

optimizing for Intel® Iris™ Graphics!

13

http://software.intel.com/en-us/vcsource/tools/opencl-sdk

http://software.intel.com/sites/products/documentation/ioclsdk/2013/OG/index.htm





http://software.intel.com/en-us/intel-vtune-amplifier-xe



http://software.intel.com/en-us/vcsource/tools/intel-gpa

https://01.org/linuxgraphics/




Questions


INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,

BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE

FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS

OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,

MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE

INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND

THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF,

DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR

NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current

characterized errata are available on request.

Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third

parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole

risk of the user.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product

roadmaps.

Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as

SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results

to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when

combined with other products. For more information go to

http://www.Intel.com/performance

Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, and Ultrabook are trademarks of Intel Corporation in the United States and other countries

Legal


backup


Scheduling EUs for maximum occupancy


Occupancy

•Goal: Use All Execution Unit Resources

•This is harder than it sounds! Many factors to

consider… - Launch Enough Work

- More EU threads means better latency coverage to keep

an EU active - One thread sufficient to prevent an EU from going idle - Too few EU threads can result in an EU being stalled

18


Occupancy

•Goal: Use All Execution Unit Resources - Don’t Waste SIMD Lanes

- Use an optimal Local Work Size

- Good: Query for compiled SIMD size:

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

- Occasionally Helpful: Compile for a specific local work size

(8, 16, or 32):

__attribute__((reqd_work_group_size(X, Y, Z)))

- Best: Let the driver pick (Local Work Size == NULL)

Ideal for kernels with no barriers or shared local memory

19


Occupancy (continued…)

•Barriers - 16 barriers per sub slice

- Can be a limiting factor for very small local work groups

•Shared Local Memory - 64KB shared local memory per sub slice

- Can be a limiting factor for kernels that use lots of shared

local memory

20


Optimizing OpenCL* Applications

for Intel® Iris™ Graphics Ben Ashbaugh (Intel)


Agenda

•Understanding Occupancy - How Intel® Iris™ Graphics executes OpenCL* Kernels

•-

-

•-

•

22


How Intel® Iris™ Graphics Runs OpenCL*



24

1. Divide Into Work Groups



25

2. Divide Each Work Group Into EU Threads


Sub

Slic

e

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

EU EU EU EU EU

EUEUEUEUEU


26

3. Launch EU Threads for the Work Group Onto a Sub Slice - Repeat for each Work Group - Must have enough room in the Sub Slice for all EU threads for the Work Group - Not enough room in any Sub Slice EU threads must wait



27

• 1. Divide Into Work Groups



28

•2. Divide Each Work Group Into EU Threads


Sub

Slic

e

L1IC$

3D Sampler

Media Sampler

Data Port

Tex$

EU EU EU EU EU

EUEUEUEUEU


29

•2. Launch EU Threads for the Work Group Onto a Sub Slice - Repeat for each Work Group - Must have enough room in the Sub Slice for all EU threads for the Work Group - Not enough room in any Sub Slice EU threads must wait


Occupancy

•Goal: Use All Machine Resources

•This is harder than it sounds! Many factors to

consider…

1. Launch Enough Work - One thread sufficient to prevent an EU from going idle - Too few EU threads can result in an EU being stalled - More EU threads better latency coverage keeps an EU

active

30


Occupancy – Intel® VTune™ Amplifier XE 2013

31


Agenda

•-

•Memory Matters - Host to Device

-

•-

32


Optimizing Host to Device Transfers

33

•Host (CPU) and Device (GPU) share the same physical

memory

•For OpenCL* buffers: - No transfer needed (zero copy)!

- Allocate system memory aligned to a cache line (64 bytes)

- Create buffer with system memory pointer and

CL_MEM_USE_HOST_PTR

- Use clEnqueueMapBuffer() to access data

•For OpenCL* images: - Currently tiled in device memory transfer required


Operating on Buffers as Images

34

•Intel® Iris™ Graphics supports

cl_khr_image2d_from_buffer

- New OpenCL* 1.2 Extension

- Treat data as a buffer for some kernels, as an image for others

- Some restrictions for zero copy: buffer size, image pitch

0x123 0x456 0x789

Buffer Image


Interop with Graphics and Media APIs / SDKs

35

•Intel® Iris™ Graphics supports many Graphics and

Media interop extensions: - cl_khr_dx9_media_sharing (includes DXVA for Intel® Media SDK)

- cl_khr_d3d10_sharing

- cl_khr_d3d11_sharing

- cl_khr_gl_sharing

- cl_khr_gl_depth_images

- cl_khr_gl_event

- cl_khr_gl_msaa_sharing

Use Graphics API / SDK assets in OpenCL* with no copies!


Agenda

•-

•Memory Matters -

- Device Access

•-

36


EU

Sampler L1

L3

Sampler L2

LLC DRAM

images

buffers

256KB/slice 2-8MB/package

(shared w/ CPU)

EDRAM (non-inclusive victim cache)

128MB/package

(Intel® Iris™ Pro 5200)

Intel® Iris™ Graphics Cache Hierarchy

37


__global and __constant Memory • Global Memory Accesses go through the L3 Cache

• L3 Cache Line is 64 bytes

• EU thread accesses to the same L3 Cache Line are

collapsed

- Order of data within cache line does not matter

- Bandwidth determined by number of cache lines

accessed

- Maximum Bandwidth: 64 bytes / clock / sub slice

• Good: Load at least 32-bits of data at a time,

starting from a 32-bit aligned address

• Best: Load 4 x 32-bits of data at a time, starting

from a cache line aligned address

- Loading more than 4 x 32-bits of data is not beneficial

38


Global and Constant Memory Access Examples

1. x = data[ get_global_id(0) ]

- One cache line, full bandwidth

2. x = data[ n – get_global_id(0) ]

- Reverse order, full bandwidth

3. x = data[ get_global_id(0) + 1 ]

- Offset, two cache lines, half bandwidth

4. x = data[ get_global_id(0) * 2 ]

- Strided, half bandwidth


- Very strided, worst-case

39

Cache Line n

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Cache Line n + 1

Global ID:

Cache Line n - 1

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Cache Line n

Global ID:

Cache Line n

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Cache Line n + 1

Global ID:

Cache Line n

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Cache Line n + 1

Global ID:

Cache Line n

0 1 2 ...

Cache Line n + 1 Cache Line n + 2 ...

Global ID:


__local Memory Accesses • Local Memory Accesses also go through the

L3 Cache!

• Key Difference: Local Memory is Banked

- Banked at a DWORD granularity, 16 banks

- Bandwidth determined by number of bank

conflicts

- Maximum Bandwidth: Still 64 bytes / clock /

sub slice

• Supports more access patterns with full

bandwidth than Global Memory

- No bank conflicts full bandwidth

- Reading from the same address in a bank

full bandwidth

40


Local Memory Access Examples

1. x = data[ get_global_id(0) + 1 ]

- Unique banks, full bandwidth

2. x = data[ get_global_id(0) & ~1 ]

- Same address read, full bandwidth


- Strided, half bandwidth


- Very strided, worst-case


- Full bandwidth!

41

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1

Bank:2 3 4 5 6 7 8 9 10 11 12 13 14 15

Global ID:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1

Bank:

2 3 4 5 6 7 8 9 10 11 12 13 14 15

Global ID:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1

Bank:

2 3 4 5 6 7 8 9 10 11 12 13 14 15

Global ID:

0 1 2 3 4 5 6

0

...

0 0 0 0

Bank:

0 0 0

Global ID:

... ... ... ... ... ... ...

7

0 1 2 3 4 5 6

0

...

1 2 3 4

Bank:

5 6

Global ID:

... ... ... ... ... ...


__private Memory

• Compiler can usually allocate Private

Memory in the Register File

- Even if Private Memory is dynamically indexed

- Good Performance

• Fallback: Private Memory allocated in

Global Memory

- Accesses are very strided

- Bad Performance

__private int

a[100]

…

…

EU

Th

read

n-1

EU

Th

read

n

EU

Th

read

n+

1

Work Item 0

Work Item 1

…

Work Item n

__private int

b[100]

__private int

c[200]

Work Item 0

Work Item 1

…

Work Item n

Work Item 0

Work Item 1

…

Work Item n


Agenda

•-

•-

-

•Compute Characteristics - Maximizing GFlops

43


ISA

•“SIMT” ISA with Predication and Branching

•“Divergent” code executes both branches Reduced

SIMD Efficiency

44

this();

if ( x )

that();

else

another();

finish();

SIMD lane

time

Example: “x” sometimes true

SIMD lane tim

e

Example: “x” never true


Compute GFlops

45

•EUs have 2 x 4-wide vector ALUs

•Second ALU has limitations: - Subset of instructions: add, mov, mad, mul, cmp

- Instruction must come from another EU thread

- Only float operands!

•Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL +

ADD ) x Clock Rate

For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!

Add Intel® Core™ Host Processor >1TFlop!


Maximizing Compute Performance

46

•Use mad() / fma(): Either explicitly with built-ins, or

via -cl-mad-enable

•Use floats wherever possible to maximize co-issue

•Avoid long and size_t data types - Prefer float over int, if possible

- Using short data types may improve performance

•Trade accuracy for speed: “native” built-ins, -cl-fast-

relaxed-math - Often good enough for graphics


Agenda

•-

•-

-

•-

Summary / Questions

47


Summary

•Maximize Occupancy - Choose a Good Local Work Size - Or, Let the Driver Choose (Local Work Size == NULL)

•Avoid Host-to-Device Transfers - Create Buffers with CL_MEM_USE_HOST_PTR

•Access Device Memory Efficiently - Minimize Cache Lines for __global Memory - Minimize Bank Conflicts for __local Memory

•Maximize Compute - Avoid Divergent Branches - Use mad / fma and float Data When Possible

48


Questions / Acknowledgements

•This presentation would not have been possible

without material and review comments from many

people – Thank you!

•Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom

Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek,

Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun

Krisch, Berna Adalier

49

Download - OpenCL on Intel Iris™ Graphics - Khronos Group · •Intel® SDK for OpenCL ... Better Pixels on the Go and in the Cloud with OpenCL* on Intel ... risk of the user. Intel product

Top Related