HC-4018: How to Make the Most of GPU Accessible Memory, by Paul Blinzer


DESCRIPTION

Presentation HC-4018 by Paul Blinzer at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT

Page 1

BEING SPECIAL IN A UNIFIED MEMORY WORLD: HOW TO MAKE THE MOST OF GPU ACCESSIBLE MEMORY

PAUL BLINZER, FELLOW, SYSTEM SOFTWARE, AMD

Page 2

THE AGENDA

- What's so special about dealing with memory and a GPU?
  - The programmer's view of memory
  - Throwing a GPU into the mix
  - How do today's systems deal with GPU memory access?

- The many different "types" of memory today and the ways to access them
  - The various places to find them and how best to use them
  - What changes with HSA and hUMA?
  - Why a "buffered" view of memory is still important and how to deal with it

- Where to find more information

- Q & A

Page 3

WHAT'S SO SPECIAL ABOUT MEMORY ACCESS WITH A GPU?

[Figure: block diagram of an Accelerated Processing Unit (APU) next to a discrete GPU. The APU combines 1..N CPU compute units (cores with L1 data caches, instruction cache, FPU and L2, plus a shared L3) and a GPU block (H-CU engines with LDS, texture units and L1 texture caches, a global data share, instruction and constant caches, and an L2 cache) behind a shared memory controller, DDR3 system memory with cached and non-cacheable regions, and the HSA MMU (IOMMUv2). The discrete GPU hangs off PCIe with its own compute units, memory controller and GDDR5 memory. LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache.]

THERE ARE SO MANY DIFFERENT TYPES, BUSES AND CACHES INVOLVED…

Page 4

THE TYPICAL APPLICATION'S VIEW OF MEMORY (1)

- Today's operating systems have an application model based on a user-process view of the system
  - Each application is associated with a process, and the OS isolates the address space of one process from every other on the system; this is enforced by hardware (MMU = "Memory Management Unit")
  - Each CPU core may operate independently on a "thread" within that process
  - The application code has a "flat" view of memory: it can allocate memory from the OS, write and read data at an address, and so on
    - The address may be represented by a 32-bit or 64-bit (44/48-bit) wide pointer value
    - The memory content may not even be resident in physical memory; it is paged in from backing storage when accessed, possibly pushing other content out
  - CPU caches keep an often-used "working set" of data close to the CPU core's execution units
  - CPU cache coherency mechanisms invalidate cache content when "outside forces" (typically other CPU cores) update the content of system memory at a given address, ensuring that each CPU core sees the same data

A "GEDANKENEXPERIMENT", COMBINING EINSTEIN AND TRON: IMAGINE YOU ARE A CPU CORE EXECUTING AN APPLICATION THREAD, ACCESSING DATA…

[Figure: the process VA space (CPU) on the left, with user process space up to 2^47-1, kernel mode address space, and a non-canonical VA range below 2^64-1; an allocation at 0x12340000 and a GPU buffer at 0x78900000 are mapped via the CPU MMU, managed by the OS, onto scattered pages in the system physical memory space (up to 2^48-1), alongside an FB aperture. Per-process spaces (Process1, Process2) each start at 0x00000000.]

Page 5

THE TYPICAL APPLICATION'S VIEW OF MEMORY (2)

- GPUs are typically managed as devices by operating systems:
  - They can only access physical memory pages as far as the OS memory management is concerned, though the GPU may use "virtual addresses"
  - GPU-accessible system memory is "page-locked" and can't move while the memory may be accessible by the GPU, even if it's currently not used at all
  - The total amount of memory a GPU can access at a time is limited to the amount of page-locked memory or frame buffer memory

- GPU-accessible memory allocations are handled via special APIs (DirectX, OpenGL, OpenCL, etc.)
  - CreateResource(), CreateBuffer(), CreateTexture()…
  - The memory is managed as individual objects (buffers, resources, textures, …); "malloc()-ed" memory is typically not directly accessible by the GPU
  - The API typically only provides a "handle" referencing the object
  - To access the memory content (all or part of it), an API provides functions like MapResourceView(), Lock(), Unlock() or similar, establishing "windows" in the address space onto that memory for either the GPU or the CPU, or the data is put into staging buffers (see the sketch below)
  - Consider the resource "handle value + offset" as just a special kind of "address" outside of the regular process address space ☺
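To make the handle-plus-window model concrete, here is a minimal OpenCL host-side sketch of the pattern described above: the runtime hands back an opaque cl_mem handle, and the CPU only ever touches the content through a temporarily mapped window. The wrapper function and its ctx/queue parameters are illustrative assumptions, and error handling is trimmed for brevity.

```c
#include <CL/cl.h>

/* A minimal sketch of "handle + map window" access; ctx and queue are
   assumed to have been created elsewhere, error handling trimmed. */
void touch_buffer(cl_context ctx, cl_command_queue queue)
{
    cl_int err;
    /* The API returns an opaque handle, not an address. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1 << 20, NULL, &err);

    /* Open a CPU-visible "window" onto (part of) the content... */
    float *view = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                              0, 1 << 20, 0, NULL, NULL, &err);
    view[0] = 42.0f;                  /* CPU writes through the window */

    /* ...and close it again before the GPU is allowed to use the buffer. */
    clEnqueueUnmapMemObject(queue, buf, view, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```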

 

NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…

[Figure: the same CPU-side picture as before, now joined by a separate GPU virtual address space starting at 0x00000000: a Gfx allocation at 0x56780000 and a GPU buffer at 0x98765000 are mapped via the GPU MMU, managed by the graphics driver, onto GPU physical memory (the frame buffer of, e.g., a discrete card) and onto page-locked system memory pages.]

Page 6

THE TYPICAL APPLICATION'S VIEW OF MEMORY (3)

- The good thing about API-controlled access is that the OS and driver can copy the content someplace else and/or into a different format where it can be stored or processed more efficiently (e.g. 2D tiling)

- The bad thing about it is that it's an either/or style of access (see the sketch below)
  - For frequent accesses from both CPU and GPU, the translation can be tediously slow
  - Content that can be accessed by both CPU and GPU simultaneously needs data visibility/coherency rules, leading to the next issue…

- Data visibility (cache coherency) is typically software-managed
  - CPU cache coherency when accessing system memory potentially updated by a GPU may not always be guaranteed, depending on the system configuration (e.g. PCIe bus access)
  - GPU caches are typically managed explicitly by the driver and need to be refreshed when the CPU updates memory content
  - One reason is the hardware complexity required to make this performant
  - Depending on the use scenario, the GPU-accessible memory is mapped as "writethrough", "uncached" or "writecombined" by the OS APIs
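As a hedged sketch of that either/or access style, the OpenCL transfer calls below show the two one-directional paths: the driver copies (and may re-format) data on the way in, and the mirror-image call is needed to read results back. The function and its parameters are illustrative; the kernels that would run in between are elided.

```c
#include <CL/cl.h>

/* Sketch of either/or access: CPU-side data goes in through one copy,
   results come back through another. buf is assumed valid and large enough. */
void update_then_read_back(cl_command_queue queue, cl_mem buf,
                           const float *src, float *dst, size_t bytes)
{
    /* CPU -> GPU: the driver copies (and may re-format/tile) the content */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, src, 0, NULL, NULL);

    /* ... enqueue kernels that consume and update buf here ... */

    /* GPU -> CPU: the blocking read also makes the GPU's writes visible */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, dst, 0, NULL, NULL);
}
```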

[Figure: the same address-space diagram as on Page 5.]

Page 7

IT'S ALL ABOUT THROUGHPUT, BANDWIDTH AND LATENCY… KEEP YOUR DATA CLOSE AND YOUR FREQUENTLY USED DATA EVEN CLOSER…

[Figure: the APU/discrete-GPU block diagram from Page 3, annotated with throughput numbers: ~17 GB/s to DDR3-2133 system memory on each side, ~15 GB/s across x16 PCIe 3.0, and ~90 GB/s to GDDR5 (3 GHz MCLK) on the discrete card. Memory and bus latency runs tens to hundreds of cycles; caches deliver hundreds of GB/s (hundreds to thousands on the GPU side) at <1 to tens of cycles of latency.]

Page 8

IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (1)

- The efficient use of a GPU and CPU in a system depends on understanding how they operate on memory
  - The cache architecture on both CPU and GPU reflects the different access patterns of their "preferred" workloads and data, and so does the cache management and optimization

- CPUs are typically built to operate on general-purpose, serial instruction threads, often with high data locality, lots of conditional execution, and data interdependencies to deal with
  - The CPU cache hierarchy is focused on general-purpose data access from/to the execution units, feeding previously computed data back to the execution units with very low latency
  - Comparatively few registers (vs. GPUs), but large caches keep often-used "arbitrary" data close to the execution units

- GPUs are usually built for a SIMD execution model
  - Apply the same sequence of instructions over and over on data with little variation but high throughput ("streaming data"), passing the data from one processing stage to another (latency tolerance)
  - Compute units have a relatively large register file store
  - They use a lot of "specialty caches" (constant cache, texture cache, etc.), with data caches optimized for software data prefetch
  - LDS and GDS are mainly used for intra-wavefront or inter-wavefront updates and synchronization (see the kernel sketch below)
  - Data caches are typically explicitly flushed by software
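As a sketch of how the LDS shows up in software, here is a small OpenCL C kernel (illustrative, not from the presentation) that stages data in __local memory, which maps onto the LDS on AMD hardware, and uses work-group barriers for the in-group synchronization mentioned above.

```c
/* Work-group reduction using __local memory (LDS). The scratch buffer is
   sized to the work-group size by the host via clSetKernelArg. */
__kernel void group_sum(__global const float *in, __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];     /* stage into LDS */
    barrier(CLK_LOCAL_MEM_FENCE);            /* in-group synchronization */

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];   /* one partial sum per group */
}
```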

Page 9

IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (2)

- The GPU memory and cache access design is well suited for typical 2D and 3D graphics workloads (duh!)
  - Vertex data, textures, etc. are passed from the host to the various stages of the graphics API pipeline, with each stage allowing processing of the data passing through via appropriate instruction sequences ("shaders")
  - Since a lot of the data is "static" and the access is abstracted via APIs, it can be put into better-suited data formats that map 2D/3D pixel-coordinate "locality" to memory locality in internal buffers within the graphics pipeline (see the figure and sketch below)
    - Very beneficial for performance, but not easily "accessible" by simple addressing schemes; it requires a copy of the data first
  - Today's graphics APIs (OpenGL, Direct3D) are well suited for this workload, but often must target the lowest common denominator in hardware capabilities
  - The API design assumes that no cache coherency between CPU and GPU may exist, requiring the CPU to issue explicit cache flushes or operate on memory areas mapped as "uncached" if readback of GPU data is required
    - Some extensions or recently introduced features provide "zero copy" memory

[Figure: 2D tiling. The surface is split into 16x16 tiles along the X and Y coordinates; within a tile, memory addresses run row-major: X0,Y0 X1,Y0 X2,Y0 … X15,Y0 X0,Y1 X1,Y1 X2,Y1 … X15,Y14 X0,Y15 X1,Y15 X2,Y15 …]
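To illustrate why tiled layouts aren't reachable through simple flat addressing, here is a hypothetical address calculation for the plain 16x16 tiling in the figure: tiles laid out row-major across the surface, texels row-major within each tile. Real hardware tiling modes are vendor- and channel-layout-specific and considerably more elaborate.

```c
#include <stddef.h>

/* Hypothetical 16x16 tile-linear offset, matching the figure above.
   Assumes width_in_texels is a multiple of 16; returns an offset in texels. */
size_t tiled_offset(size_t x, size_t y, size_t width_in_texels)
{
    const size_t T = 16;                        /* tile edge, as in the figure */
    size_t tiles_per_row = width_in_texels / T;
    size_t tile_index = (y / T) * tiles_per_row + (x / T);
    size_t in_tile    = (y % T) * T + (x % T);  /* row-major inside the tile */
    return tile_index * (T * T) + in_tile;
}
```

Note how two texels adjacent in Y land only 16 addresses apart instead of a full surface row apart; that 2D locality in memory is exactly what the tiling buys.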

Page 10

IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (3)

- Vector/matrix-oriented compute workloads map well to GPUs, but until now they have "suffered" from some of the choices that benefit the graphics data-processing flow
  - Compute APIs like OpenCL™ or DirectCompute are often still inherently tied to the low-level, graphics-focused GPU infrastructure in today's OSes (e.g. memory management through Microsoft® WDDM or Linux® TTM/GEM)
  - "Zero copy" support and system memory buffer cache coherency in recent APIs improve the behavior on some platforms with appropriate support, but some software overhead for access remains (see the sketch below)
  - All the memory processed by the GPU is referenced through handles to control memory page-locking on workload dispatch, and the software needs to create "buffer views", either explicitly or under the covers, to access regular memory
    - There is quite some software overhead involved in that

- Discrete GPUs have excellent compute performance (several teraFLOPS even for mid-range cards)
  - But they require the data to be accessible in local memory for best performance, requiring copy operations from host memory and "keeping the data on the other side" as long as possible
  - Accessing or pushing the data back and forth through the PCIe bottleneck may reduce or eliminate speedup gains, or substantially increase access latency from the host
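A sketch of the "zero copy" path mentioned above, in OpenCL terms: asking the runtime for a host-accessible allocation and then mapping it, instead of staging a copy. Whether the map is truly copy-free depends on the platform and flags; the function is illustrative and error handling is trimmed.

```c
#include <CL/cl.h>

/* Request an allocation the host can reach directly; on platforms with
   appropriate support the map below returns a pointer into the very pages
   the GPU will read, so no staging copy is made. */
void zero_copy_fill(cl_context ctx, cl_command_queue queue, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

    unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
        queue, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < bytes; ++i)
        p[i] = (unsigned char)i;      /* CPU initializes in place */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```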

 

Page 11

HOW DO hUMA AND HSA CHANGE THINGS?

- First, let's redraw the address layout map from before…
  - It's the same layout, just a different visualization (focus on bit 47 ☺)
  - There is efficient hardware support for GPU and CPU cache coherency on memory load/store operations by the GPU
    - Reads and updates of system memory by one processor cause cache line flushes or line invalidations on the other processors in the system
  - Software no longer has to deal with explicit cache line flushes or invalidations for such transactions; it works as for any CPU core in the system
  - This fully works for APUs, where GPU and CPU have access to the same system memory controller; there is partial support for discrete GPUs
  - The GPU's virtual address page table mapping is set to a process address view of the memory space
    - A data pointer has the same "meaning" (= points to the same content) in system memory (also known as "ptr-is-ptr"); see the sketch below
  - On OSes that support HSA MMU functionality, the page tables may even be shared, and the OS may support native GPU demand paging
    - The GPU may still support additional address ranges for special purposes (e.g. frame buffer memory, LDS, scratch, …)
  - Platform atomics are supported, for efficient synchronization
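One concrete API expression of the "ptr-is-ptr" model is OpenCL 2.0 shared virtual memory, which was being finalized around the time of this talk; the sketch below assumes a platform with fine-grained SVM support and an already-built kernel taking a single SVM pointer argument. With fine-grained SVM, plain CPU loads and stores and the GPU's accesses target the same address, with no map/unmap and no explicit cache maintenance.

```c
#include <CL/cl.h>

/* "ptr-is-ptr" via OpenCL 2.0 fine-grained SVM (assumed supported). */
void run_on_shared_memory(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel, size_t n)
{
    float *data = (float *)clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i)
        data[i] = (float)i;                      /* plain CPU stores */

    clSetKernelArgSVMPointer(kernel, 0, data);   /* same pointer, GPU side */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    float first = data[0];                       /* plain CPU load sees result */
    (void)first;
    clSVMFree(ctx, data);
}
```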

[Figure: the redrawn address map. The GPU virtual address space is now mapped via the HSA MMU and managed jointly by the OS and the graphics driver, so each user process space (up to 2^47-1, with its allocation at 0x12340000, GPU buffer at 0x78900000 and FB aperture) presents the same view to CPU and GPU over the system physical memory space; GPU physical memory (e.g. on a discrete card) remains available for the frame buffer and Gfx allocations.]

Page 12

THERE ARE STILL REASONS FOR THE "BUFFERED VIEW" OF MEMORY

- HSA and hUMA are very useful for compute jobs and for graphics data often updated by the host CPU
  - They allow fine-grained, "interactive" sharing of data between CPU and GPU threads without requiring prophylactic cache flushes and other synchronization

- But the "direct view" of and access to common memory is less beneficial for other graphics data
  - Many graphics algorithms have been designed with an "abstract" or "deferred" view of memory, focusing on "dimensional addressing" of the data in the shaders (e.g. x/y/z, u/v coordinates)
  - Many GPUs use hardware-specific texture tiling formats that are optimized for a specific memory channel layout to reach maximum performance; these are complicated to address in software in a general way
  - An application may have multiple graphics contexts concurrently per process (per API), vs. just one for the "flat" view
  - A lot of graphics data (e.g. textures, vertices, et al.) does not change often through CPU updates
    - Requiring cache coherency increases hardware access overhead for little benefit
  - Many specialty resources (e.g. the Z-buffer) have GPU-specific implementations with no "external" visibility
  - Leveraging the much higher performance of a discrete GPU and its frame buffer memory is somewhat more complicated if an application needs to deal with the memory location directly

- Most common graphics APIs today don't know how to deal with virtual addresses
  - This will change in the future as utilizing virtual addresses within graphics APIs becomes commonplace

Page 13

GRAPHICS INTEROPERATION IS IMPORTANT

- There are many different graphics/GPU APIs in use, using buffers/resources to access memory
  - As seen before, there are good reasons to keep the content in "buffers", whether due to legacy or performance
  - It also may not make sense to "waste" virtual address space, e.g. in 32-bit apps, on resources not accessed by the host
  - But this may also make it harder to access the content from either the CPU or a "flat addressing"-aware GPU

- Explicit interoperation APIs with traditional graphics APIs provide two views of a resource (see the sketch below)
  - The translation between "handle + offset" and "flat address" is dealt with in the runtime and driver
  - The translation itself may nevertheless be straightforward and very efficient

- Specialty GPU resources (e.g. LDS, scratch) may be mapped into the "flat" process address space, but may not be accessible by the CPU host since they're not reachable from the "outside"
  - This is no different from some other system memory mappings provided by the OS

- Applications should focus on efficient processing of the data on the "compute" side, with a dedicated handover to the "graphics" side when appropriate
  - As graphics APIs are updated over time to take advantage of flat addressing models (e.g. for "bindless textures"), the need for the interoperation mechanisms may gradually vanish for most graphics data
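A sketch of such an explicit interoperation API, using OpenCL/OpenGL buffer sharing as the example: the same physical resource is seen once as a GL handle and once as a cl_mem, and the acquire/release pair is where the runtime performs the ownership handover (and any handle-to-address translation). The context is assumed to have been created sharing the GL context, vbo is an existing GL buffer object, and error handling is trimmed.

```c
#include <CL/cl.h>
#include <CL/cl_gl.h>

/* Compute into an OpenGL buffer object through the interop view. */
void compute_into_gl_buffer(cl_context ctx, cl_command_queue queue,
                            cl_kernel kernel, unsigned int vbo, size_t n)
{
    cl_int err;
    cl_mem shared = clCreateFromGLBuffer(ctx, CL_MEM_WRITE_ONLY, vbo, &err);

    clEnqueueAcquireGLObjects(queue, 1, &shared, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &shared);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &shared, 0, NULL, NULL);

    clFinish(queue);                /* hand the result back to the GL side */
    clReleaseMemObject(shared);
}
```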

Page 14

ADDITIONAL CONSIDERATIONS

- A lot of today's PC systems have more than one GPU available to the programmer
  - Almost all of today's CPUs are actually APUs and have both CPU and GPU on chip, using the same memory controller
  - On performance systems, a discrete GPU with dedicated frame buffer memory may be present, too

- The integrated GPU may support cache coherency for system memory updates and is therefore preferable for GPU compute tasks via e.g. DirectCompute or OpenCL™ (see the device-selection sketch below)
  - The performance uplift vs. the CPU may differ, but there is often a >10x factor for vector computations vs. equivalent CPU instructions

- The discrete GPU can focus on graphics workload acceleration, further processing the data pre-processed by either the host CPU or the integrated GPU for further uplift
  - Dedicated transfers from/to the discrete GPU frame buffer
  - For appropriate compute workloads, consider the additional performance uplift through compute on the discrete GPU

- The controls may live in a driver as part of collaborative rendering (e.g. AMD Dual Graphics), where the compute processing on the integrated GPU via appropriate APIs interoperates with the "graphics" device
  - The graphics driver operates the integrated and discrete GPU in a "CrossFire" mode
  - Whereas the compute work runs on a DirectCompute or OpenCL™ "device" on the integrated GPU
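A sketch of how an application might pick the integrated GPU for compute while leaving the discrete GPU to graphics: among the platform's GPU devices, prefer one reporting host-unified memory. CL_DEVICE_HOST_UNIFIED_MEMORY is a standard OpenCL 1.1 query; the selection policy itself is an illustrative assumption.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Prefer a GPU that shares the memory controller with the CPU (an APU's
   integrated GPU); fall back to any GPU, e.g. a discrete card. */
cl_device_id pick_compute_gpu(cl_platform_id platform)
{
    cl_device_id devs[8];
    cl_uint n = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devs, &n);

    for (cl_uint i = 0; i < n; ++i) {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(devs[i], CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);
        if (unified)
            return devs[i];        /* integrated GPU: coherent system memory */
    }
    return n > 0 ? devs[0] : NULL;
}
```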

Page 15

SUMMARY

- HSA and hUMA substantially simplify data exchange between GPU and CPU, allowing the data to be processed on both sides
  - They benefit from a flat address model where data pointer references to content can be resolved on either side
  - This works best for compute-heavy workloads where frequent data updates and result retrieval are important

- There are still benefits to keeping some graphics data in a "buffered" address mode through graphics APIs
  - This leverages "specialty caches", the discrete GPU, and storage within the GPU that is optimized for graphics data but makes it "less accessible" for CPU host access

- With appropriate, efficient interoperation between the "buffered" and the "flat" resource views on the GPU, the application can easily traverse between these two data representations
  - An HSA-compliant GPU allows for a very efficient translation between these two representations
  - Current compute and graphics APIs can be supported in this scheme
  - With native support for a "flat model" in upcoming modern OSes, direct, "flat", cache-coherent references to memory resources will become easier to use directly over time, reducing the need for explicit translation

- Take advantage of all the GPUs and all the memory you find on a system!
  - There's often more than one, and all have their advantages

Page 16

WHERE TO FIND MORE INFORMATION

- AMD Accelerated Parallel Processing (APP) SDK:
  - http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
  - The AMD APP SDK is a complete development platform, providing samples, documentation and other materials to quickly get you started using OpenCL™, Bolt (an open-source C++ template library for GPU parallel processing), C++ AMP or Aparapi for Java applications

- AMD CodeXL:
  - http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
  - A powerful tools suite for Windows® and Linux® heterogeneous application debugging and profiling
  - Works standalone and, e.g., integrated as a Visual Studio extension

- AMD Developer Central: http://developer.amd.com
  - Docs, whitepapers, tools; everything you want to know and need to write performant programs on heterogeneous systems
  - It's not about either CPU or GPU, it's about both…

THIS PRESENTATION IS ONLY A START…

Page 17

GO AHEAD ☺

Page 18

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

 

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc., Linux is a trademark of Linus Torvalds, and Microsoft is a trademark of Microsoft Corp. PCI Express is a trademark of the PCI SIG. Other names are for informational purposes only and may be trademarks of their respective owners.