HC-4018: How to Make the Most of GPU Accessible Memory, by Paul Blinzer


DESCRIPTION

Presentation HC-4018 by Paul Blinzer at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT

Page 1

BEING SPECIAL IN A UNIFIED MEMORY WORLD: HOW TO MAKE THE MOST OF GPU ACCESSIBLE MEMORY

PAUL BLINZER, FELLOW, SYSTEM SOFTWARE, AMD

Page 2

THE AGENDA

- What's so special about dealing with memory and a GPU?
  - The programmer's view of memory
  - Throwing a GPU into the mix
  - How do today's systems deal with GPU memory access?

- The many different "types" of memory today and the ways to access them
  - The various places to find them and how best to use them
  - What changes with HSA and hUMA?
  - Why a "buffered" view of memory is still important and how to deal with it

- Where to find more information

- Q & A

Page 3

WHAT'S SO SPECIAL ABOUT MEMORY ACCESS WITH A GPU?

[Figure: block diagram of an Accelerated Processing Unit (APU) next to a discrete GPU. The APU combines 1..N CPU compute units (cores with L1 data caches, instruction cache, FPU and L2, plus a shared L3) and a GPU block (H-CU engines with LDS, texture units and L1 texture caches, a global data share, instruction and constant caches, and an L2 cache) behind a shared memory controller, DDR3 system memory with cached and non-cacheable regions, and the HSA MMU (IOMMUv2). The discrete GPU hangs off PCIe with its own compute units, memory controller and GDDR5 memory. LDS = Local Data Share, TU = Texture Unit, TC = Texture Cache.]

THERE ARE SO MANY DIFFERENT TYPES, BUSES AND CACHES INVOLVED…

Page 4

THE TYPICAL APPLICATION'S VIEW OF MEMORY (1)

- Today's operating systems have an application model based on a user-process view of the system
  - Each application is associated with a process, and the OS isolates the address space of one process from every other on the system; this is enforced by hardware (MMU = "Memory Management Unit")
  - Each CPU core may operate independently on a "thread" within that process
  - The application code has a "flat" view of memory: it can allocate memory from the OS, write and read data at an address, and so on
    - The address may be represented by a 32-bit or 64-bit (44/48-bit) wide pointer value
    - The memory content may not even be resident in physical memory; it is paged in from backing storage when accessed, possibly pushing other content out
  - CPU caches keep an often-used "working set" of data close to the CPU core's execution units
  - CPU cache coherency mechanisms invalidate cache content when "outside forces" (typically other CPU cores) update the content of system memory at a given address, ensuring that each CPU core sees the same data

A "GEDANKENEXPERIMENT", COMBINING EINSTEIN AND TRON: IMAGINE YOU ARE A CPU CORE EXECUTING AN APPLICATION THREAD, ACCESSING DATA…

[Figure: the process VA space (CPU) on the left, with user process space up to 2^47-1, kernel mode address space, and a non-canonical VA range below 2^64-1; an allocation at 0x12340000 and a GPU buffer at 0x78900000 are mapped via the CPU MMU, managed by the OS, onto scattered pages in the system physical memory space (up to 2^48-1), alongside an FB aperture. Per-process spaces (Process1, Process2) each start at 0x00000000.]

Page 5

THE TYPICAL APPLICATION'S VIEW OF MEMORY (2)

- GPUs are typically managed as devices by operating systems:
  - They can only access physical memory pages as far as the OS memory management is concerned, though the GPU may use "virtual addresses"
  - GPU-accessible system memory is "page-locked" and can't move while the memory may be accessible by the GPU, even if it's currently not used at all
  - The total amount of memory a GPU can access at a time is limited to the amount of page-locked memory or frame buffer memory

- GPU-accessible memory allocations are handled via special APIs (DirectX, OpenGL, OpenCL, etc.)
  - CreateResource(), CreateBuffer(), CreateTexture()…
  - The memory is managed as individual objects (buffers, resources, textures, …); "malloc()-ed" memory is typically not directly accessible by the GPU
  - The API typically only provides a "handle" referencing the object
  - To access the memory content (all or part of it), an API provides functions like MapResourceView(), Lock(), Unlock() or similar, establishing "windows" in the address space onto that memory for either the GPU or the CPU, or the data is put into staging buffers (see the sketch below)
  - Consider the resource "handle value + offset" as just a special kind of "address" outside of the regular process address space ☺
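To make the handle-plus-window model concrete, here is a minimal OpenCL host-side sketch of the pattern described above: the runtime hands back an opaque cl_mem handle, and the CPU only ever touches the content through a temporarily mapped window. The wrapper function and its ctx/queue parameters are illustrative assumptions, and error handling is trimmed for brevity.

```c
#include <CL/cl.h>

/* A minimal sketch of "handle + map window" access; ctx and queue are
   assumed to have been created elsewhere, error handling trimmed. */
void touch_buffer(cl_context ctx, cl_command_queue queue)
{
    cl_int err;
    /* The API returns an opaque handle, not an address. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1 << 20, NULL, &err);

    /* Open a CPU-visible "window" onto (part of) the content... */
    float *view = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                              0, 1 << 20, 0, NULL, NULL, &err);
    view[0] = 42.0f;                  /* CPU writes through the window */

    /* ...and close it again before the GPU is allowed to use the buffer. */
    clEnqueueUnmapMemObject(queue, buf, view, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```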

 

NOW LET'S SEE HOW A GPU SEES THAT SAME MEMORY TODAY AND ADDS TO IT…

[Figure: the same CPU-side picture as before, now joined by a separate GPU virtual address space starting at 0x00000000: a Gfx allocation at 0x56780000 and a GPU buffer at 0x98765000 are mapped via the GPU MMU, managed by the graphics driver, onto GPU physical memory (the frame buffer of, e.g., a discrete card) and onto page-locked system memory pages.]

Page 6

THE TYPICAL APPLICATION'S VIEW OF MEMORY (3)

- The good thing about API-controlled access is that the OS and driver can copy the content someplace else and/or into a different format where it can be stored or processed more efficiently (e.g. 2D tiling)

- The bad thing about it is that it's an either/or style of access (see the sketch below)
  - For frequent accesses from both CPU and GPU, the translation can be tediously slow
  - Content that can be accessed by both CPU and GPU simultaneously needs data visibility/coherency rules, leading to the next issue…

- Data visibility (cache coherency) is typically software-managed
  - CPU cache coherency when accessing system memory potentially updated by a GPU may not always be guaranteed, depending on the system configuration (e.g. PCIe bus access)
  - GPU caches are typically managed explicitly by the driver and need to be refreshed when the CPU updates memory content
  - One reason is the hardware complexity required to make this performant
  - Depending on the use scenario, the GPU-accessible memory is mapped as "writethrough", "uncached" or "writecombined" by the OS APIs
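As a hedged sketch of that either/or access style, the OpenCL transfer calls below show the two one-directional paths: the driver copies (and may re-format) data on the way in, and the mirror-image call is needed to read results back. The function and its parameters are illustrative; the kernels that would run in between are elided.

```c
#include <CL/cl.h>

/* Sketch of either/or access: CPU-side data goes in through one copy,
   results come back through another. buf is assumed valid and large enough. */
void update_then_read_back(cl_command_queue queue, cl_mem buf,
                           const float *src, float *dst, size_t bytes)
{
    /* CPU -> GPU: the driver copies (and may re-format/tile) the content */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, src, 0, NULL, NULL);

    /* ... enqueue kernels that consume and update buf here ... */

    /* GPU -> CPU: the blocking read also makes the GPU's writes visible */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, dst, 0, NULL, NULL);
}
```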

[Figure: the same address-space diagram as on Page 5.]

Page 7

IT'S ALL ABOUT THROUGHPUT, BANDWIDTH AND LATENCY… KEEP YOUR DATA CLOSE AND YOUR FREQUENTLY USED DATA EVEN CLOSER…

[Figure: the APU/discrete-GPU block diagram from Page 3, annotated with throughput numbers: ~17 GB/s to DDR3-2133 system memory on each side, ~15 GB/s across x16 PCIe 3.0, and ~90 GB/s to GDDR5 (3 GHz MCLK) on the discrete card. Memory and bus latency runs tens to hundreds of cycles; caches deliver hundreds of GB/s (hundreds to thousands on the GPU side) at <1 to tens of cycles of latency.]

Page 8

IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (1)

- The efficient use of a GPU and CPU in a system depends on understanding how they operate on memory
  - The cache architecture on both CPU and GPU reflects the different access patterns of their "preferred" workloads and data, and so does the cache management and optimization

- CPUs are typically built to operate on general-purpose, serial instruction threads, often with high data locality, lots of conditional execution, and data interdependencies to deal with
  - The CPU cache hierarchy is focused on general-purpose data access from/to the execution units, feeding previously computed data back to the execution units with very low latency
  - Comparatively few registers (vs. GPUs), but large caches keep often-used "arbitrary" data close to the execution units

- GPUs are usually built for a SIMD execution model
  - Apply the same sequence of instructions over and over on data with little variation but high throughput ("streaming data"), passing the data from one processing stage to another (latency tolerance)
  - Compute units have a relatively large register file store
  - They use a lot of "specialty caches" (constant cache, texture cache, etc.), with data caches optimized for software data prefetch
  - LDS and GDS are mainly used for intra-wavefront or inter-wavefront updates and synchronization (see the kernel sketch below)
  - Data caches are typically explicitly flushed by software
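As a sketch of how the LDS shows up in software, here is a small OpenCL C kernel (illustrative, not from the presentation) that stages data in __local memory, which maps onto the LDS on AMD hardware, and uses work-group barriers for the in-group synchronization mentioned above.

```c
/* Work-group reduction using __local memory (LDS). The scratch buffer is
   sized to the work-group size by the host via clSetKernelArg. */
__kernel void group_sum(__global const float *in, __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];     /* stage into LDS */
    barrier(CLK_LOCAL_MEM_FENCE);            /* in-group synchronization */

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];   /* one partial sum per group */
}
```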

Page 9

IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (2)

- The GPU memory and cache access design is well suited for typical 2D and 3D graphics workloads (duh!)
  - Vertex data, textures, etc. are passed from the host to the various stages of the graphics API pipeline, with each stage allowing processing of the data passing through via appropriate instruction sequences ("shaders")
  - Since a lot of the data is "static" and the access is abstracted via APIs, it can be put into better-suited data formats that map 2D/3D pixel-coordinate "locality" to memory locality in internal buffers within the graphics pipeline (see the figure and sketch below)
    - Very beneficial for performance, but not easily "accessible" by simple addressing schemes; it requires a copy of the data first
  - Today's graphics APIs (OpenGL, Direct3D) are well suited for this workload, but often must target the lowest common denominator in hardware capabilities
  - The API design assumes that no cache coherency between CPU and GPU may exist, requiring the CPU to issue explicit cache flushes or operate on memory areas mapped as "uncached" if readback of GPU data is required
    - Some extensions or recently introduced features provide "zero copy" memory

[Figure: 2D tiling. The surface is split into 16x16 tiles along the X and Y coordinates; within a tile, memory addresses run row-major: X0,Y0 X1,Y0 X2,Y0 … X15,Y0 X0,Y1 X1,Y1 X2,Y1 … X15,Y14 X0,Y15 X1,Y15 X2,Y15 …]
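To illustrate why tiled layouts aren't reachable through simple flat addressing, here is a hypothetical address calculation for the plain 16x16 tiling in the figure: tiles laid out row-major across the surface, texels row-major within each tile. Real hardware tiling modes are vendor- and channel-layout-specific and considerably more elaborate.

```c
#include <stddef.h>

/* Hypothetical 16x16 tile-linear offset, matching the figure above.
   Assumes width_in_texels is a multiple of 16; returns an offset in texels. */
size_t tiled_offset(size_t x, size_t y, size_t width_in_texels)
{
    const size_t T = 16;                        /* tile edge, as in the figure */
    size_t tiles_per_row = width_in_texels / T;
    size_t tile_index = (y / T) * tiles_per_row + (x / T);
    size_t in_tile    = (y % T) * T + (x % T);  /* row-major inside the tile */
    return tile_index * (T * T) + in_tile;
}
```

Note how two texels adjacent in Y land only 16 addresses apart instead of a full surface row apart; that 2D locality in memory is exactly what the tiling buys.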

Page 10

IT'S ALL ABOUT THE RIGHT TOOL FOR THE JOB (3)

- Vector/matrix-oriented compute workloads map well to GPUs, but until now they have "suffered" from some of the choices that benefit the graphics data-processing flow
  - Compute APIs like OpenCL™ or DirectCompute are often still inherently tied to the low-level, graphics-focused GPU infrastructure in today's OSes (e.g. memory management through Microsoft® WDDM or Linux® TTM/GEM)
  - "Zero copy" support and system memory buffer cache coherency in recent APIs improve the behavior on some platforms with appropriate support, but some software overhead for access remains (see the sketch below)
  - All the memory processed by the GPU is referenced through handles to control memory page-locking on workload dispatch, and the software needs to create "buffer views", either explicitly or under the covers, to access regular memory
    - There is quite some software overhead involved in that

- Discrete GPUs have excellent compute performance (several teraFLOPS even for mid-range cards)
  - But they require the data to be accessible in local memory for best performance, requiring copy operations from host memory and "keeping the data on the other side" as long as possible
  - Accessing or pushing the data back and forth through the PCIe bottleneck may reduce or eliminate speedup gains, or substantially increase access latency from the host
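A sketch of the "zero copy" path mentioned above, in OpenCL terms: asking the runtime for a host-accessible allocation and then mapping it, instead of staging a copy. Whether the map is truly copy-free depends on the platform and flags; the function is illustrative and error handling is trimmed.

```c
#include <CL/cl.h>

/* Request an allocation the host can reach directly; on platforms with
   appropriate support the map below returns a pointer into the very pages
   the GPU will read, so no staging copy is made. */
void zero_copy_fill(cl_context ctx, cl_command_queue queue, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

    unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
        queue, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < bytes; ++i)
        p[i] = (unsigned char)i;      /* CPU initializes in place */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```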

 

Page 11

HOW DO hUMA AND HSA CHANGE THINGS?

- First, let's redraw the address layout map from before…
  - It's the same layout, just a different visualization (focus on bit 47 ☺)
  - There is efficient hardware support for GPU and CPU cache coherency on memory load/store operations by the GPU
    - Reads and updates of system memory by one processor cause cache line flushes or line invalidations on the other processors in the system
  - Software no longer has to deal with explicit cache line flushes or invalidations for such transactions; it works as for any CPU core in the system
  - This fully works for APUs, where GPU and CPU have access to the same system memory controller; there is partial support for discrete GPUs
  - The GPU's virtual address page table mapping is set to a process address view of the memory space
    - A data pointer has the same "meaning" (= points to the same content) in system memory (also known as "ptr-is-ptr"); see the sketch below
  - On OSes that support HSA MMU functionality, the page tables may even be shared, and the OS may support native GPU demand paging
    - The GPU may still support additional address ranges for special purposes (e.g. frame buffer memory, LDS, scratch, …)
  - Platform atomics are supported, for efficient synchronization
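One concrete API expression of the "ptr-is-ptr" model is OpenCL 2.0 shared virtual memory, which was being finalized around the time of this talk; the sketch below assumes a platform with fine-grained SVM support and an already-built kernel taking a single SVM pointer argument. With fine-grained SVM, plain CPU loads and stores and the GPU's accesses target the same address, with no map/unmap and no explicit cache maintenance.

```c
#include <CL/cl.h>

/* "ptr-is-ptr" via OpenCL 2.0 fine-grained SVM (assumed supported). */
void run_on_shared_memory(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel, size_t n)
{
    float *data = (float *)clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);

    for (size_t i = 0; i < n; ++i)
        data[i] = (float)i;                      /* plain CPU stores */

    clSetKernelArgSVMPointer(kernel, 0, data);   /* same pointer, GPU side */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    float first = data[0];                       /* plain CPU load sees result */
    (void)first;
    clSVMFree(ctx, data);
}
```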

[Figure: the redrawn address map. The GPU virtual address space is now mapped via the HSA MMU and managed jointly by the OS and the graphics driver, so each user process space (up to 2^47-1, with its allocation at 0x12340000, GPU buffer at 0x78900000 and FB aperture) presents the same view to CPU and GPU over the system physical memory space; GPU physical memory (e.g. on a discrete card) remains available for the frame buffer and Gfx allocations.]

Page 12

THERE ARE STILL REASONS FOR THE "BUFFERED VIEW" OF MEMORY

- HSA and hUMA are very useful for compute jobs and for graphics data often updated by the host CPU
  - They allow fine-grained, "interactive" sharing of data between CPU and GPU threads without requiring prophylactic cache flushes and other synchronization

- But the "direct view" of and access to common memory is less beneficial for other graphics data
  - Many graphics algorithms have been designed with an "abstract" or "deferred" view of memory, focusing on "dimensional addressing" of the data in the shaders (e.g. x/y/z, u/v coordinates)
  - Many GPUs use hardware-specific texture tiling formats that are optimized for a specific memory channel layout to reach maximum performance; these are complicated to address in software in a general way
  - An application may have multiple graphics contexts concurrently per process (per API), vs. just one for the "flat" view
  - A lot of graphics data (e.g. textures, vertices, et al.) does not change often through CPU updates
    - Requiring cache coherency increases hardware access overhead for little benefit
  - Many specialty resources (e.g. the Z-buffer) have GPU-specific implementations with no "external" visibility
  - Leveraging the much higher performance of a discrete GPU and its frame buffer memory is somewhat more complicated if an application needs to deal with the memory location directly

- Most common graphics APIs today don't know how to deal with virtual addresses
  - This will change in the future as utilizing virtual addresses within graphics APIs becomes commonplace

Page 13

GRAPHICS INTEROPERATION IS IMPORTANT

- There are many different graphics/GPU APIs in use, using buffers/resources to access memory
  - As seen before, there are good reasons to keep the content in "buffers", whether due to legacy or performance
  - It also may not make sense to "waste" virtual address space, e.g. in 32-bit apps, on resources not accessed by the host
  - But this may also make it harder to access the content from either the CPU or a "flat addressing"-aware GPU

- Explicit interoperation APIs with traditional graphics APIs provide two views of a resource (see the sketch below)
  - The translation between "handle + offset" and "flat address" is dealt with in the runtime and driver
  - The translation itself may nevertheless be straightforward and very efficient

- Specialty GPU resources (e.g. LDS, scratch) may be mapped into the "flat" process address space, but may not be accessible by the CPU host since they're not reachable from the "outside"
  - This is no different from some other system memory mappings provided by the OS

- Applications should focus on efficient processing of the data on the "compute" side, with a dedicated handover to the "graphics" side when appropriate
  - As graphics APIs are updated over time to take advantage of flat addressing models (e.g. for "bindless textures"), the need for the interoperation mechanisms may gradually vanish for most graphics data
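A sketch of such an explicit interoperation API, using OpenCL/OpenGL buffer sharing as the example: the same physical resource is seen once as a GL handle and once as a cl_mem, and the acquire/release pair is where the runtime performs the ownership handover (and any handle-to-address translation). The context is assumed to have been created sharing the GL context, vbo is an existing GL buffer object, and error handling is trimmed.

```c
#include <CL/cl.h>
#include <CL/cl_gl.h>

/* Compute into an OpenGL buffer object through the interop view. */
void compute_into_gl_buffer(cl_context ctx, cl_command_queue queue,
                            cl_kernel kernel, unsigned int vbo, size_t n)
{
    cl_int err;
    cl_mem shared = clCreateFromGLBuffer(ctx, CL_MEM_WRITE_ONLY, vbo, &err);

    clEnqueueAcquireGLObjects(queue, 1, &shared, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &shared);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &shared, 0, NULL, NULL);

    clFinish(queue);                /* hand the result back to the GL side */
    clReleaseMemObject(shared);
}
```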

Page 14

ADDITIONAL CONSIDERATIONS

- A lot of today's PC systems have more than one GPU available to the programmer
  - Almost all of today's CPUs are actually APUs and have both CPU and GPU on chip, using the same memory controller
  - On performance systems, a discrete GPU with dedicated frame buffer memory may be present, too

- The integrated GPU may support cache coherency for system memory updates and is therefore preferable for GPU compute tasks via e.g. DirectCompute or OpenCL™ (see the device-selection sketch below)
  - The performance uplift vs. the CPU may differ, but there is often a >10x factor for vector computations vs. equivalent CPU instructions

- The discrete GPU can focus on graphics workload acceleration, further processing the data pre-processed by either the host CPU or the integrated GPU for further uplift
  - Dedicated transfers from/to the discrete GPU frame buffer
  - For appropriate compute workloads, consider the additional performance uplift through compute on the discrete GPU

- The controls may live in a driver as part of collaborative rendering (e.g. AMD Dual Graphics), where the compute processing on the integrated GPU via appropriate APIs interoperates with the "graphics" device
  - The graphics driver operates the integrated and discrete GPU in a "CrossFire" mode
  - Whereas the compute work runs on a DirectCompute or OpenCL™ "device" on the integrated GPU
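A sketch of how an application might pick the integrated GPU for compute while leaving the discrete GPU to graphics: among the platform's GPU devices, prefer one reporting host-unified memory. CL_DEVICE_HOST_UNIFIED_MEMORY is a standard OpenCL 1.1 query; the selection policy itself is an illustrative assumption.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Prefer a GPU that shares the memory controller with the CPU (an APU's
   integrated GPU); fall back to any GPU, e.g. a discrete card. */
cl_device_id pick_compute_gpu(cl_platform_id platform)
{
    cl_device_id devs[8];
    cl_uint n = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devs, &n);

    for (cl_uint i = 0; i < n; ++i) {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(devs[i], CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);
        if (unified)
            return devs[i];        /* integrated GPU: coherent system memory */
    }
    return n > 0 ? devs[0] : NULL;
}
```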

Page 15

SUMMARY

- HSA and hUMA substantially simplify data exchange between GPU and CPU, allowing the data to be processed on both sides
  - They benefit from a flat address model where data pointer references to content can be resolved on either side
  - This works best for compute-heavy workloads where frequent data updates and result retrieval are important

- There are still benefits to keeping some graphics data in a "buffered" address mode through graphics APIs
  - This leverages "specialty caches", the discrete GPU, and storage within the GPU that is optimized for graphics data but makes it "less accessible" for CPU host access

- With appropriate, efficient interoperation between the "buffered" and the "flat" resource views on the GPU, the application can easily traverse between these two data representations
  - An HSA-compliant GPU allows for a very efficient translation between these two representations
  - Current compute and graphics APIs can be supported in this scheme
  - With native support for a "flat model" in upcoming modern OSes, direct, "flat", cache-coherent references to memory resources will become easier to use directly over time, reducing the need for explicit translation

- Take advantage of all the GPUs and all the memory you find on a system!
  - There's often more than one, and all have their advantages

Page 16

WHERE TO FIND MORE INFORMATION

- AMD Accelerated Parallel Processing (APP) SDK:
  - http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
  - The AMD APP SDK is a complete development platform, providing samples, documentation and other materials to quickly get you started using OpenCL™, Bolt (an open-source C++ template library for GPU parallel processing), C++ AMP or Aparapi for Java applications

- AMD CodeXL:
  - http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
  - A powerful tools suite for Windows® and Linux® heterogeneous application debugging and profiling
  - Works standalone and, e.g., integrated as a Visual Studio extension

- AMD Developer Central: http://developer.amd.com
  - Docs, whitepapers, tools; everything you want to know and need to write performant programs on heterogeneous systems
  - It's not about either CPU or GPU, it's about both…

THIS PRESENTATION IS ONLY A START…

Page 17

GO AHEAD ☺

Page 18

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

 

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc., Linux is a trademark of Linus Torvalds, and Microsoft is a trademark of Microsoft Corp. PCI Express is a trademark of the PCI SIG. Other names are for informational purposes only and may be trademarks of their respective owners.