
THE PROGRAMMER’S GUIDE TO THE APU GALAXY

Phil Rogers, Corporate Fellow, AMD


2 | The Programmer’s Guide to the APU Galaxy | June 2011

THE OPPORTUNITY WE ARE SEIZING

Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.


OUTLINE

The APU today and its programming environment

The future of the heterogeneous platform

AMD Fusion System Architecture

Roadmap

Software evolution

A visual view of the new command and data flow


APU: ACCELERATED PROCESSING UNIT

The APU has arrived and it is a great advance over previous platforms

Combines scalar processing on the CPU with parallel processing on the GPU and high-bandwidth access to memory

How do we make it even better going forward?

– Easier to program

– Easier to optimize

– Easier to load balance

– Higher performance

– Lower power


LOW POWER E-SERIES AMD FUSION APU: “ZACATE”

2 x86 Bobcat CPU cores

Array of Radeon™ Cores

Discrete-class DirectX® 11 performance

80 Stream Processors

3rd Generation Unified Video Decoder

PCIe® Gen2

Single-channel DDR3 @ 1066

18W TDP

E-Series APU

Performance:

Up to 8.5GB/s System Memory Bandwidth

Up to 90 Gflop of Single Precision Compute


TABLET Z-SERIES AMD FUSION APU: “DESNA”

2 x86 “Bobcat” CPU cores

Array of Radeon™ Cores

Discrete-class DirectX® 11 performance

80 Stream Processors

3rd Generation Unified Video Decoder

PCIe® Gen2

Single-channel DDR3 @ 1066

6W TDP w/ Local Hardware Thermal Control

Z-Series APU

Performance:

Up to 8.5GB/s System Memory Bandwidth

Suitable for sealed, passively cooled designs


MAINSTREAM A-SERIES AMD FUSION APU: “LLANO”

Up to four x86 CPU cores

AMD Turbo CORE frequency acceleration

Array of Radeon™ Cores

Discrete-class DirectX® 11 performance

3rd Generation Unified Video Decoder

Blu-ray 3D stereoscopic display

PCIe® Gen2

Dual-channel DDR3

45W TDP

A-Series APU

Performance:

Up to 29GB/s System Memory Bandwidth

Up to 500 Gflops of Single Precision Compute
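The system memory bandwidth figures quoted for the E-Series, Z-Series, and A-Series parts can be sanity-checked with the usual peak-bandwidth arithmetic: transfer rate times channel width times channel count. The sketch below assumes a 64-bit (8-byte) DDR3 channel. The helper name `peak_bandwidth_gbs` is made up for illustration, and the DDR3-1866 speed used for the A-Series is an assumption, since the slide only says "Dual-channel DDR3".

```cpp
// Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).
//   mega_transfers_per_sec: data rate in MT/s (1066 for DDR3-1066)
//   bus_bytes:              channel width in bytes (8 for a 64-bit channel)
//   channels:               number of memory channels
double peak_bandwidth_gbs(double mega_transfers_per_sec, int bus_bytes, int channels) {
    return mega_transfers_per_sec * 1e6 * bus_bytes * channels / 1e9;
}
```

With these assumptions, `peak_bandwidth_gbs(1066, 8, 1)` gives about 8.5 GB/s (single-channel DDR3-1066), and `peak_bandwidth_gbs(1866, 8, 2)` gives just under 30 GB/s, in line with the "up to 29GB/s" quoted for the A-Series.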


COMMITTED TO OPEN STANDARDS

AMD drives open and de-facto standards

– Compete on the best implementation

Open standards are the basis for large ecosystems

Open standards always win over time

– SW developers want their applications to run on multiple platforms from multiple hardware vendors

DirectX®


A NEW ERA OF PROCESSOR PERFORMANCE

[Figure: three performance-over-time charts, one per era, each marked “we are here”]

Single-Core Era

– Chart: single-thread performance vs. time

– Enabled by: Moore’s Law, voltage scaling

– Constrained by: power, complexity

– Programming: Assembly, C/C++, Java …

Multi-Core Era

– Chart: throughput performance vs. time (# of processors)

– Enabled by: Moore’s Law, SMP architecture

– Constrained by: power, parallel SW, scalability

– Programming: pthreads, OpenMP / TBB …

Heterogeneous Systems Era

– Chart: modern application performance vs. time (data-parallel exploitation)

– Enabled by: abundant data parallelism, power efficient GPUs

– Temporarily constrained by: programming models, comm. overhead

– Programming: Shader, CUDA, OpenCL !!!


EVOLUTION OF HETEROGENEOUS COMPUTING

[Figure: three eras plotted against architecture maturity and programmer accessibility, rising from “Poor” to “Excellent”]

Proprietary Drivers Era (2002 - 2008)

– Graphics & proprietary driver-based APIs

– “Adventurous” programmers

– Exploit early programmable “shader cores” in the GPU

– Make your program look like “graphics” to the GPU

– CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011)

– OpenCL™, DirectCompute driver-based APIs

– Expert programmers

– C and C++ subsets

– Compute centric APIs, data types

– Multiple address spaces with explicit data movement

– Specialized work queue based structures

– Kernel mode dispatch

Architected Era (2012 - 2020)

– AMD Fusion System Architecture, GPU peer processor

– Mainstream programmers

– Full C++

– GPU as a co-processor

– Unified coherent address space

– Task parallel runtimes

– Nested data parallel programs

– User mode dispatch

– Pre-emption and context switching

See Herb Sutter’s keynote tomorrow for a cool example of plans for the architected era!


FSA FEATURE ROADMAP

Physical Integration

– Integrate CPU & GPU in silicon

– Unified Memory Controller

– Common Manufacturing Technology

Optimized Platforms

– Bi-Directional Power Mgmt between CPU and GPU

– GPU Compute C++ support

– User mode scheduling

Architectural Integration

– Unified Address Space for CPU and GPU

– Fully coherent memory between CPU & GPU

– GPU uses pageable system memory via CPU pointers

System Integration

– GPU compute context switch

– GPU graphics pre-emption

– Quality of Service

– Extend to Discrete GPU


FUSION SYSTEM ARCHITECTURE – AN OPEN PLATFORM

Open Architecture, published specifications

– FSAIL virtual ISA

– FSA memory model

– FSA dispatch

ISA agnostic for both CPU and GPU

Inviting partners to join us, in all areas

– Hardware companies

– Operating Systems

– Tools and Middleware

– Applications

FSA review committee planned


FSA INTERMEDIATE LAYER - FSAIL

FSAIL is a virtual ISA for parallel programs

– Finalized to ISA by a JIT compiler or “Finalizer”

Explicitly parallel

– Designed for data parallel programming

Support for exceptions, virtual functions, and other high level language features

Syscall methods

– GPU code can call directly to system services, IO, printf, etc.

Debugging support


FSA MEMORY MODEL

Designed to be compatible with C++0x, Java and .NET Memory Models

Relaxed consistency memory model for parallel compute performance

Loads and stores can be re-ordered by the finalizer

Visibility controlled by:

– Load.Acquire, Store.Release

– Fences

– Barriers
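Since the FSA memory model is described as compatible with C++0x, the acquire/release pairing can be illustrated with standard C++11 atomics. This is a minimal sketch on CPU threads only. It is an analogue, not FSA code: the slide does not show FSAIL syntax, and the function name `publish_and_read` is made up for this example.

```cpp
#include <atomic>
#include <thread>

// Publishes a plain (non-atomic) value from one thread to another.
// The release store and acquire load on `ready` control visibility.
int publish_and_read(int value) {
    int payload = 0;                  // ordinary data, may be re-ordered freely
    std::atomic<bool> ready{false};   // synchronization flag
    int observed = -1;

    std::thread producer([&] {
        payload = value;                                  // plain store
        ready.store(true, std::memory_order_release);     // Store.Release analogue
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // Load.Acquire analogue
            std::this_thread::yield();
        }
        observed = payload;  // the acquire load makes the producer's store visible
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Without the release/acquire pair, a relaxed model would allow the store to `payload` to become visible after the flag, so the consumer could read a stale value.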


[Figure: today’s driver stack and the FSA software stack side by side, both running on the same hardware (APUs, CPUs, GPUs)]

Driver Stack

– Apps

– Domain Libraries

– OpenCL™ 1.x, DX Runtimes, User Mode Drivers

– Graphics Kernel Mode Driver

FSA Software Stack

– Apps

– Task Queuing Libraries, FSA Domain Libraries

– FSA Runtime, FSA JIT

– FSA Kernel Mode Driver

Legend: AMD user mode component; AMD kernel mode component; all others contributed by third parties or AMD


OPENCL™ AND FSA

FSA is an optimized platform architecture for OpenCL™

– Not an alternative to OpenCL™

OpenCL™ on FSA will benefit from

– Avoidance of wasteful copies

– Low latency dispatch

– Improved memory model

– Pointers shared between CPU and GPU

FSA also exposes a lower level programming interface, for those that want the ultimate in control and performance

– Optimized libraries may choose the lower level interface


TASK QUEUING RUNTIMES

Popular pattern for task and data parallel programming on SMP systems today

Characterized by:

– A work queue per core

– Runtime library that divides large loops into tasks and distributes to queues

– A work stealing runtime that keeps the system balanced

FSA is designed to extend this pattern to run on heterogeneous systems
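The pattern described here (a queue per worker, a large loop split into tasks, and idle workers stealing from busy ones) can be sketched in standard C++ on CPU threads. This is an illustrative toy, not the FSA runtime: the names `Task`, `WorkQueue`, and `parallel_sum` are invented, the queues use a simple mutex rather than a lock-free deque, and the GPU queue that FSA adds is not modeled. It assumes `workers >= 1` and `chunk >= 1`.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// One chunk of a large loop: the half-open index range [begin, end).
struct Task { std::size_t begin, end; };

// Per-worker queue: the owner pops from the back, thieves steal from the front.
class WorkQueue {
public:
    void push(Task t) { std::lock_guard<std::mutex> l(m_); q_.push_back(t); }
    bool pop(Task& out) {
        std::lock_guard<std::mutex> l(m_);
        if (q_.empty()) return false;
        out = q_.back(); q_.pop_back(); return true;
    }
    bool steal(Task& out) {
        std::lock_guard<std::mutex> l(m_);
        if (q_.empty()) return false;
        out = q_.front(); q_.pop_front(); return true;
    }
private:
    std::mutex m_;
    std::deque<Task> q_;
};

// Sums `data` by splitting the loop into chunks. All chunks start on worker 0's
// queue, so the other workers only make progress by stealing.
long long parallel_sum(const std::vector<int>& data, unsigned workers, std::size_t chunk) {
    std::vector<WorkQueue> queues(workers);
    for (std::size_t b = 0; b < data.size(); b += chunk)
        queues[0].push({b, std::min(b + chunk, data.size())});

    std::atomic<long long> total{0};
    auto run = [&](unsigned self) {
        Task t;
        for (;;) {
            bool got = queues[self].pop(t);               // own queue first
            for (unsigned v = 0; !got && v < workers; ++v)
                if (v != self) got = queues[v].steal(t);  // then try to steal
            if (!got) return;  // all queues empty; tasks never spawn new tasks
            long long local = 0;
            for (std::size_t i = t.begin; i < t.end; ++i) local += data[i];
            total += local;
        }
    };

    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) threads.emplace_back(run, w);
    for (auto& th : threads) th.join();
    return total.load();
}
```

Seeding every task onto one queue is deliberately unbalanced: it shows that stealing alone is enough to spread the work. A production runtime would distribute tasks up front and steal only to correct imbalance.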


TASK QUEUING RUNTIME ON CPUS

[Figure: a work stealing runtime with four CPU worker threads, each with its own queue (Q), running on four x86 CPU cores. Legend: CPU threads, GPU threads, memory.]


TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Figure: the work stealing runtime with four CPU worker threads and their queues on four x86 CPU cores, plus a GPU manager thread with its own queue (Q) serving a Radeon™ GPU. Legend: CPU threads, GPU threads, memory.]


TASK QUEUING RUNTIME ON THE FSA PLATFORM

[Figure: the work stealing runtime with four CPU worker queues on four x86 CPU cores and a GPU manager queue. The GPU is shown as a fetch and dispatch unit feeding five SIMD units, all sharing memory. Legend: CPU threads, GPU threads, memory.]


FSA SOFTWARE EXAMPLE - REDUCTION

float foo(float);
float myArray[…];

// Each task instance sums foo() over its slice of the index range.
Task<float, ReductionBin> task([myArray](IndexRange<1> index) [[device]] {
    float sum = 0.;
    for (size_t i = index.begin(); i != index.end(); i++) {
        sum += foo(myArray[i]);
    }
    return sum;
});

// Partition 1920 elements automatically, then combine the partial sums.
float result = task.enqueueWithReduce(Partition<1, Auto>(1920),
    [](float x, float y) [[device]] { return x + y; }, 0.);
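The slide's example relies on FSA types (`Task`, `IndexRange`, `Partition`) whose definitions are not shown. The same partition-then-combine reduction can be sketched with standard C++ on CPU threads. This is an analogue under stated assumptions, not the FSA API: `reduce_foo` is an invented name, `foo` is given a placeholder body because the slide only declares it, and `parts` is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Placeholder body; the slide only declares foo(float).
float foo(float x) { return x * 2.0f; }

// Splits `data` into up to `parts` slices, sums foo() over each slice in its
// own asynchronous task, then combines the partial sums.
float reduce_foo(const std::vector<float>& data, std::size_t parts) {
    const std::size_t n = data.size();
    const std::size_t step = (n + parts - 1) / parts;  // ceil(n / parts)

    std::vector<std::future<float>> partials;
    for (std::size_t b = 0; b < n; b += step) {
        const std::size_t e = std::min(b + step, n);
        partials.push_back(std::async(std::launch::async, [&data, b, e] {
            float sum = 0.0f;
            for (std::size_t i = b; i != e; ++i) sum += foo(data[i]);
            return sum;
        }));
    }

    float result = 0.0f;  // identity element for addition
    for (auto& p : partials) result += p.get();
    return result;
}
```

For example, `reduce_foo(std::vector<float>(1920, 1.0f), 4)` splits the 1920 elements into four slices of 480. The per-slice loop corresponds to the task body on the slide, and the final loop plays the role of the combining lambda passed to `enqueueWithReduce`.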


HETEROGENEOUS COMPUTE DISPATCH

How compute dispatch operates today in the driver model

How compute dispatch improves tomorrow under FSA


TODAY’S COMMAND AND DISPATCH FLOW

[Figure, built up over four slides: work from each application (A, B, C) passes through Direct3D, a user mode driver, a command buffer, a soft queue, a kernel mode driver, and a DMA buffer. All applications then share a single hardware queue in front of the GPU hardware, where their work is interleaved. Arrows distinguish command flow from data flow.]


FUTURE COMMAND AND DISPATCH FLOW

[Figure: Applications A, B, and C each have their own hardware queue, with an optional dispatch buffer, and the GPU hardware takes work directly from those queues.]

No APIs

No Soft Queues

No User Mode Drivers

No Kernel Mode Transitions

No Overhead!

Application codes to the hardware

User mode queuing

Hardware scheduling

Low dispatch times
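User mode queuing means the application enqueues work by writing to memory it already owns, with no driver call or kernel transition on the dispatch path. A single-producer, single-consumer ring buffer gives the flavor. This is a hedged sketch: the packet layout and the names `DispatchPacket` and `UserModeQueue` are hypothetical, the slide does not describe FSA's actual queue format, and `dequeue` merely stands in for the hardware scheduler.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical dispatch packet; the real FSA layout is not given in the source.
struct DispatchPacket {
    std::uint32_t kernel_id;
    std::uint32_t work_items;
};

// Single-producer, single-consumer ring buffer in ordinary user memory.
template <std::size_t N>
class UserModeQueue {
public:
    // Application side: write the packet, then publish it by bumping the index.
    bool enqueue(const DispatchPacket& p) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        if (w - read_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[w % N] = p;
        write_.store(w + 1, std::memory_order_release);
        return true;
    }
    // Stand-in for the hardware scheduler fetching the next packet.
    bool dequeue(DispatchPacket& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) return false;     // empty
        out = slots_[r % N];
        read_.store(r + 1, std::memory_order_release);  // free the slot
        return true;
    }
private:
    std::array<DispatchPacket, N> slots_{};
    std::atomic<std::size_t> write_{0};
    std::atomic<std::size_t> read_{0};
};
```

The enqueue path is a couple of loads, one copy, and one store, which is why dispatch time can be low when the consumer is hardware reading the same memory. Giving each application its own queue instance mirrors the per-application hardware queues in the diagram.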


FUTURE COMMAND AND DISPATCH CPU <-> GPU

[Figure, repeated over four slides: an application / runtime layer above CPU1, CPU2, and the GPU]


WHERE ARE WE TAKING YOU?

Switch the compute, don’t move the data!

Platform Design Goals

Every processor now has serial and parallel cores

All cores capable, with performance differences

Simple and efficient program model

Easy support of massive data sets

Support for task based programming models

Solutions for all platforms

Open to all


THE FUTURE OF HETEROGENEOUS COMPUTING

The architectural path for the future is clear

– Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world

– An open architecture, with published specifications and an open source execution software stack

– Heterogeneous cores working together seamlessly in coherent memory

– Low latency dispatch

– No software fault lines
