HETEROGENEOUS SYSTEM ARCHITECTURE: FROM THE HPC USAGE PERSPECTIVE
Haibo Xie, Ph.D. | Chief HSA Evangelist | AMD China



AGENDA:

– GPGPU in HPC: what are the challenges?
– Introducing Heterogeneous System Architecture (HSA)
– How HSA benefits GPGPU in HPC usage
– Taking HSA to the industry

HPC China 2012 | HSA: from the HPC usage perspective | Oct. 30, 2012

GPU IN HPC – WHAT ARE THE CHALLENGES?

– Massively parallel processing?
– Finding parallelism?
– SIMDs / vector arrays?
– Bringing data to computation?
– Refining the algorithm?


THE PROBLEM – WHY IS IT DIFFICULT?

– Not every HPC domain-science programmer can use GPUs: it takes real effort to tailor the algorithm, and even the size of the problem matters
– Code reuse remains an issue (algorithms, programming)
– Data transfer cost: the separate memory spaces of CPU and GPU burden (legacy) programming models
– High software runtime overhead
– Special-purpose devices that lack the necessary tools (hardware, tool-chain)


BUT…

– The US Department of Energy's 20 MW expectation
– Getting performance is still a problem in general-purpose HPC
– Hybrid computing has become a common term; heterogeneity is now the norm
– An ExaScale system is probably going to end up being an optimization problem to solve
– Several efforts are still targeted at utilizing GPUs in HPC


RE-THINKING CPU+dGPU


CHANGING THE THINKING


INTRODUCING HETEROGENEOUS SYSTEM ARCHITECTURE Brings All the Processors in a System into Unified Coherent Memory



POWER EFFICIENT

EASY TO PROGRAM

FUTURE LOOKING

ESTABLISHED TECHNOLOGY FOUNDATION

OPEN STANDARD

INDUSTRY SUPPORT


HSA APU FEATURE ROADMAP

Physical Integration:
– Integrate CPU & GPU in silicon
– Unified memory controller
– Common manufacturing technology

Optimized Platforms:
– Bi-directional power management between CPU and GPU
– GPU Compute C++ support
– User-mode scheduling

Architectural Integration:
– Unified address space for CPU and GPU
– Fully coherent memory between CPU & GPU
– GPU uses pageable system memory via CPU pointers

System Integration:
– GPU compute context switch
– GPU graphics pre-emption
– Quality of service
– Extend to discrete GPU


HSA COMPLIANT FEATURES (Optimized Platforms)

– GPU Compute C++ support: supports the OpenCL C++ directions and Microsoft's upcoming C++ AMP language. This eases programming of both CPU and GPU working together to process parallel workloads.
– User-mode scheduling: drastically reduces the time to dispatch work, requiring no OS kernel transitions or services and minimizing software driver overhead.
– Bi-directional power management between CPU and GPU: enables "power sloshing", where CPU and GPU can dynamically lower or raise their power and performance depending on the activity and which one is more suited to the task at hand.


HSA COMPLIANT FEATURES (Architectural Integration)

– Unified address space for CPU and GPU: provides ease of programming for developers creating applications. On HSA platforms, a pointer is really a pointer; no separate memory pointers for CPU and GPU are required.
– GPU uses pageable system memory via CPU pointers: the GPU can take advantage of the CPU virtual address space. With pageable system memory, the GPU can reference data directly in the CPU domain. In prior architectures, data had to be copied between the two spaces or page-locked prior to use. And there is no GPU memory size limitation!
– Fully coherent memory between CPU & GPU: allows data to be cached by both the CPU and the GPU, and referenced by either. In all previous generations, GPU caches had to be flushed at command-buffer boundaries prior to CPU access. And unlike discrete GPUs, the CPU and GPU in an APU share a high-speed coherent bus.
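The difference between the legacy copy-based model and the HSA shared-memory model can be sketched in a toy simulation (plain Python, with lists standing in for device and system memory; the function names are invented for this example, not part of any HSA API):

```python
def legacy_dispatch(host_data):
    """Legacy CPU+dGPU model: data must be copied into a separate
    device buffer before the kernel runs, and copied back after."""
    device_buffer = list(host_data)                  # host -> device copy
    device_buffer = [x * 2 for x in device_buffer]   # kernel runs on the copy
    return list(device_buffer)                       # device -> host copy

def hsa_dispatch(shared_buffer):
    """HSA model: CPU and GPU share one coherent address space, so the
    kernel works in place on the very buffer the CPU allocated."""
    for i, x in enumerate(shared_buffer):
        shared_buffer[i] = x * 2                     # no copies either way

data = [1, 2, 3]
assert legacy_dispatch(data) == [2, 4, 6] and data == [1, 2, 3]
hsa_dispatch(data)
assert data == [2, 4, 6]   # updated in place, zero copies
```

The two functions compute the same result; what changes is the number of buffer copies, which is exactly the cost the unified address space removes.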


FULL HSA FEATURES (System Integration)

– GPU compute context switch: GPU tasks can be context-switched, making the GPU a multi-tasker. Context switching means faster application, graphics, and compute interoperation; users get a snappier, more interactive experience.
– GPU graphics pre-emption: as more applications enjoy the performance and features of the GPU, it is important that the interactivity of the system stays good. This means low-latency access to the GPU from any process.
– Quality of service: with context switching and pre-emption, time criticality is added to the tasks assigned to the processors. Direct access to the hardware for multiple users or multiple applications is either prioritized or equalized.


HSA SOLUTION STACK

System components:
– Compliant heterogeneous computing hardware
– A software compilation stack
– A user-space runtime system
– Kernel-space system components

Overall vision:
– Make the GPU easily accessible
  - Support mainstream languages, expandable to domain-specific languages
  - Complete GPU tool-chain: programming, debugging, and profiling, just as for the CPU
– Make compute offload efficient
  - Direct path to the GPU (avoid graphics overhead)
  - Eliminate memory copies; low-latency dispatch
– Make it ubiquitous
  - Drive HSA as a standard through the HSA Foundation
  - Open-source key components

[Diagram: HSA solution stack. Applications call domain-specific libraries (Bolt, OpenCV™, and many others), which sit on the OpenCL™, DirectX, and other runtimes alongside the HSA Runtime; the HSA software path lowers kernels through HSAIL and the HSA Finalizer to the GPU ISA, while legacy drivers sit alongside, all on differentiated hardware: CPU(s), GPU(s), and other accelerators.]


[Diagram: today's driver stack vs. the HSA software stack, both running on the hardware (APUs, CPUs, GPUs). Legacy driver stack: apps → domain libraries → OpenCL™ 1.x and DX runtimes with user-mode drivers → graphics kernel-mode driver. HSA software stack: apps → HSA domain libraries and task-queuing libraries → HSA runtime with HSA JIT → HSA kernel-mode driver. AMD supplies the user-mode and kernel-mode components; all others are contributed by third parties or AMD.]


HETEROGENEOUS COMPUTE DISPATCH

– How compute dispatch operates today in the driver model
– How compute dispatch improves tomorrow under HSA


HSA COMMAND AND DISPATCH: CPU <-> GPU

[Diagram: the application/runtime dispatching command packets among CPU1, CPU2, and the GPU.]
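The low-overhead dispatch model sketched above can be illustrated with a toy user-mode queue (plain Python; `UserModeQueue` and its methods are hypothetical names invented for this sketch — real HSA queues hold AQL packets consumed by hardware, not Python callables):

```python
from collections import deque

class UserModeQueue:
    """Toy model of HSA dispatch: enqueuing work is just a memory write
    into a queue the device can see, with no OS-kernel or driver
    transition on the dispatch path."""
    def __init__(self):
        self.packets = deque()

    def dispatch(self, kernel, args):
        # In HSA this would be writing an AQL packet and ringing a doorbell.
        self.packets.append((kernel, args))

    def drain(self):
        # Stand-in for the hardware scheduler consuming queued packets.
        results = []
        while self.packets:
            kernel, args = self.packets.popleft()
            results.append(kernel(*args))
        return results

q = UserModeQueue()
q.dispatch(lambda a, b: a + b, (1, 2))
q.dispatch(sum, ([1, 2, 3],))
assert q.drain() == [3, 6]
```

The point of the sketch is structural: `dispatch` does nothing but append to a queue, which is why the slide can claim dispatch times that require no OS kernel transitions or services.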


HSA INTERMEDIATE LAYER – HSAIL

– HSAIL is a virtual ISA for parallel programs
  - Finalized to the target ISA by a JIT compiler, or "Finalizer"
  - Low-level, for fast JIT compilation
– Explicitly parallel: designed for data-parallel programming
– Support for exceptions, virtual functions, and other high-level language features
– Syscall methods: GPU code can call directly into system services, I/O, printf, etc.
– Debugging support
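The finalizer idea — a portable virtual ISA resolved once, up front, into target operations — can be sketched with a toy instruction set (illustrative Python, not HSAIL syntax; the opcodes and the `finalize` function are invented for this example):

```python
def finalize(program):
    """Toy 'finalizer': resolve every virtual-ISA instruction to a
    native callable once, so running the kernel afterwards is a
    straight-line walk over pre-resolved ops (the fast-JIT property
    HSAIL is designed for)."""
    native_ops = {
        "add": lambda x, k: x + k,
        "mul": lambda x, k: x * k,
    }
    compiled = [(native_ops[op], k) for op, k in program]  # the JIT step

    def kernel(x):
        for fn, k in compiled:   # no name lookup at run time
            x = fn(x, k)
        return x

    return kernel

# "Virtual ISA" for f(x) = (x + 1) * 3: finalized once, run many times.
kernel = finalize([("add", 1), ("mul", 3)])
assert kernel(2) == 9
assert kernel(0) == 3
```

The split mirrors the HSAIL pipeline: the portable program is the interchange format, and all target-specific work happens in the one-time finalization.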


HSA: TAKING THE PLATFORM TO PROGRAMMERS

– Balance between CPU and GPU for performance and power efficiency
– Make GPUs accessible to a wider audience of programmers
  - Programming models close to today's CPU programming models
  - Enabling more advanced language features on the GPU
  - Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on the GPU
  - A kernel can enqueue work to any other device in the system (e.g. GPU→GPU, GPU→CPU), enabling task-graph-style algorithms, ray tracing, etc.
– Complete tool-chain for programming, debugging, and profiling
– HSA provides a compatible architecture across a wide range of programming models and HW implementations.
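Why shared virtual memory matters for pointer-containing structures can be shown with a toy linked list (plain Python; `gpu_sum` merely stands in for a GPU kernel, which under HSA could chase the same pointers the CPU built, with no flattening or copying):

```python
class Node:
    """One cell of a CPU-built linked list."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def gpu_sum(head):
    """Stand-in for a GPU kernel. With a unified coherent address space
    it can traverse the CPU's pointer-linked structure directly; without
    one, the list would first have to be serialized into a flat buffer
    and copied to device memory."""
    total, node = 0, head
    while node is not None:
        total += node.value
        node = node.next
    return total

head = Node(1, Node(2, Node(3)))   # built on the "CPU" side
assert gpu_sum(head) == 6          # consumed by the "GPU" side, same pointers
```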


HSA VALUE FOR GPGPU – EASIER TO PROGRAM

– Cacheable and coherent memory; more data structures can be freely shared
– More programming models supported: OpenCL, C++ AMP, OpenMP
– Single source for all processors on the SoC
– A pointer is a pointer!
– Expressive runtime for rich high-level programming languages: C/C++, Java, Python, C#


HSA VALUE FOR GPGPU – PERFORMANCE AND POWER EFFICIENCY

– Pass a pointer rather than moving data; supports more problems with different dataset sizes
– Reduced kernel-launch time and efficient CPU/GPU communication
– Hardware-managed queues and scheduling allow very low-latency communication between devices, which is good for performance and power efficiency
– Pre-emption and context switching: support for multiple concurrent GPU processes and preemptive multitasking of CPU/GPU resources
– Bi-directional power management between CPU and GPU, and Turbo Core technology, for more power efficiency
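Power sloshing, as described on the compliant-features slide, amounts to splitting one shared power budget by relative load. A deliberately simplified sketch (Python; the even idle split and load-proportional policy are invented for illustration — this is not AMD's actual Turbo Core algorithm):

```python
def slosh_power(total_watts, cpu_load, gpu_load):
    """Shift the shared power budget toward whichever processor has
    more work. Returns (cpu_watts, gpu_watts); splits evenly when
    both sides are idle."""
    demand = cpu_load + gpu_load
    cpu_share = cpu_load / demand if demand else 0.5
    return total_watts * cpu_share, total_watts * (1.0 - cpu_share)

# A GPU-heavy phase sloshes most of a 40 W budget to the GPU...
assert slosh_power(40.0, 1.0, 3.0) == (10.0, 30.0)
# ...and a CPU-only phase sloshes it all back.
assert slosh_power(40.0, 4.0, 0.0) == (40.0, 0.0)
```

The real mechanism works against thermal and electrical limits rather than abstract "load" numbers, but the budget-shifting structure is the same.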

TAKING HSA TO THE INDUSTRY

© Copyright 2012 HSA Foundation. All Rights Reserved.

HSA FOUNDATION INITIAL FOUNDERS

[Slide: founding-company logos, each with its representative: an ARM Fellow and VP of Technology, Media Processing; a Vice President, Marketing; a Senior Director, CTO Office; a Director, Linux Development Center; and a CVP, Heterogeneous Applications and Developer Solutions.]


AMD’S OPEN SOURCE COMMITMENT TO HSA

Component Name               AMD-Specific   Rationale
HSA Bolt Library             No             Enable understanding and debug
OpenCL HSAIL Code Generator  No             Enable research
LLVM Contributions           No             Industry and academic collaboration
HSA Assembler                No             Enable understanding and debug
HSA Runtime                  No             Standardize on a single runtime
HSA Finalizer                Yes            Enable research and debug
HSA Kernel Driver            Yes            For inclusion in Linux distros

We will open-source our Linux execution and compilation stack:
– Jump-start the ecosystem
– Allow a single shared implementation where appropriate
– Enable university research in all areas


THE FUTURE OF HETEROGENEOUS COMPUTING

The architectural path for the future is clear:
– Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world
– An open architecture, with published specifications and an open-source execution software stack
– Heterogeneous cores working together seamlessly in coherent memory
– Low-latency dispatch
– No software fault lines

APU servers will unleash GPGPU power in the HPC domain.


WHERE ARE WE TAKING YOU?

Switch the compute, don't move the data!

Platform design goals:
– Every processor now has serial and parallel cores
– All cores capable, with performance differences
– Simple and efficient programming model
– Easy support of massive data sets
– Support for task-based programming models
– Solutions for all platforms
– Open to all

THANK YOU!

Access HSA:

http://developer.amd.com

http://hc.csdn.net

Haibo Xie:

[email protected]


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

© 2012 Advanced Micro Devices, Inc.