© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 1
ENABLING EFFICIENT HETEROGENEOUS
PROCESSING THROUGH COHERENCY: AN
HSA FOUNDATION UPDATE EMBEDDED VISION ALLIANCE MEMBER MEETING
DR. JOHN GLOSSNER, PRESIDENT, HSA FOUNDATION / CEO GPT-US
HARMONIZING THE INDUSTRY AROUND
HETEROGENEOUS COMPUTING
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 2
AGENDA
Heterogeneous Programming
Problem
About HSA Founding
Member Companies
Open / Royalty Free Solutions
HSA Solution Hardware
Software Infrastructure
HSAIL
Portable Applications
Programming C/C++, Python, OpenCL
Performance Results AMD
Products and Announcements AMD, GPT, Imagination, MediaTek
What’s Next
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 3
THE PROBLEM HETEROGENEOUS APPLICATION DEVELOPMENT
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 4
WHAT IS A HETEROGENEOUS SYSTEM?
A CPU+ System +GPU
+Vision Processors
+DSP
+FPGA
+Accelerators
Typically Different development tools
Different memory spaces
Communication via I/O only (data copies)
Unified Coherent Memory
CPU
1
CPU
N …
CPU
2
GPU
1
GPU
2
GPU
3
GPU
M DSP ACC …
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 5
WHAT’S THE PROBLEM?
Heterogeneous processors are
widely available
Huge compute capability Acceleration Units (GPU, DSP, FPGA)
CPU Cluster-based computer
Coherency Established in high-end
Migrating to mainstream mobile and
consumer
BUT…
Heterogeneous programming
models not standardized
Multi-core/device applications
difficult to optimize or scale
Non-portable application
developer ecosystems
HSAF brings compute app abstraction to heterogeneous platforms
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 6
HSA TECHNOLOGY
Developing a new platform for heterogeneous systems Reducing Heterogeneous System Complexity
Provides software ecosystem
Abstracts away complexities of heterogeneous systems Cache coherent shared virtual memory hardware
Removes time consuming operating system calls
Runs at user level
Exploiting Compute Capabilities Single source programming
Control and compute code reside in the same file or project
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 7
ABOUT HSA HETEROGENEOUS SYSTEM ARCHITECTURE FOUNDATION
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 8
HSA FOUNDATION
A Non-Profit Foundation Founded in June 2012 Programming heterogeneous systems (“CPU +” era)
Industry standards body V1.1 Specifications released May 2016
Backward compatibility with V1.0 hardware
First compatible hardware AMD
Measured Performance Improvements
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 9
HSA – AN OPEN PLATFORM
Open Architecture, membership open to all HSA Programmers Reference Manual
HSA Platform System Architecture
HSA Runtime
HSA Multivendor Specification
Royalty Free IP, Specifications, and APIs
Open Source Tools, Compilers, etc.
Runtime implementations
Tests
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 10
MEMBERS DRIVING HSA Founders
Promoters
Supporters
Contributors
Academic
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 11
HSAF HARDWARE CONTRIBUTIONS
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 12
JIM MCGREGOR, TIRIAS RESEARCH
…HSAF has had a profound impact on hardware
architectures
… even Intel‘s Cache
Coherency
Unified memory
(Shared Virtual Memory)
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 13
THE PLATFORM PILLARS OF HSA
Unified memory
(SVM)
User mode dispatch
Platform atomics
Architected
Signals
Formal
Relaxed
Memory Model
Cache
Coherency
Quality
Of
Service Some non-HSA platforms support a few
of these platform features
In combination they form a well-rounded
base for application programmability
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 14
HSAF SOFTWARE INFRASTRUCTURE
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 15
THE VISION
Make Heterogeneous Programming Much Easier
Single source programming 1
Any programming language 2
Eliminate data copies 3
Common address space 4
Standardized command submission to Agents (GPU / DSP) 5
Eliminate software layers between application and hardware 6
ISA agnostic for CPU, GPU, DSP, and more 7
Open source software stack 8
Single tool chain
C++, Python, JavaScript, …
Performance!
A pointer is a pointer
A common dispatch language
Efficient
x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, …
Open Access!
High performance
Low power
Extensible to other accelerators on the SoC
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 16
MOTIVATION (TODAY’S PICTURE)
Application OS
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job Start Job
Finish Job
Schedule
Application Get Buffer
Copy/Map
Memory
Agent GPU/DSP
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 17
WITH SHARED VIRTUAL MEMORY
Application OS
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job Start Job
Finish Job
Schedule
Application Get Buffer
Copy/Map
Memory
Agent GPU/DSP
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 18
WITH COHERENT CACHE MEMORY
Application OS
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job Start Job
Finish Job
Schedule
Application Get Buffer
Copy/Map
Memory
Agent GPU/DSP
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 19
SIGNALS
HSA agents support signaling creation/destruction using runtime APIs
Any Agent can access signals Wake up agents waiting upon the object
Query/Wait for current object
Allows conditions
Hardware-assisted signaling and
synchronization primitives Memory semantics synchronizes work
items processed by HSA agents
Synchronizes execution between threads
on HSA agents and host CPU
One-to-one and one-to-many
signaling System Software, runtime & application SW
use infrastructure to build higher-level
synchronization primitives like mutexes,
semaphores, …
Advantages Asynchronous events between agents
Doesn’t require CPU
Common idiom for work offload
Low power waiting
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 20
WITH SIGNALING
Application OS
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job Start Job
Finish Job
Schedule
Application Get Buffer
Copy/Map
Memory
Agent GPU/DSP
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 21
HSA QUEUING MODEL
User mode queuing Low latency dispatch
Application dispatches directly
No OS or driver required
Architected Queuing Layer (AQL) Single compute dispatch path for all hardware
No driver translation, direct to hardware
Standard across vendors!
Guaranteed backward compatibility
Allows for dispatch to queue from any agent CPU or GPU or DSP or FPGA, etc.
Agent self enqueue enables Recursion, Tree traversal, Wavefront reforming
Requires coherency and
shared virtual memory
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 22
WITH USER MODE QUEUING
Application CPU OS Agent GPU/DSP
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job Start Job
Finish Job
Schedule
Application Get Buffer
Copy/Map
Memory
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 23
FINAL PICTURE: SVM + CACHE COHERENCY +
SIGNALS + USER MODE QUEUES
Application OS Agent GPU/DSP
Queue Job
Start Job
Finish Job
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 24
HSA COMMAND AND DISPATCH FLOW
Application
A
Application
B
Application
C
Optional
Dispatch Buffer
Agent
HARDWARE
Hardware Queue
A
A A
Hardware Queue
B
B B
Hardware Queue
C
C C
C
C
HW view:
HW / microcode controlled
HW scheduling
Architected Queuing Language (AQL)
HW-managed protection
SW view:
User-mode dispatches to HW
No OS Driver overhead
Low dispatch times
Host & Kernel Agent dispatch APIs
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 25
HSA INTERMEDIATE LANGUAGE (HSAIL) BYTECODE FOR HETEROGENEOUS SYSTEMS
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 26
THE PORTABILITY CHALLENGE
CPU ISAs – Backwards Compatible ISA innovations added incrementally (ie NEON, AVX, etc)
ISA retains backwards-compatibility with previous generation
HSA instruction-set architectures: ARM, GPT, MIPS, and x86
Kernel Agent ISAs – No Backwards Compatibility GPU, DSP, DNN, Image Signal Processor, Custom Accelerators, etc.
Massive diversity of architectures in the market Each vendor has own ISA - and often several in market at same time
Compatibility via APIs (OpenGL, DirectX, OpenCV)
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 27
HSA INTERMEDIATE LAYER — HSAIL
Virtual ISA for parallel programs Finalized to native ISA by a compiler
Dynamic or Offline
ISA independent by design
Explicitly parallel Designed for data parallel programming
Multiple HLL Support Exceptions, virtual functions, etc.
Java, C++, OpenMP, C++, Python, etc
main() {
…
#pragma omp parallel for
for (int i=0;i<N; i++) {
}
…
}
High-Level
Compiler BRIG Finalizer Component
ISA
Host ISA
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 28
HSAIL FEATURES
A Virtual Explicitly Parallel ISA ~135 Opcodes
RISC Register-based Load/Store
Arithmetic IEEE 754 Floating Point including 16-bit
Integer (32/64-bit)
DSP fixed point
Packed / SIMD f16x2, f16x4, f16x8, f32x2, f32x4, f64x2
signed/unsigned 8x4, 8x8, 8x16, 16x2, 16x4,
16x8, 32x2, 32x4, 64x2
Branches & Function Calls
Atomic Operations
Wavefronts 1, 2, 4, 8, 16, 32, or 64 SIMD lanes
Lanes can be active or inactive
Memory Shared Virtual Memory
Exceptions
ld_global_u64 $d0, [$d6 + 120] ; $d0= load($d6+120)
add_u64 $d1, $d0, 24 ; $d1= $d2+24
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 29
PORTABLE APPLICATIONS PROGRAMMING FROM OPENCL TO C++17
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 30
HSA OPEN SOURCE SOFTWARE
Full open source Linux stack: tools, compilers and OS support Allows a single shared implementation for many components
Enables university research and industry collaboration in all areas
Because it’s the right thing to do
Many open source applications & frameworks Native Languages: HCC (C++17), LLVM, GCC, CLOC/SNACK, Python, Java, …
Tools, API’s, Frameworks: CodeXL, POCL, Docker, OpenMP, OKRA, HIP, …
Research: Multi2sim, HSAEmu, gem5, ViennaCL, …
And many applications using OCL 2.x or HSA stack
Github & Bitbucket repositories have much, much more…
gccbrig Any processor with a gcc machine description can finalize HSAIL
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 31
ARCHITECTED PROFILING AND DEBUGGING
Profiling
• Common timeline across HSA accelerators & system
• Common HSA hardware events (+ HW specific)
• Common HSA profiling counter definitions (+ HW specific counters)
• Consistent profiling methodology for all HSA accelerators
Debugging
• Breakpoints
• Exception handling
• Single-step
• Tracing
• HSAIL Disassembly
• Emulation support
• Libraries
• Plugins
Python on GPU’s
Numba: NumPy aware python compiler Open source. Avail on Github
Sponsored by Continuum Analytics
Direct HSA Support
Automatic Parallelization 2x-200x speedup
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 33
PERFORMANCE RESULTS
Python Geographic Locality
What is the distance from a set of points to a target point How many points are within a specified
range
Numba can auto-parallelize user universal functions for HSA Ufunc’s broadcast operation over
elements of a NumPy array ZERO HSA developer knowledge
required
1M Points >8X speedup
https://github.com/ContinuumIO/Numba-HSA-Webinar
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 35
GEN1: FIR & AES
FIR is a memory-intensive streaming workload
AES is a compute-intensive streaming workload
CL12 – cl_mem buffer Copy to/from the device
CL20 – SVM buffer – Coarse Grain Sync Copy to/from SVM Data copy cannot be avoided, since the space for
SVM is limited
HSA – Unified Memory Space – Fine Grained Sync Regular pointer No explicit copy
Results HSA compute abstraction NO performance penalty
Note: Not all algorithms run faster Benchmark: NUCAR HeteroMark
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 36
HSA PRODUCT UPDATES - AMD FROM HSA FOUNDATION MEMBER COMPANIES
37
Heterogeneous System Architecture Is At The Core of ROCm Rich Foundation for HPC and Ultrascale Computing support our APU’s and Discreet GPU’s
HSA Drives rich capabilities into the ROCm
Systems Architecture ‒ User Mode Queues
‒ Architected Queuing Language
‒ Flat memory Addressing
‒ Atomic Memory Transactions
‒ Process Concurrency & Preemption
HSA Runtime enables a programming language
neutral systems interface Supports standardized loader and linker interface
ROCm: Radeon
Open Compute
Platform
38
ROCm Enabled Hardware 2016
S9150 W9100
RADEON R9 Nano S9300x2 RADEON RX480 ( Oct ROCm 1.3)
S9170
AMD Proprietary and Confidential August 2016
AMD Embedded R-Series SOC
AMD FX 98xx, A12-97xx
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 39
HXGPT ANNOUNCEMENT UNITY处理器架构
Working silicon for Unity Architecture
Focus on out-of-order pipeline Superscalar
Control flow
See our demo Image Processing Filter
Before After
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 40
HXGPT ANNOUNCEMENT
基于HSAIL的深度学习-神经网络开源计划
Open Source Machine Learning HSAIL library Deep Neural Network Library
Delivered in HSAIL
Any HSA-Compatible platform can execute
Optimized for hxGPT Using gccbrig
Development now underway
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 41
IMAGINATION HSA COMPLIANT IP (COMING SOON)
We will be rolling out:
• HSA across all
MIPS I-class and
P-class CPUs
• HSA across all
PowerVR GPUs
• HSA compliant
fabric solutions
Coherent HSA-compliant SoC fabric
PowerVR Video Encode
PowerVR Camera ISP
PowerVR Video Decode
ROM
Peripheral Bus
DD
R3
/4
Bridge
RAM
PowerVR GX7200 Series6XT 2 cluster
PowerVR GPU
HSA-compliant
eF
use
DMAC
Clock &
Reset
Control
JTAG
& Test
PSU &
Power
Control
TE &
Crypto
L2 cache
PowerVR GX7200
Series6XT 2 cluster
MIPS CPU
HSA-compliant
Display Pipeline
PowerVR JPEG Encode
OTP
Ensigma RPU
AFE
Customer IP
HDMI
Tx & Rx USB3 MIPI NAND
Peripherals
GPIO; UART; I2C; I2S; SPI; SD
Customer IP
Customer IP
& interfaces
Imagination Smart Vision IP Platform
Copyright © MediaTek Inc. All rights reserved.
HMP – 2013 Heterogeneous Multi-
Processing
HC – 2015 Heterogeneous
Computing
Tri-cluster 2016
Hybrid Tri-cluster Multi-Processing
HSA Features Heterogeneous
System Architecture
LITTLE CPUs
BIG CPUs
LITTLE CPUs
BIG CPUs
GPU GPU
Accelerators
Co
heren
t Mem
ory
MM
U
Evolution of Heterogeneity at MediaTek
Min CPUs
Max CPUs
GPU
Mid CPUs
Min CPUs
Max CPUs
Mid CPUs
42
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 43
CONCLUSIONS
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 44
THE RESULT
Make Heterogeneous Programming Much Easier
Single source programming 1
Any programming language 2
Eliminate data copies 3
Common address space 4
Standardized command submission to Agents (GPU / DSP) 5
Eliminate software layers between application and hardware 6
ISA agnostic for CPU, GPU, DSP, and more 7
Open source software stack 8
Single tool chain
C++, Python, JavaScript, …
Performance!
A pointer is a pointer
A common dispatch language
Efficient
x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, …
Open Access!
High performance
Low power
Extensible to other accelerators on the SoC
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 45
SUMMARY
2012 goal of changing chip H/W architecture achieved Cache coherent shared virtual memory
2014-2015 S/W architecture to support H/W March 2015 V1.0 specs
Programmed in any language (C++, Python, OpenCL)
2015-2016 May 2016 v1.1 specs
Multivendor support
Wider range of processors
2016 H/W platforms arriving AMD’s Carrizo (Dell, Asus, Lenovo)
Licensable IP available
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 46
V1.2 SPECIFICATIONS IN PROGRESS
Improved Data Interop
Fixed function accelerators
(e.g. FPGA)
Local device memory
Coarse grain
memory Architected Debug
BRIG, new linking formats
Architecture Fully formalized memory model
HSAIL Parallel loops
Flexible API and access semantics
Programming
Models
© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 47
JOIN US! WWW.HSAFOUNDATION.COM
THANK YOU