"computer vision powered by heterogeneous system architecture (hsa)," a presentation from...

Copyright © 2014 AMD

Dr. Harris Gasparakis

5/29/2014

Computer Vision Powered by Heterogeneous System Architecture (HSA)


• DEVELOPING EMBEDDED VISION APPLICATIONS: THE PROPRIETARY API LEGACY.

• THE RISE OF GPUS: DENSE DATA PARALLELISM AND CACHE-COHERENT SIMD

• DO WE STILL NEED CPUs?

• THE HETEROGENEOUS FUTURE OF VISION: OPENCL™, HSA

• OPENCL EVOLUTION
  – OPENCL 1.X
  – OPENCL 2.X AND HSA

• THE OPENCL EXECUTION MODEL

• MONSTERS IN THE ORCHESTRA, CES 2014

• PUTTING OPENCL™ 2.0 TO WORK

• OPENCL™ IN OPENCV
  – OPENCV 3.0: THE TRANSPARENT API
  – HOW DOES IT WORK?

• CONCLUDING THOUGHTS

AGENDA


• Choose HW and SW platform
  – A multitude of devices of different capabilities and strengths!
  – A multitude of algorithms of different requirements! Data Parallel (Y/N/M?)
• Highly specialized programmers
• Non-portable programs, with high platform risk

Is Image Processing/Vision your core IP?
  – No: Use somebody’s SDK or end product. This talk is not for you.
  – Yes: the points above apply.

Developing Embedded Vision Applications


[Figure: Sobel]

Just SIMD is NOT good enough for good GPU acceleration, contrary to popular wisdom.

• “GPU is great for SIMD: Single Instruction Multiple Data”
• Image Processing = Dense Data Parallelism
  – Same calculation (e.g. calculate edge strength) for all pixels.
• Let’s call this CC-SIMD: Cache Coherent SIMD
  – Adjacent threads load adjacent data

Also be careful of:
1. Algorithms that are too simple (not enough math per memory transfer): merge kernels
2. Too much complexity per kernel (high register pressure): split kernels

The rise of GPUs: Dense Data Parallelism
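The "adjacent threads load adjacent data" pattern can be illustrated on the CPU. The following is a minimal sketch, not code from the presentation: a plain C++ Sobel edge-strength pass over a row-major 8-bit image, where the inner loop walks x so that consecutive iterations (standing in for adjacent GPU threads) read adjacent bytes. The function name `sobelMagnitude` and the |gx| + |gy| approximation of edge strength are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Edge strength (|gx| + |gy|) of a row-major 8-bit image.
// Border pixels are left at 0. Hypothetical helper, for illustration only.
std::vector<int> sobelMagnitude(const std::vector<std::uint8_t>& img,
                                int width, int height) {
    assert(width >= 3 && height >= 3);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> out(img.size(), 0);
    for (int y = 1; y < height - 1; ++y) {
        const std::uint8_t* up  = &img[(y - 1) * width];
        const std::uint8_t* mid = &img[y * width];
        const std::uint8_t* dn  = &img[(y + 1) * width];
        // x is the fastest-varying index: neighbouring iterations read
        // neighbouring bytes, which is what adjacent GPU threads should do.
        for (int x = 1; x < width - 1; ++x) {
            int gx = (up[x + 1] + 2 * mid[x + 1] + dn[x + 1])
                   - (up[x - 1] + 2 * mid[x - 1] + dn[x - 1]);
            int gy = (dn[x - 1] + 2 * dn[x] + dn[x + 1])
                   - (up[x - 1] + 2 * up[x] + up[x + 1]);
            out[y * width + x] = std::abs(gx) + std::abs(gy);
        }
    }
    return out;
}
```

Swapping the two loops (y innermost) would compute the same result but make neighbouring iterations stride by a full row, which is the access pattern the slide warns against.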


Features

• Extensive set of CPU libraries

• Also, Vision (image understanding) is a “dense to sparse transition”

• Sparsity is not a GPU’s friend.

• OpenCL 2.0 handles this problem far more efficiently (and with less code)

Do we still need CPUs?
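To make the "dense to sparse transition" concrete, here is a minimal sketch under stated assumptions: the dense stage has already produced one strength value per pixel, and the sparse stage keeps only the pixels above a threshold. The names `Point` and `extractKeypoints` are hypothetical and do not come from the presentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { int x, y; };

// Dense input (one strength per pixel) -> sparse list of strong pixels.
// The output length depends on the data, not on the image size.
std::vector<Point> extractKeypoints(const std::vector<int>& strength,
                                    int width, int height, int threshold) {
    assert(strength.size() == static_cast<std::size_t>(width) * height);
    std::vector<Point> keypoints;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (strength[static_cast<std::size_t>(y) * width + x] > threshold) {
                keypoints.push_back({x, y});  // data-dependent append
            }
        }
    }
    return keypoints;
}
```

The data-dependent append is trivial in serial CPU code, while a data-parallel version needs atomics or a prefix-sum compaction to decide where each thread writes.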


[Diagram: CPU, GPU, Audio Processor, ISP (Image Signal Processing), Fixed Function Accelerator, Encode/Decode, DSP, and Other! blocks alongside Shared Memory]

The Heterogeneous Future of Vision

THE RIGHT IP FOR THE RIGHT TASK!

• CPU is great for serial tasks
  – Lower latency
  – Good branching performance
  – Lower throughput
  – Good at Task parallelism
  – Better for Sparse Data Parallelism
• GPU excels at data parallel problems
  – High throughput
  – Possibly High latency
  – Good at “Dense Data Parallelism”
  – Increasingly better at task parallelism (concurrent kernel execution, and OpenCL 2.0 Dynamic parallelism)

An efficient Heterogeneous System Architecture would be optimal (e.g. GFLOPS/$/W)

© Copyright 201 HSA Foundation. All Rights Reserved.

[HSA Foundation members, by tier: Founders, Promoters, Supporters, Contributors, Academic]

http://www.apical.co.uk/

http://www.multicorewareinc.com/index.php


• OpenCL™: Khronos Software API
  – Cross-platform (Windows, Linux, Mac OS, etc.)
  – Multi-vendor (AMD, Apple, IBM, Intel, NVIDIA, etc.), with maturing support
  – Multicore CPU, discrete GPU, integrated GPU (aka APU), DSP, FPGA, etc.
• HSA: Heterogeneous System Architecture, an industry standard specification
  – OpenCL 2.0 introduces HSA features
• Open Source also helps!
  – OpenCV, featuring OpenCL acceleration

Open Standards


OpenCL™ Evolution: Discrete GPU

OpenCL was invented as an open-standard, high-level API for GPU compute, first on discrete graphics cards.

OpenCL abstracts:
• Data management across multiple memory spaces
  – Memory buffers / Images
• Compute Instructions
  – “Kernels”
• Execution on “compute units” (CU)

[Diagram: a CPU with Main memory (Host Memory and pinned PCIe Memory), connected over PCIe™ to a discrete GPU with compute units (CU) and GPU device Memory]


OpenCL™ Evolution: Legacy APU

APU: Physical Integration: CPU and GPU on same die.

OpenCL (1.x) also works on APUs
• Device memory is (part of) main memory, but you still must use memory buffers!

[Diagram: CPU and GPU compute units (CU) on one die, sharing Main memory that is partitioned into Host memory, Device Visible Host Memory, Device memory, and Host Visible Device Memory]


OpenCL™ Evolution: HSA Enabled APU

Unified Coherent Memory enables data sharing across all processors and GPU compute units

OpenCL™ 2.x: No need to use memory buffers, just use data pointers, just like you would do on the CPU.

Unified (Bidirectionally Coherent, pageable) Virtual Memory

[Diagram: CPU and GPU compute units (CU), each side with its own Cache, sharing Physical Memory]


A 360° x 90° immersive gesture-enabled experience, enabled by OpenCL (and OpenCV)

Monsters in the Orchestra, AMD CES 2014


VISION: Dense to sparse transition

Putting OPENCL™ 2.0 to Work

GPU keypoints

Data changed by a kernel can be visible to the CPU before the kernel returns (requires fine grain SVM).

CPU consumes keypoints “as they come”, and updates a “shape model”.
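The "consume keypoints as they come" idea can be modelled in standard C++. This is a sketch under stated assumptions, not the presenter's code: on an HSA system the producer would be a GPU kernel writing into fine grain SVM, whereas here both sides are ordinary C++ functions sharing one array, with a release/acquire counter playing the role of the coherent memory. The names `KeypointStream`, `ShapeModel`, `publish`, and `consumeAvailable` are hypothetical, and the "shape model" is reduced to a running centroid.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

struct Keypoint { float x, y; };

// Fixed-capacity, single-producer/single-consumer stream in shared memory.
struct KeypointStream {
    static constexpr std::size_t kCapacity = 1024;
    std::array<Keypoint, kCapacity> points{};
    std::atomic<std::size_t> published{0};  // number of valid entries

    // Producer side (stands in for the GPU kernel).
    bool publish(Keypoint kp) {
        std::size_t n = published.load(std::memory_order_relaxed);
        if (n == kCapacity) return false;                    // stream is full
        points[n] = kp;                                      // write data first
        published.store(n + 1, std::memory_order_release);   // then publish it
        return true;
    }
};

// Toy "shape model": running centroid of the keypoints seen so far.
struct ShapeModel {
    std::size_t consumed = 0;
    float sumX = 0.0f, sumY = 0.0f;

    // Consumer side (the CPU): fold in whatever has been published so far
    // and return how many new keypoints were picked up.
    std::size_t consumeAvailable(const KeypointStream& s) {
        std::size_t n = s.published.load(std::memory_order_acquire);
        assert(n >= consumed);
        std::size_t fresh = n - consumed;
        for (; consumed < n; ++consumed) {
            sumX += s.points[consumed].x;
            sumY += s.points[consumed].y;
        }
        return fresh;
    }
    float centroidX() const { return consumed ? sumX / consumed : 0.0f; }
    float centroidY() const { return consumed ? sumY / consumed : 0.0f; }
};
```

The consumer never waits for the producer to finish: each call to `consumeAvailable` processes only the entries published since the previous call, which is the behaviour the slide attributes to fine grain SVM.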


Host:
• Device Setup
• Compile Kernels
• Allocate Memory
• Further Processing
• Clean up
• Other Tasks…

Between host and compute device:
• Memory Transfer (discrete GPU)
• Or Zero Copy (APU OpenCL 1.2)
• Or Shared Virtual Memory (APU OpenCL 2.0)

Compute Device:
• Kernel 1
• Kernel 2
  – Kernel 2_1
  – Kernel 2_2
• Kernel

New in OpenCL 2.0: Dynamic parallelism! A kernel can enqueue another kernel.

The OpenCL Execution Model
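The control flow of dynamic parallelism can be sketched with a small work queue in standard C++. This is only a model under stated assumptions: in real OpenCL 2.0 the child launch is done from OpenCL C device code with the `enqueue_kernel` built-in, and execution order is not a simple FIFO. The class `DeviceQueue` and the function `runExample` are hypothetical names; the kernel labels mirror the ones on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy device queue: a "kernel" receives the queue and may enqueue more work.
class DeviceQueue {
public:
    using Kernel = std::function<void(DeviceQueue&)>;

    void enqueue(Kernel k) {
        assert(k);  // reject empty kernels
        pending_.push_back(std::move(k));
    }

    // Run until the queue is empty, including kernels enqueued by kernels.
    // Returns the total number of kernels launched.
    std::size_t run() {
        std::size_t launched = 0;
        while (!pending_.empty()) {
            Kernel k = std::move(pending_.front());
            pending_.pop_front();
            k(*this);
            ++launched;
        }
        return launched;
    }

private:
    std::deque<Kernel> pending_;
};

// The host enqueues Kernel 1 and Kernel 2; Kernel 2 enqueues its own children
// without returning control to the host.
std::vector<std::string> runExample() {
    std::vector<std::string> log;
    DeviceQueue q;
    q.enqueue([&log](DeviceQueue&) { log.push_back("Kernel 1"); });
    q.enqueue([&log](DeviceQueue& dq) {
        log.push_back("Kernel 2");
        dq.enqueue([&log](DeviceQueue&) { log.push_back("Kernel 2_1"); });
        dq.enqueue([&log](DeviceQueue&) { log.push_back("Kernel 2_2"); });
    });
    q.run();
    return log;
}
```

The point of the model is that the host issues two launches but four kernels run, because the follow-up work is decided on the device side.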


Open Computer Vision Library

2,500+ algorithms and functions · Cross-platform · BSD license · High performance · Professionally developed · 7M+ downloads

• OpenCL is fully integrated in OpenCV

• ~100 most commonly used algorithms optimized with OpenCL

• Can be built without the OpenCL SDK installed (dynamic OpenCL runtime loading)

• OpenCL enabled on the official Windows bin pack

• OpenCV pre-commit check includes OpenCL tests

• Very easy to plug in your own kernels using OpenCV plumbing

• In 2.4.x, OpenCL acceleration is a distinct code path

Timeline:
• 2000: First public release
• 2009: v2.0, C++ API
• 2013: v2.4.3, OpenCL™
• ~10/2014: v3.0 alpha, Transparent API


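#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;

// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
Mat frame, frameGray;
vector<Rect> faces;

for(;;){
    vcap >> frame;                               // grab the next frame
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);  // slide shorthand: BGR2GRAY
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);       // results land in "faces"
}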

OCV 2.4: Face detect on CPU

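#include <opencv2/opencv.hpp>
#include <opencv2/ocl/ocl.hpp>
using namespace cv;
using namespace std;

// initialization
VideoCapture vcap(...);
ocl::OclCascadeClassifier fd("haar_ff.xml");
ocl::oclMat frame, frameGray;
Mat frameCpu;
vector<Rect> faces;

for(;;){
    vcap >> frameCpu;                                 // capture into host memory
    frame = frameCpu;                                 // upload to the OpenCL device
    ocl::cvtColor(frame, frameGray, COLOR_BGR2GRAY);  // slide shorthand: BGR2GRAY
    ocl::equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);            // member call, no ocl:: prefix
}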

OCV 2.4: Face detect using OpenCL™

OpenCV 2.4: Similar, but not identical code paths. You will need to write code explicitly for both CPU and OpenCL.

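#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;

// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
UMat frame, frameGray;                           // UMat instead of Mat
vector<Rect> faces;

for(;;){
    vcap >> frame;                               // grab the next frame
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);  // slide shorthand: BGR2GRAY
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);       // results land in "faces"
}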

OCV 3.0: Face detect Anywhere!

This code will run, and configure itself differently on different platforms!

The Need for a Transparent API


UMatData:
• Reference counts
• Dirty bits
• Opaque handles (e.g. clBuffer)
• CPU data
• GPU data
• Handles data synchronization efficiently

UMat: getMat(…) yields a Mat
Mat: getUMat(…) yields a UMat

• Easy transition path from 2.x to 3.x. Code that used to work in 2.x should still work. Therefore, cv::Mat is still around.

Both Mat and UMat are views into UMatData, which does the heavy lifting.

How does OpenCV 3.0 T-API work?
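The "two views over one data block with dirty bits" arrangement can be sketched in standard C++. This is a toy model under stated assumptions and not OpenCV's implementation: the device copy is simulated by a second host vector, a transfer is just a vector assignment, and the names `SharedData`, `HostView`, and `DeviceView` are hypothetical stand-ins for the roles of UMatData, Mat, and UMat.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for the shared data block: one logical buffer with a host
// copy, a simulated device copy, and flags recording which copy is stale.
struct SharedData {
    std::vector<float> host;
    std::vector<float> device;      // simulated; really an opaque device handle
    bool hostObsolete = false;      // host copy is out of date
    bool deviceObsolete = false;    // device copy is out of date
    int transfers = 0;              // number of simulated copies performed

    explicit SharedData(std::size_t n) : host(n, 0.0f), device(n, 0.0f) {
        assert(n > 0);
    }

    // Make the host copy current and hand it out for read/write access.
    std::vector<float>& mapHost() {
        if (hostObsolete) { host = device; hostObsolete = false; ++transfers; }
        deviceObsolete = true;      // the caller may modify the host copy
        return host;
    }

    // Make the device copy current and hand it out for read/write access.
    std::vector<float>& mapDevice() {
        if (deviceObsolete) { device = host; deviceObsolete = false; ++transfers; }
        hostObsolete = true;        // the caller may modify the device copy
        return device;
    }
};

// Both views share one SharedData through reference counting.
struct HostView {
    std::shared_ptr<SharedData> d;
    std::vector<float>& data() { return d->mapHost(); }
};
struct DeviceView {
    std::shared_ptr<SharedData> d;
    std::vector<float>& data() { return d->mapDevice(); }
};
```

Copies happen only when a view touches a stale side, so repeated access from the same side costs no transfer. This sketch conservatively treats every access as a possible write; a finer-grained design would distinguish read-only from read-write access.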


• OpenCL™ provides a non-proprietary API suitable for image processing and vision applications, that works well on multiple platforms

• OpenCL 2.0 and HSA enable efficient collaboration between CPU and GPU cores, on equal footing. An evolution that can only be compared to the one from single core to multi-core CPUs!

• OpenCV contains lots of OpenCL examples that can be a great starting point for your own projects.

Join the Open Standards evolution!

Concluding Thoughts


The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.

Disclaimer & Attribution

