"computer vision powered by heterogeneous system architecture (hsa)," a presentation from...
Copyright © 2014 AMD 1
Dr. Harris Gasparakis
5/29/2014
Computer Vision Powered by Heterogeneous System Architecture (HSA)
• DEVELOPING EMBEDDED VISION APPLICATIONS: THE PROPRIETARY API LEGACY.
• THE RISE OF GPUS: DENSE DATA PARALLELISM AND CACHE-COHERENT SIMD
• DO WE STILL NEED CPUs?
• THE HETEROGENEOUS FUTURE OF VISION: OPENCL™, HSA
• OPENCL EVOLUTION
– OPENCL 1.X
– OPENCL 2.X AND HSA
• THE OPENCL EXECUTION MODEL
• MONSTERS IN THE ORCHESTRA, CES 2014
• PUTTING OPENCL™ 2.0 TO WORK
• OPENCL™ IN OPENCV
– OPENCV 3.0: THE TRANSPARENT API
– HOW DOES IT WORK?
• CONCLUDING THOUGHTS
AGENDA
Is Image Processing/Vision your core IP?
No: Use somebody’s SDK or end product. This talk is not for you.
Yes:
• Choose HW and SW platform
– A multitude of devices of different capabilities and strengths!
– A multitude of algorithms of different requirements! Data Parallel (Y/N/M?)
• Highly specialized programmers
• Non-portable programs, with high platform risk
Developing Embedded Vision Applications
• “GPU is great for SIMD: Single Instruction Multiple Data”
• Image Processing = Dense Data Parallelism
• Same calculation (e.g. calculate edge strength) for all pixels.
• Let’s call this CC-SIMD: Cache Coherent SIMD (adjacent threads load adjacent data)
Just SIMD is NOT good enough for good GPU acceleration, contrary to popular wisdom.
Also be careful of:
1. Algorithms that are too simple (not enough math per memory transfer): merge kernels
2. Too much complexity per kernel (high register pressure): split kernels
[Figure: Sobel]
The rise of GPUs: Dense Data Parallelism
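As a minimal sketch of the "same calculation for all pixels" pattern, the plain C++ function below computes a Sobel edge strength for every interior pixel of a row-major 8-bit image. The function name, the |Gx| + |Gy| approximation of the gradient magnitude, and the zeroed border are assumptions made for illustration; they are not taken from the slides. The inner loop walks x along a row, so consecutive iterations touch adjacent memory, which is the CPU analogue of adjacent threads loading adjacent data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: Sobel edge strength (|Gx| + |Gy|) of a row-major
// 8-bit grayscale image. Border pixels are left at 0.
std::vector<int> sobelEdgeStrength(const std::vector<std::uint8_t>& img,
                                   int width, int height) {
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> out(img.size(), 0);

    // Read one pixel as an int; row-major layout means (x, y) and (x + 1, y)
    // are neighbours in memory.
    auto at = [&](int x, int y) {
        return static_cast<int>(img[static_cast<std::size_t>(y) * width + x]);
    };

    for (int y = 1; y + 1 < height; ++y) {
        for (int x = 1; x + 1 < width; ++x) {
            // The same arithmetic is applied at every pixel.
            int gx = (at(x + 1, y - 1) + 2 * at(x + 1, y) + at(x + 1, y + 1))
                   - (at(x - 1, y - 1) + 2 * at(x - 1, y) + at(x - 1, y + 1));
            int gy = (at(x - 1, y + 1) + 2 * at(x, y + 1) + at(x + 1, y + 1))
                   - (at(x - 1, y - 1) + 2 * at(x, y - 1) + at(x + 1, y - 1));
            out[static_cast<std::size_t>(y) * width + x] = std::abs(gx) + std::abs(gy);
        }
    }
    return out;
}
```

On a GPU, each (x, y) iteration would typically map to one work-item; the loop body itself stays the same.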
Features
• Extensive set of CPU libraries
• Also, Vision (image understanding) is a “dense to sparse transition”
• Sparsity is not a GPU’s friend.
• OpenCL 2.0 solves this problem far more efficiently (and with less code)
Do we still need CPUs?
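To make the "dense to sparse transition" concrete, here is a small sketch in plain C++: a dense per-pixel response map is reduced to a short list of keypoints whose response exceeds a threshold. The `SparseKeypoint` type, the function name, and the simple thresholding rule are hypothetical and only illustrate the idea; real detectors use more elaborate selection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical keypoint record: position plus detector response.
struct SparseKeypoint {
    int x;
    int y;
    float response;
};

// Dense input (one response per pixel, row-major) to sparse output
// (only the pixels whose response exceeds the threshold).
std::vector<SparseKeypoint> extractKeypoints(const std::vector<float>& response,
                                             int width, int height,
                                             float threshold) {
    assert(response.size() == static_cast<std::size_t>(width) * height);
    std::vector<SparseKeypoint> keypoints;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float r = response[static_cast<std::size_t>(y) * width + x];
            if (r > threshold) {
                // Data-dependent branch and append: the output size is not
                // known in advance, which is what makes this step "sparse".
                keypoints.push_back({x, y, r});
            }
        }
    }
    return keypoints;
}
```

The data-dependent branch and the variable-length output are the parts that map poorly onto dense data parallelism, which is why this stage is often a good fit for the CPU.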
[Diagram: CPU; GPU; Audio Processor; ISP (Image Signal Processing); Fixed Function Accelerator; Encode/Decode; DSP; Other!; Shared Memory]
The Heterogeneous Future of Vision
THE RIGHT IP FOR THE RIGHT TASK!
• CPU is great for serial tasks
– Lower latency
– Good branching performance
– Lower throughput
– Good at Task parallelism
– Better for Sparse Data Parallelism
• GPU excels at data parallel problems
– High throughput
– Possibly High latency
– Good at “Dense Data Parallelism”
– Increasingly better at task parallelism (concurrent kernel execution, and OpenCL 2.0 Dynamic parallelism)
An efficient Heterogeneous System Architecture would be optimal (e.g. GFLOPS/$/W)
© Copyright 201 HSA Foundation. All Rights Reserved.
[Slide: HSA Foundation member logos, grouped as Founders, Promoters, Supporters, Contributors, Academic]
• OpenCL™: Khronos Software API
– Cross-platform (Windows, Linux, Mac OS, etc.)
– Multi-vendor (AMD, Apple, IBM, Intel, NVIDIA, etc.), with maturing support
– Multicore CPU, discrete GPU, integrated GPU (aka APU), DSP, FPGA, etc.
• HSA: Heterogeneous System Architecture, an industry standard specification
– OpenCL 2.0 introduces HSA features
• Open Source also helps!
– OpenCV, featuring OpenCL acceleration
Open Standards
OpenCL™ Evolution: Discrete GPU
OpenCL was invented as an open-standard, high-level API for GPU compute, first on discrete graphics cards
OpenCL abstracts:
• Data management across multiple memory spaces
– Memory buffers / Images
• Compute Instructions
– “Kernels”
• Execution on “compute units” (CUs)
[Diagram: host CPU with main memory (host memory, pinned PCIe memory) connected over PCIe™ to a discrete GPU with compute units (CU) and GPU device memory]
OpenCL™ Evolution: Legacy APU
APU: Physical Integration: CPU and GPU on same die
OpenCL (1.x) also works on APUs
• Device memory is (part of) main memory, but still must use memory buffers!
[Diagram: CPU and GPU (compute units, CU) share main memory, divided into host memory, device-visible host memory, host-visible device memory, and device memory]
OpenCL™ Evolution: HSA Enabled APU
Unified Coherent Memory enables data sharing across all processors and GPU compute units
OpenCL™ 2.x: No need to use memory buffers; just use data pointers, as you would on the CPU.
Unified (Bidirectionally Coherent, pageable) Virtual Memory
[Diagram: CPU and GPU (compute units, CU), each with its own cache, on top of shared physical memory]
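The difference between the buffer model and the pointer model can be sketched in plain C++ without any OpenCL runtime. Everything here is a simulation under stated assumptions: `scaleKernel` stands in for a GPU kernel, a second `std::vector` stands in for device memory, and the function names are hypothetical. The first variant stages data through a separate buffer, as OpenCL 1.x requires; the second hands the kernel the host pointer directly, as unified coherent memory allows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a GPU kernel: scales every element in place.
void scaleKernel(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Buffer model (simulated): copy to a separate "device" buffer, run the
// kernel there, copy the result back. Returns the elements copied.
std::size_t runWithBuffers(std::vector<float>& host, float factor) {
    std::vector<float> deviceBuffer(host);  // host -> device copy
    scaleKernel(deviceBuffer.data(), deviceBuffer.size(), factor);
    host = deviceBuffer;                    // device -> host copy
    return 2 * host.size();
}

// Shared-pointer model (simulated): the kernel works on the host data
// directly, so nothing is copied.
std::size_t runWithSharedPointer(std::vector<float>& host, float factor) {
    scaleKernel(host.data(), host.size(), factor);
    return 0;
}
```

Both variants produce the same result; only the amount of data movement differs, and that is the cost unified memory is meant to remove.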
A 360° × 90° immersive gesture-enabled experience, enabled by OpenCL (and OpenCV)
Monsters in the Orchestra, AMD CES 2014
VISION: Dense to sparse transition
Putting OPENCL™ 2.0 to Work
GPU keypoints
Data changed by a kernel can be visible to the CPU before the kernel returns (requires fine-grain SVM).
CPU consumes keypoints “as they come”, and updates a “shape model”
Host:
• Device Setup
• Compile Kernels
• Allocate Memory
• Memory Transfer (discrete GPU), or Zero Copy (APU, OpenCL 1.2), or Shared Virtual Memory (APU, OpenCL 2.0)
• Other Tasks…
• Further Processing
• Clean up
Compute Device:
• Kernel 1
• Kernel 2
– Kernel 2_1
– Kernel 2_2
• Kernel
New in OpenCL 2.0: Dynamic parallelism! A kernel can enqueue another kernel
The OpenCL Execution Model
Open Computer Vision Library
2,500+ algorithms and functions | Cross-platform | BSD license
High performance | Professionally developed | 7M+ downloads
• OpenCL is fully integrated in OpenCV
• ~100 most commonly used algorithms optimized with OpenCL
• Can be built without the OpenCL SDK installed (dynamic OpenCL runtime loading)
• OpenCL enabled in the official Windows binary package
• OpenCV pre-commit check includes OpenCL tests
• Very easy to plug in your own kernels using OpenCV plumbing
• In 2.4.x, OpenCL acceleration is a distinct code path
Timeline:
• 2000: First public release
• 2009: v2.0, C++ API
• 2013: v2.4.3, OpenCL™
• ~10/2014: v3.0 alpha, Transparent API
// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
Mat frame, frameGray;
vector<Rect> faces;
for(;;){
    // grab a frame, convert to grayscale, equalize, detect faces (all on the CPU)
    vcap >> frame;
    cvtColor(frame, frameGray, CV_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);
}
OCV 2.4: Face detect on CPU
// initialization
VideoCapture vcap(...);
ocl::OclCascadeClassifier fd("haar_ff.xml");
ocl::oclMat frame, frameGray;
Mat frameCpu;
vector<Rect> faces;
for(;;){
    vcap >> frameCpu;
    frame = frameCpu;  // upload the frame to the OpenCL device
    ocl::cvtColor(frame, frameGray, CV_BGR2GRAY);
    ocl::equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);  // OpenCL-accelerated detection
}
OCV 2.4: Face detect using OpenCL™
OpenCV 2.4: Similar, but not identical code paths. You will need to write code explicitly for both CPU and OpenCL
// initialization
VideoCapture vcap(...);
CascadeClassifier fd("haar_ff.xml");
UMat frame, frameGray;  // UMat instead of Mat is the only change from the CPU version
vector<Rect> faces;
for(;;){
    vcap >> frame;
    cvtColor(frame, frameGray, COLOR_BGR2GRAY);
    equalizeHist(frameGray, frameGray);
    fd.detectMultiScale(frameGray, faces);
}
OCV 3.0: Face detect Anywhere!
This code will run, and configure itself
differently on different platforms!
The Need for a Transparent API
Mat and UMat:
• Mat → UMat: getUMat(…)
• UMat → Mat: getMat(…)
UMatData:
• Reference counts
• Dirty bits
• Opaque handles (e.g. clBuffer)
• CPU data
• GPU data
• Handles data synchronization efficiently
• Easy transition path from 2.x to 3.x. Code that used to work in 2.x should still work. Therefore, cv::Mat is still around.
Both Mat and UMat are views into UMatData, which does the heavy lifting
How does OpenCV 3.0 T-API work?
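The role of the dirty bits can be illustrated with a toy model in plain C++. This is not OpenCV's implementation: `ToyBufferData`, its members, and the use of a second `std::vector` as "GPU data" are all hypothetical, chosen only to show how tracking which copy is newer lets a shared data block skip unnecessary transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a shared data block with a CPU copy, a
// (simulated) GPU copy, and one dirty bit per side.
struct ToyBufferData {
    std::vector<float> host;    // "CPU data"
    std::vector<float> device;  // "GPU data", simulated with ordinary memory
    bool hostDirty = false;     // host copy is newer than the device copy
    bool deviceDirty = false;   // device copy is newer than the host copy
    int copies = 0;             // number of host <-> device copies performed

    explicit ToyBufferData(std::size_t n) : host(n, 0.0f), device(n, 0.0f) {}

    // CPU-side view: pull from the device only if the device copy is newer.
    std::vector<float>& hostView(bool willWrite) {
        if (deviceDirty) {
            host = device;
            deviceDirty = false;
            ++copies;
        }
        if (willWrite) hostDirty = true;
        return host;
    }

    // Device-side view: push from the host only if the host copy is newer.
    std::vector<float>& deviceView(bool willWrite) {
        if (hostDirty) {
            device = host;
            hostDirty = false;
            ++copies;
        }
        if (willWrite) deviceDirty = true;
        return device;
    }
};
```

Repeated reads from the same side cost nothing after the first synchronization; a copy happens only when the other side has written in between.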
• OpenCL™ provides a non-proprietary API, suitable for image processing
and vision applications, that works well on multiple platforms
• OpenCL 2.0 and HSA enable efficient collaboration between CPU and
GPU cores, on equal footing. An evolution that can only be compared to
the one from single core to multi-core CPUs!
• OpenCV contains lots of OpenCL examples that can be a great starting
point for your own projects.
Join the Open Standards evolution!
Concluding Thoughts
The information presented in this document is for informational purposes only and may contain technical inaccuracies,
omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product
releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the
like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right
to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify
any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN
NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES
ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks
of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used by
permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.
Disclaimer & Attribution