
© Copyright Khronos Group 2013

Accelerating Extraordinary User Experiences on Mobile Devices

Neil Trevett
Khronos, President
NVIDIA, Vice President Mobile Content


A New Era in Computing

1990s: PC
2000s: Internet
2010s: Mobile


Why is Mobile Parallelism Needed?

Courtesy Metaio http://www.youtube.com/watch?v=xw3M-TNOo44&feature=related

State-of-the-art Augmented Reality without GPU Compute





Augmented Reality with GPU Parallelism

"High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence", P. Kán, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria

Research today on CUDA equipped laptop PCs

How will this compute capability migrate from high-end PCs to mobile?


Mobile SOC Performance Increases

[Chart: CPU/GPU aggregate performance (log scale, 1 to 100) against mobile device shipping dates, 2011 to 2015]

Mobile SOCs plotted: Tegra 2 (first dual A9), Tegra 3 (quad A9), Tegra 4 (quad A15), Logan (full Kepler GPU, CUDA 5.0, OpenGL 4.3), Parker (64-bit Denver CPU, Maxwell GPU)

Reference points: Core 2 Duo, Core i5

Devices shown: HTC One X+, Google Nexus 7

100x perf increase in four years


Mobile GPU Compute Today (almost)

• Kayla = Tegra 3 + discrete Kepler GPU
- Tuned to match Logan GPU performance

• Full set of drivers on Linux
- OpenGL ES 3.0
- OpenGL 4.3
- CUDA 5.0


Power is the New Design Limit

• The Process Fairy keeps bringing more transistors...

...but the ‘End of Voltage Scaling’ means power is much more of an issue than in the past

In the Good Old Days
Leakage was not important, and voltage scaled with feature size

$L' = L/2$, $D' = 1/L'^2 = 4D$, $f' = 2f$, $V' = V/2$, $E' = C'V'^2 = E/8$, $P' = P$

Halve L and get 4x the transistors and 8x the capability for the same power

The New Reality
Leakage has limited threshold voltage, largely ending voltage scaling

$L' = L/2$, $D' = 1/L'^2 = 4D$, $f' \approx 2f$, $V' \approx V$, $E' = C'V'^2 = E/2$, $P' = 4P$

Halve L and get 4x the transistors and 8x the capability for 4x the power!!
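
The two scaling regimes above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides: it assumes device capacitance scales with feature size (C' = C/2), that energy per operation is E = C·V², and that power for a fixed die area is density × frequency × energy per operation. The function name `scale_generation` is hypothetical.

```python
from fractions import Fraction


def scale_generation(voltage_factor):
    """Relative (density, frequency, energy_per_op, power) after halving L.

    Assumption: capacitance scales with feature size, so C' = C/2 and
    energy per operation E = C * V**2 scales as (1/2) * voltage_factor**2.
    """
    length = Fraction(1, 2)                  # L' = L/2
    density = 1 / length**2                  # D' = 4D
    frequency = 1 / length                   # f' = 2f
    energy = length * voltage_factor**2      # E' = C'V'^2
    power = density * frequency * energy     # power for the same die area
    return density, frequency, energy, power


# "Good old days": voltage scales with feature size (V' = V/2)
classic = scale_generation(Fraction(1, 2))

# "New reality": voltage stays roughly constant (V' ~ V)
modern = scale_generation(Fraction(1))

print("classic:", classic)
print("modern: ", modern)
```

With voltage scaling the energy per operation drops to E/8 and power stays flat; without it the energy only drops to E/2, so the same 8x capability costs 4x the power.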


Mobile Thermal Design Point

Typical max system power levels before thermal failure: 2-4W, 4-7W, 6-10W, 30-90W

Even as battery technology improves, these thermal limits remain

4-5” screen takes 250-500mW

7” screen takes 1W

10” screen takes 1-2W

Resolution makes a difference: the iPad3 screen takes up to 8W!


How to Save Power?

• Much more expensive to MOVE data than COMPUTE data

• Process improvements WIDEN the gap
- 10nm process will increase ratio another 4X

• Energy efficiency must be key metric during silicon AND app design
- Awareness of where data lives, where computation happens, how it is scheduled

For 40nm, 1V process:

32-bit Integer Add           1pJ
32-bit Float Operation       7pJ
32-bit Register Write        0.5pJ
Send 32-bits 2mm             24pJ
Send 32-bits Off-chip        50pJ
Write 32-bits to LP-DDR2     600pJ
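
To make the move-versus-compute gap concrete, the figures above can be plugged into a small energy estimate. This is an illustrative sketch only: the dictionary keys, the `energy_pj` helper, and the workload of 10 float operations per 32-bit result are assumptions for the example, not measurements from the slides.

```python
# Energy per operation in picojoules, from the 40nm, 1V figures above.
ENERGY_PJ = {
    "int_add_32": 1.0,
    "float_op_32": 7.0,
    "register_write_32": 0.5,
    "send_2mm_32": 24.0,
    "send_off_chip_32": 50.0,
    "lpddr2_write_32": 600.0,
}


def energy_pj(counts):
    """Total energy in pJ for a dict of {operation name: count}."""
    return sum(ENERGY_PJ[op] * n for op, n in counts.items())


# Hypothetical workload: 10 float operations to produce one 32-bit result.
OPS_PER_RESULT = 10

# Result written out to LP-DDR2 after the computation.
in_memory = energy_pj({"float_op_32": OPS_PER_RESULT, "lpddr2_write_32": 1})

# Result kept in a register for the next stage instead.
in_register = energy_pj({"float_op_32": OPS_PER_RESULT, "register_write_32": 1})

print(f"write to DRAM:    {in_memory} pJ")
print(f"keep in register: {in_register} pJ")
print(f"one DRAM write costs as much as "
      f"{ENERGY_PJ['lpddr2_write_32'] / ENERGY_PJ['float_op_32']:.0f} float ops")
```

Under these assumptions the DRAM write dominates the budget, which is why keeping intermediate data on-chip matters more than trimming arithmetic.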


Mobile Parallel Compute Use Case Pipeline …

• … largely driven by using cameras as sensors

Augmented Reality

Face, Body and Gesture Tracking

Computational Photography and Videography

3D Scene/Object Reconstruction


Camera Sensor Processing

• CPU
- Single processor or Neon SIMD - running fast
- Makes heavy use of general memory
- Non-optimal performance and power

• GPU
- Programmable and flexible
- Many way parallelism - run at lower frequency
- Efficient image caching close to processors
- BUT cycles frames in and out of memory

• Camera ISP (Image Signal Processor)
- Little or no programmability
- Data flows thru compact hardware pipe
- Scan-line-based - no global memory
- Best perf/watt

~760 math Ops ~42K vals = 670Kb

300MHz ~250Gops


Dark Silicon

• GPUs are much more power efficient than CPUs
- When exploiting data parallelism can be 10x as efficient – but can go further…

• Lots of space for transistors on SOC – but can’t turn them all on at same time!
- Would exceed Thermal Design Point

• Dark Silicon - specialized hardware – only turned on when needed
- Dedicated units can increase locality and parallelism of computation

[Chart: Power Efficiency against Computation Flexibility]

Multi-core CPU: X1
GPU Compute: X10
Dedicated Hardware: X100

Enabling new mobile experiences requires pushing computation onto GPUs and dedicated hardware


Low Power Environment Scanning

• Many sensor use cases would consume too much power to be running 24/7
- Environment aware use cases have to be very low power

• ‘Scanners’ - very low power, always on, detect things in the environment
- Trigger the next level of processing capability

ARM 7: 1 MIP and accelerometers can detect someone in the vicinity

DSP: Low power activation of camera to detect someone in field of view

GPU: GPU acceleration for precision gesture processing
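
The escalation described above, where each low-power scanner wakes the next level only when it detects something, can be sketched as a simple cascade. This is a toy model under stated assumptions: the `Stage` type, the stage names, the sample fields (`motion`, `person_in_view`, `gesture_score`) and the 0.5 thresholds are all hypothetical and stand in for real sensor processing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    detect: Callable[[Dict[str, float]], bool]


def run_cascade(stages: List[Stage], sample: Dict[str, float]) -> List[str]:
    """Wake stages in order; stop after the first stage that detects nothing."""
    woken = []
    for stage in stages:
        woken.append(stage.name)       # this stage is powered up
        if not stage.detect(sample):
            break                      # nothing found: later stages stay off
    return woken


# Hypothetical three-level cascade mirroring the slide: microcontroller with
# accelerometers, then DSP with camera, then GPU gesture processing.
STAGES = [
    Stage("arm7_accelerometer", lambda s: s["motion"] > 0.5),
    Stage("dsp_camera", lambda s: s["person_in_view"] > 0.5),
    Stage("gpu_gesture", lambda s: s["gesture_score"] > 0.5),
]

idle = {"motion": 0.0, "person_in_view": 0.0, "gesture_score": 0.0}
print(run_cascade(STAGES, idle))
```

In the idle case only the always-on stage runs, so the expensive processors stay dark until the cheaper stage has a reason to wake them.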


Advanced Camera Control Use Cases

• High-dynamic range (HDR) and computational flash photography
- High-speed burst with individual frame control over exposure and flash

• Rolling shutter elimination
- High-precision intra-frame synchronization between camera and motion sensor

• HDR Panorama, photo-spheres
- Continuous frame capture with constant exposure and white balance

• Subject isolation and depth detection
- High-speed burst with individual frame control over focus

• Time-of-flight or structured light depth camera processing
- Aligned stacking of data from multiple sensors

• Augmented Reality
- 60Hz, low-latency capture with motion sensor synchronization
- Multiple Region of Interest (ROI) capture
- Multiple sensors for scene scaling
- Detailed feedback on camera operation per frame


Camera Control API Complements Acceleration

[Diagram: Sensor, Lens, Flash; Pre-ISP Processing (Bayer Space); ISP; Post-ISP Processing (YUV Space); Application]

Need Camera API to feed processing pipeline for advanced use cases

Image/Vision Acceleration APIs ?

New Camera Working Group
Call for participation being publicly announced today!

http://www.khronos.org/camera


Precursor APIs for Camera Control Initiative

• FCAM – Open source project
- Capture of stream of camera images with precision control
- A pipeline that converts requests into image stream
- All parameters packed into the requests - no global state
- Programmer has full control over sensor settings for each frame in stream
- Control over focus and flash
- No hidden daemon running
- Control ISP
- Can access supplemental statistics from ISP if available

• Android New Camera HAL (2013)
- Uses some of these concepts
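
The request-based model described for FCAM, where every parameter travels with the request and there is no global camera state, can be illustrated with a small simulation. This is a sketch only and does not use the FCam API or the Android camera HAL: the `Request` and `Frame` types, their fields, and the `capture` generator are hypothetical names invented for the example, and timestamps are simulated by summing exposure times.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Request:
    """All settings for one frame; nothing is stored in the pipeline itself."""
    exposure_us: int
    gain: float = 1.0
    flash: bool = False


@dataclass(frozen=True)
class Frame:
    request: Request       # the exact settings that produced this frame
    timestamp_us: int      # simulated end-of-exposure time


def capture(requests: Iterable[Request], start_us: int = 0) -> Iterator[Frame]:
    """Simulated pipeline: converts a stream of requests into a stream of frames."""
    t = start_us
    for req in requests:
        t += req.exposure_us
        yield Frame(request=req, timestamp_us=t)


# Hypothetical HDR burst: three frames with individually controlled exposure.
hdr_burst = [Request(exposure_us=e) for e in (1_000, 4_000, 16_000)]
frames = list(capture(hdr_burst))

for frame in frames:
    print(frame.timestamp_us, frame.request)
```

Because each frame carries the request that produced it, downstream code never has to guess which settings were active when a given frame was exposed.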


Camera Control API Usage

[Diagram labels: Application; Camera Control API; Sensor, Lens, Flash; ISP Image Processing; Output Scalers and Formatters; MEMS Sensors; Downstream Processing; Statistics Feedback]

Burst control of sensor, flash, lens with precision timestamps on frames

MEMS Sensors: Precision timestamps on positional sensor samples

EGLStream and EGL access to sample timestamps


OpenVX

• Vision Hardware Acceleration Layer
- Enables hardware vendors to implement accelerated imaging and vision algorithms
- For use by high-level libraries or apps

• Focus on enabling real-time vision
- On mobile and embedded systems

• Diversity of efficient implementations
- From programmable processors, through GPUs to dedicated hardware pipelines

[Diagram labels: Application; OpenCV open source library; Other higher-level CV libraries; Open source sample implementation; Hardware vendor implementations]

Dedicated hardware can help make vision processing performant and low-power enough for pervasive ‘always-on’ use


OpenVX and OpenCV are Complementary

Governance
OpenCV: Open source; community driven; no formal specification
OpenVX: Formal specification and full conformance tests; implemented by hardware vendors

Scope
OpenCV: Very wide; 1000s of functions of imaging and vision; multiple camera APIs/interfaces
OpenVX: Tight focus on hardware accelerated functions for mobile vision; use external camera API

Conformance
OpenCV: No conformance testing; every vendor implements different subset
OpenVX: Full conformance test suite / process; reliable acceleration platform

Use Case
OpenCV: Rapid prototyping
OpenVX: Production deployment

Efficiency
OpenCV: Memory-based architecture; each operation reads and writes memory; sub-optimal power / performance
OpenVX: Graph-based execution; optimized nodes and data transfer; highly efficient


OpenVX Power Efficiency

• OpenVX Graph for power and performance efficiency
- Each Node can be implemented in software or accelerated hardware
- Nodes may be fused by the implementation
- Eliminates transferring the image to and from memory

• EGLStreams can provide data and event interop with other APIs
- BUT use of other Khronos APIs is not mandated

• VXU Utility Library provides efficient access to single nodes
- Open source implementation – easy way to start using OpenVX
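
The benefit of node fusion can be shown with a toy comparison between running each node as a separate pass and running all nodes in a single pass. This is a conceptual sketch, not the OpenVX API: the functions `run_unfused` and `run_fused` are hypothetical, the "image" is a flat list of pixel values, and fusion is only shown for point-wise nodes, which are the simplest case.

```python
def run_unfused(image, nodes):
    """Each node reads a full image and writes a full intermediate image."""
    buffers_written = 0
    for node in nodes:
        image = [node(px) for px in image]   # full pass over memory per node
        buffers_written += 1
    return image, buffers_written


def run_fused(image, nodes):
    """All nodes are applied per pixel in one pass; only the output is written."""
    def fused(px):
        for node in nodes:
            px = node(px)
        return px

    return [fused(px) for px in image], 1


# Hypothetical point-wise nodes: scale, offset, clamp to 8 bits.
nodes = [lambda p: p * 2, lambda p: p + 3, lambda p: min(p, 255)]
image = [0, 10, 200]

print(run_unfused(image, nodes))
print(run_fused(image, nodes))
```

Both paths produce the same pixels, but the fused path writes one image instead of one per node, which is the memory traffic a graph-based implementation is free to eliminate.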

[Diagram labels: Native Camera Control; four OpenVX Nodes; Heterogeneous Processing]


Android Three Layer Ecosystem

Apps and Games: Most use Java; cutting-edge apps/games use native APIs

Middleware and App Engines: Use native APIs for power and performance

API Drivers: Java (SDK) and Native (NDK)

ISVs

Most developers use the Java SDK for easy development and portability, BUT leading-edge apps, games and middleware need the power efficiency and performance of native C

SOCs


APIs for Android GPU Compute and Graphics

Java, Graphics: Java binding to OpenGL ES (similar to JSR239)

Java, GPU Compute: RenderScript. Run performance-critical sections as native C. Automatically offload C code segments to the GPU if possible

Native, GPU Compute: GLSL shaders for GPGPU compute?

General Native GPU Compute: Full ANSI C programming of heterogeneous CPUs and GPUs


OpenCL as Parallel Compute Foundation

OpenCL HLM: C++ syntax/compiler extensions

Aparapi: Java language extensions for parallelism

WebCL: JavaScript binding to OpenCL for initiation of OpenCL C kernels

River Trail: Language extensions to JavaScript

C++ AMP (Shevlin Park): Uses Clang and LLVM

PyOpenCL: Python wrapper around OpenCL

CUDA or DirectCompute may also be used as compiler targets – BUT OpenCL provides cross-platform, cross-vendor coverage


Visual Sensor Revolution

• Single sensor RGB cameras are just the start of the mobile visual revolution
- IR sensors – LEAP Motion, eye-trackers

• Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras
- Even stereo pair can enable object scaling and enhanced depth extraction
- Plenoptic Field processing needs FFTs and ray-casting

• Hybrid visual sensing solutions
- Different sensors mixed for different distances and lighting conditions

• GPUs today – more dedicated ISPs tomorrow?

Stereo Camera: LG Electronics

Plenoptic Array: Pelican Imaging

Capri Structured Light 3D Camera: PrimeSense


Paths to Low Power Parallelism

• Dark Silicon – use silicon area to implement range of power/perf processor types

• Localize and parallelize – port code from CPUs -> GPUs -> DSPs -> Hardware

• Smart triggering of compute resources only when needed – sensor fusion

• Graph-based processing APIs – node fusion to avoid memory round trips

• Developer instrumentation for power!
- Dynamic and feedback-driven software power optimization
- Instrumentation for energy-aware compilers and profilers
- Most compilers just look at one thread, take a more global view
- Power optimizing compiler back-end / installers

• Neil Trevett • [email protected]

