accelerating extraordinary user experiences on mobile devices · • high-dynamic range (hdr) and...

of 25 /25
© Copyright Khronos Group 2013 - Page 1 Accelerating Extraordinary User Experiences on Mobile Devices Neil Trevett Khronos, President NVIDIA, Vice President Mobile Content

Author: buikhanh

Post on 23-Jul-2019

216 views

Category:

Documents


0 download

Embed Size (px)

TRANSCRIPT

  • Copyright Khronos Group 2013 - Page 1

    Accelerating Extraordinary User Experiences on Mobile Devices

    Neil Trevett Khronos, President

    NVIDIA, Vice President Mobile Content

  • Copyright Khronos Group 2013 - Page 2

    1990s 2000s 2010s

    A New Era in Computing

    PC Internet Mobile

  • Copyright Khronos Group 2013 - Page 3

    Why is Mobile Parallelism Needed?

    Courtesy Metaio http://www.youtube.com/watch?v=xw3M-TNOo44&feature=related

    State-of-the-art Augmented Reality without GPU Compute

    http://www.youtube.com/watch?v=xw3M-TNOo44&feature=relatedhttp://www.youtube.com/watch?v=xw3M-TNOo44&feature=relatedhttp://www.youtube.com/watch?v=xw3M-TNOo44&feature=related
  • Copyright Khronos Group 2013 - Page 4

    Augmented Reality with GPU Parallelism

    High-Quality Reflections, Refractions, and Caustics in Augmented Reality and their Contribution to Visual Coherence P. Kn, H. Kaufmann, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria

    Research today on CUDA equipped laptop PCs

    How will this compute capability migrate from high-end PCs to mobile?

  • Copyright Khronos Group 2013 - Page 5

    Mobile SOC Performance Increases

    1

    100

    CPU

    /GPU

    AG

    GRE

    GAT

    E PE

    RFO

    RMA

    NCE

    2013 2015

    Tegra 4 Quad A15

    2014 2011

    2012

    Tegra 2 1st Dual A9 Tegra 2 Dual A9

    Logan

    10

    Core 2 Duo

    Parker

    Core i5

    HTC One X+

    Google Nexus 7

    100x perf increase in four years

    Mobile Device Shipping Dates

    Full Kepler GPU CUDA 5.0

    OpenGL 4.3

    64-bit Denver CPU Maxwell GPU

    Tegra 3 Quad A9

  • Copyright Khronos Group 2013 - Page 6

    Mobile GPU Compute Today (almost) Kayla = Tegra 3 + discrete Kepler GPU

    - Tuned to match Logan GPU performance Full set of drivers on Linux

    - OpenGL ES 3.0 - OpenGL 4.3 - CUDA 5.0

  • Copyright Khronos Group 2013 - Page 7

    Power is the New Design Limit The Process Fairy keeps bringing more transistors..

    ..but the End of Voltage Scaling means power is much more of an issue than in the past

    In the Good Old Days Leakage was not important, and voltage

    scaled with feature size

    L = L/2 D = 1/L2 = 4D f = 2f V = V/2 E = CV2 = E/8 P = P

    Halve L and get 4x the transistors and 8x the capability for

    the same power

    The New Reality Leakage has limited threshold voltage,

    largely ending voltage scaling

    L = L/2 D = 1/L2 = 4D f = ~2f V = ~V E = CV2 = E/2 P = 4P

    Halve L and get 4x the transistors and 8x the capability for

    4x the power!!

  • Copyright Khronos Group 2013 - Page 8

    Mobile Thermal Design Point

    2-4W 4-7W

    6-10W 30-90W

    4-5 Screen takes 250-500mW

    7 Screen takes 1W

    10 Screen takes 1-2W Resolution makes a difference -

    the iPad3 screen takes up to 8W!

    Typical max system power levels before thermal failure Even as battery technology improves - these thermal limits remain

  • Copyright Khronos Group 2013 - Page 9

    How to Save Power?

    Much more expensive to MOVE data than COMPUTE data

    Process improvements WIDEN the gap - 10nm process will increase ratio another 4X

    Energy efficiency must be key metric during silicon AND app design - Awareness of where data lives, where

    computation happens, how is it scheduled

    32-bit Integer Add 1pJ

    32-bit Float Operation 7pJ

    32-bit Register Write 0.5pJ

    Send 32-bits 2mm 24pJ

    Send 32-bits Off-chip 50pJ

    For 40nm, 1V process

    Write 32-bits to LP-DDR2 600pJ

  • Copyright Khronos Group 2013 - Page 10

    Mobile Parallel Compute Use Case Pipeline largely driven by using camera as sensors

    Augmented Reality

    Face, Body and Gesture Tracking

    Computational Photography and

    Videography

    3D Scene/Object Reconstruction

  • Copyright Khronos Group 2013 - Page 11

    Camera Sensor Processing CPU

    - Single processor or Neon SIMD - running fast - Makes heavy use of general memory - Non-optimal performance and power

    GPU - Programmable and flexible - Many way parallelism - run at lower frequency - Efficient image caching close to processors - BUT cycles frames in and out of memory

    Camera ISP (Image Signal Processor) - Little or no programmability - Data flows thru compact hardware pipe - Scan-line-based - no global memory - Best perf/watt

    ~760 math Ops ~42K vals = 670Kb

    300MHz ~250Gops

  • Copyright Khronos Group 2013 - Page 12

    Dark Silicon GPUs are much more power efficient than CPUs

    - When exploiting data parallelism can x10 as efficient but can go further Lots of space for transistors on SOC but cant turn them all on at same time!

    - Would exceed Thermal Design Point Dark Silicon - specialized hardware only turned on when needed

    - Dedicated units can increase locality and parallelism of computation

    Power Efficiency

    Computation Flexibility

    Enabling new mobile experiences requires pushing computation onto GPUs and

    dedicated hardware

    Dedicated Hardware

    GPU Compute

    Multi-core CPU

    X1

    X10

    X100

  • Copyright Khronos Group 2013 - Page 13

    Low Power Environment Scanning Many sensor use cases would consume too much power to be running 24/7

    - Environment aware use cases have to be very low power Scanners - very low power, always on, detect things in the environment

    - Trigger the next level of processing capability

    ARM 7 1 MIP and accelerometers can detect someone in the vicinity

    DSP Low power activation of camera

    to detect someone in field of view

    GPU GPU acceleration for precision

    gesture processing

  • Copyright Khronos Group 2013 - Page 14

    Advanced Camera Control Use Cases High-dynamic range (HDR) and computational flash photography

    - High-speed burst with individual frame control over exposure and flash Rolling shutter elimination

    - High-precision intra-frame synchronization between camera and motion sensor HDR Panorama, photo-spheres

    - Continuous frame capture with constant exposure and white balance Subject isolation and depth detection

    High-speed burst with individual frame control over focus

    Time-of-flight or structured light depth camera processing - Aligned stacking of data from multiple sensors

    Augmented Reality - 60Hz, low-latency capture with motion sensor synchronization - Multiple Region of Interest (ROI) capture - Multiple sensors for scene scaling - Detailed feedback on camera operation per frame

  • Copyright Khronos Group 2013 - Page 15

    Camera Control API Complements Acceleration

    Pre-ISP Processing (Bayer Space)

    ISP

    Post-ISP Processing (YUV Space)

    Application

    Need Camera API to feed processing pipeline for

    advanced use cases

    Image/Vision Acceleration APIs ?

    Sensor Lens

    Flash

    New Camera Working Group Call for participation being publicly announced today!

    http://www.khronos.org/camera

    http://www.khronos.org/camera
  • Copyright Khronos Group 2013 - Page 16

    Precursor APIs for Camera Control Initiative FCAM Open source project

    - Capture of stream of camera images with precision control - A pipeline that converts requests into image stream - All parameters packed into the requests - no global state - Programmer has full control over sensor settings for each frame in stream

    - Control over focus and flash - No hidden daemon running

    - Control ISP - Can access supplemental

    statistics from ISP if available

    Android New Camera HAL (2013) - Uses some of these concepts

  • Copyright Khronos Group 2013 - Page 17

    Camera Control API Usage

    Burst control of sensor, flash, lens

    with precision timestamps on frames

    Application

    Statistics Feedback

    Output Scalers

    and Formatters

    Downstream Processing

    ISP Image Processing

    MEMS Sensors

    Precision timestamps on positional sensor

    samples

    Sensor Lens

    Flash EGLStream

    and EGL access to sample timestamps

    Camera Control API

  • Copyright Khronos Group 2013 - Page 18

    OpenVX Vision Hardware Acceleration Layer

    - Enables hardware vendors to implement accelerated imaging and vision algorithms

    - For use by high-level libraries or apps Focus on enabling real-time vision

    - On mobile and embedded systems Diversity of efficient implementations

    - From programmable processors, through GPUs to dedicated hardware pipelines

    Open source sample implementation

    Hardware vendor implementations

    OpenCV open source library

    Other higher-level CV libraries

    Application

    Dedicated hardware can help make vision processing performant and low-power enough

    for pervasive always-on use

  • Copyright Khronos Group 2013 - Page 19

    OpenVX and OpenCV are Complementary

    Governance Open Source

    Community Driven No formal specification

    Formal specification and full conformance tests

    Implemented by hardware vendors

    Scope Very wide

    1000s of functions of imaging and vision Multiple camera APIs/interfaces

    Tight focus on hardware accelerated functions for mobile vision Use external camera API

    Conformance No Conformance testing Every vendor implements different subset Full conformance test suite / process

    Reliable acceleration platform

    Use Case Rapid prototyping Production deployment

    Efficiency Memory-based architecture

    Each operation reads and writes memory Sub-optimal power / performance

    Graph-based execution Optimized nodes and data transfer

    Highly efficient

  • Copyright Khronos Group 2013 - Page 20

    OpenVX Power Efficiency OpenVX Graph for power and performance efficiency

    - Each Node can be implemented in software or accelerated hardware - Nodes may be fused by the implementation - Eliminates transferring the image to and from memory

    EGLStreams can provide data and event interop with other APIs - BUT use of other Khronos APIs are not mandated

    VXU Utility Library provides efficient access to single nodes - Open source implementation easy way to start using OpenVX

    OpenVX Node

    OpenVX Node

    OpenVX Node

    OpenVX Node

    Heterogeneous Processing

    Native Camera Control

  • Copyright Khronos Group 2013 - Page 21

    Android Three Layer Ecosystem

    API Drivers - Java (SDK) and Native (NDK)

    Apps and Games Most use Java, Cutting-edge apps/games use native APIs

    Middleware and Apps Engines Use native APIs for power and performance

    ISVs

    Most developers use Java SDK for easy development and portability BUT

    Leading edge apps, games and middleware need power efficient and performance of native C

    SOCs

  • Copyright Khronos Group 2013 - Page 22

    General Native GPU Compute

    APIs for Android GPU Compute and Graphics

    Graphics GPU Compute

    Java

    Native

    RenderScript Run performance critical sections as native C. Automatically offload C code segments to the

    GPU if possible

    Java Binding to OpenGL ES (similar to JSR239)

    GLSL shaders for GPGPU compute ? Full ANSI C programming of heterogeneous CPUs and GPUs

  • Copyright Khronos Group 2013 - Page 23

    OpenCL as Parallel Compute Foundation

    C++ syntax/compiler

    extensions

    OpenCL HLM Aparapi Java language extensions for

    parallelism

    JavaScript binding to OpenCL for initiation of

    OpenCL C kernels

    WebCL River Trail Language extensions

    to JavaScript

    C++ AMP Shevlin Park Uses Clang and LLVM

    PyOpenCL Python wrapper

    around OpenCL

    CUDA or DirectCompute may also be used as compiler targets BUT OpenCL provides cross-platform, cross-vendor coverage

  • Copyright Khronos Group 2013 - Page 24

    Visual Sensor Revolution Single sensor RGB cameras are just the start of the mobile visual revolution

    - IR sensors LEAP Motion, eye-trackers Multi-sensors: Stereo pairs -> Plenoptic array -> Depth cameras

    - Even stereo pair can enable object scaling and enhanced depth extraction - Plenoptic Field processing needs FFTs and ray-casting

    Hybrid visual sensing solutions - Different sensors mixed for different distances and lighting conditions

    GPUs today more dedicated ISPs tomorrow?

    Stereo Camera LG Electronics

    Plenoptic Array Pelican imaging

    Capri Structured Light 3D Camera PrimeSense

  • Copyright Khronos Group 2013 - Page 25

    Paths to Low Power Parallelism Dark Silicon use silicon area to implement range of power/perf processor types Localize and parallelize port code from CPUs -> GPUs -> DSPs -> Hardware Smart triggering of compute resources only when needed sensor fusion Graph-based processing APIs node fusion to avoid memory round trips Developer instrumentation for power!

    - Dynamic and feedback-driven software power optimization - Instrumentation for energy-aware compilers and profilers - Most compilers just look at one thread, take a more global view - Power optimizing compiler back-end / installers

    Neil Trevett [email protected]

    mailto:[email protected] Extraordinary User Experiences on Mobile DevicesA New Era in ComputingWhy is Mobile Parallelism Needed?Augmented Reality with GPU ParallelismMobile SOC Performance IncreasesMobile GPU Compute Today (almost) Power is the New Design LimitMobile Thermal Design PointHow to Save Power?Mobile Parallel Compute Use Case Pipeline Camera Sensor ProcessingDark SiliconLow Power Environment ScanningAdvanced Camera Control Use CasesCamera Control API Complements AccelerationPrecursor APIs for Camera Control InitiativeCamera Control API UsageOpenVXOpenVX and OpenCV are ComplementaryOpenVX Power EfficiencyAndroid Three Layer EcosystemAPIs for Android GPU Compute and GraphicsOpenCL as Parallel Compute FoundationVisual Sensor RevolutionPaths to Low Power Parallelism