"making computer vision software run fast on your embedded platform," a presentation from...


Page 1: "Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT

Copyright © 2016 LUXOFT 1

Alexey Rybakov, LUXOFT

May 3, 2016

Making Computer Vision Software Run Fast on Your Embedded Platform

Art and Science of Optimization

Why LUXOFT is Giving This Talk

Global Software Engineering:

• Low-Power GPU Software

• Custom Vision Software

10,000+ Luxoft software engineers

Our Optimization Projects Covered in This Talk

• Obstruction Removal for Drones

• CAFFE on ARM Mali

• OpenCV on ImgTec PowerVR

• HDR Encoding on GPU-based

• Low-power Motion Stabilization

• GPU-optimized 4K VP9 video codec

• See demos at our booth

[Image captions: Drone Vision, Fast OpenCV, HDR on GPU, Caffe on GPU, Stabilization, Fast 4K Codecs]

Poll

• Qualifying question: Who develops computer vision software?

• Typical situations in embedded SW development:

  • Great new algorithm → Implement

  • Implementation platform: Desktop-class vs. Embedded*

  • Decision making: Delayed vs. Real-time*

  • Performance: Low FPS vs. High FPS*

* Context of this presentation

Embedded Vision: Challenges and Opportunities

• Need: reliable, real-time, on-device decision-making from visual data, implemented on a constrained HW platform (with exotic architecture)

• What to do

  1. Map CV pipeline onto HW platform

  2. Rethink system requirements

  3. Rework algorithm logic

  4. Use GPU, DSP and other aid (properly!)

  5. Code optimization

  6. Know your platform inside out

1. Map CV Pipeline onto HW Platform

Embedded Vision: Pipeline and Hardware

Study and Test HW Platform

Evaluate your platform:

• Hardware features and accelerators, slow/fast memory, power management?

• Support from run-time: OS, drivers, OpenCL, CUDA, other frameworks?

• Toolchain: compiler, debugger, profiler, [access to] documentation, optimization guides?

• Available CV frameworks: OpenCV, IPP, FastCV, other?

Benchmark your embedded platform vs. reference:

• Run simple tests: data copy, access, vectorization, memory use, energy management

• Test if CV-framework functions are optimized (coverage is often low)

This will give you a measured optimization goal.
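
A "simple test" of the data-copy kind can be very small. The sketch below is an illustration only, not a benchmark from the talk: it times a plain buffer copy in Python and reports the best observed bandwidth. The function name `copy_bandwidth_gb_s`, the 8 MB buffer size, and the repeat count are assumptions chosen for the example; on a real target you would run the equivalent test in the language and memory types your pipeline actually uses.

```python
import time


def copy_bandwidth_gb_s(num_bytes=8 * 1024 * 1024, repeats=5):
    """Return the best observed bandwidth (GB/s) for copying num_bytes once."""
    src = bytearray(num_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full read of src plus one full write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(dst) == num_bytes
    # Guard against a timer too coarse to measure a very small copy.
    return num_bytes / max(best, 1e-9) / 1e9


if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gb_s():.2f} GB/s")
```

Running the same test on the reference machine and on the embedded target gives a first, rough ratio to compare against before any vision code is ported.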

Mapping to Platform: Histogram Example

[Figure: Option A vs. Option B, two mappings of the pipeline Camera, Histo*, Histo equalization**, Apply LUT onto CPU and GPU. Legend: GPU processing, CPU processing, memory transfers. Timings shown: 2 ms, 4.2 ms, 16.2 ms. Transfers shown: one 2 MB data transfer (HD frame) and three 1 KB data transfers.]

Memory transfers:

• HOST → GPU = 1.33 GB/s

• GPU → HOST = 0.11 GB/s

SoC: Intel Merrifield platform. Device: Dell Venue 3840.

* Histogram collection on CPU is more than 2 times faster than on GPU.

** Histogram equalization is a single-threaded, iterative histogram processing step, so a GPU implementation is not reasonable.
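
The transfer sizes and bandwidths on this slide are enough for a back-of-the-envelope estimate of what each copy costs. The sketch below is only that arithmetic; it assumes decimal units (1 GB = 10^9 bytes, 2 MB = 2,000,000 bytes, 1 KB = 1,000 bytes), which the slide does not specify, and it does not try to reproduce the per-option timings shown in the figure.

```python
HOST_TO_GPU_GB_S = 1.33  # from the slide
GPU_TO_HOST_GB_S = 0.11  # from the slide

FRAME_BYTES = 2_000_000  # "2 MB data transfer (HD frame)", decimal units assumed
SMALL_BYTES = 1_000      # "1 KB data transfer", decimal units assumed


def transfer_ms(num_bytes, gb_per_s):
    """Milliseconds needed to move num_bytes at gb_per_s (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    print(f"frame, host to GPU: {transfer_ms(FRAME_BYTES, HOST_TO_GPU_GB_S):.2f} ms")
    print(f"frame, GPU to host: {transfer_ms(FRAME_BYTES, GPU_TO_HOST_GB_S):.2f} ms")
    print(f"1 KB,  GPU to host: {transfer_ms(SMALL_BYTES, GPU_TO_HOST_GB_S):.4f} ms")
```

Under these assumptions a full frame read back from the GPU costs roughly 18 ms, more than a whole 60 FPS frame period, while a 1 KB histogram or lookup table costs about a hundredth of a millisecond. That asymmetry is why the choice of what crosses the CPU/GPU boundary matters more than where each stage runs fastest in isolation.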

2. Rethink System Requirements!

Optimize System Requirements

• Important concept: “Good enough”

• How does your use case differ from classic/desktop requirements?

Art of “controlled worse”

• What decision latency do you need?

• What resolution/precision?

• Do you need the whole frame or only a region?

Rethink Requirements: Obstruction Removal, Drone Edition

• Universal implementation* → Our drone implementation

• Any motion → Linear motion

• Any obstacles → Opaque obstacles

• Have only image data → Use sensor fusion (gyro)

• More than 100X faster!

[Image captions: Camera, Output]

*MIT CSAIL and Google Research, SIGGRAPH 2015

3. Rework Algorithm Logic

Algorithm Optimization Opportunities

Desktop → Embedded:

• High-Res → Downsampling / pyramid

• Color → Monochrome or luminance

• Entire frame → Regions of Interest only

  • ROI cascading example: HOG to DNN

• Every frame → 1/N of frames + approximation

  • Inter-frame cascading: Detection to Tracking

• Image only → Sensor fusion

  • Example: gyro + vision for motion estimation

• CPU → Parallelize for GPU
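
The first three substitutions above (luminance instead of color, one pyramid level, a region of interest) multiply together. The following sketch shows them on a toy frame in plain Python; the function names, the BT.601 luminance weights, the 2x2 averaging filter, and the frame and ROI sizes are illustrative assumptions, not the pipeline used in the talk.

```python
def to_luma(rgb_rows):
    """Collapse (r, g, b) pixels to 8-bit luminance using BT.601 weights."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb_rows]


def downsample2(gray_rows):
    """Halve width and height by averaging each 2x2 block (one pyramid level)."""
    out = []
    for y in range(0, len(gray_rows) - 1, 2):
        top, bottom = gray_rows[y], gray_rows[y + 1]
        out.append([(top[x] + top[x + 1] + bottom[x] + bottom[x + 1]) // 4
                    for x in range(0, len(top) - 1, 2)])
    return out


def crop(gray_rows, x0, y0, width, height):
    """Keep only the region of interest."""
    return [row[x0:x0 + width] for row in gray_rows[y0:y0 + height]]


if __name__ == "__main__":
    frame = [[(x * 30, y * 30, 128) for x in range(8)] for y in range(8)]
    samples_in = 8 * 8 * 3
    roi = crop(downsample2(to_luma(frame)), 1, 1, 2, 2)
    samples_out = sum(len(row) for row in roi)
    print(f"samples per frame: {samples_in} -> {samples_out}")
```

In this toy case the later stages see 4 samples instead of 192. Whether the reduced input is still "good enough" is the requirements question from the previous section, so each reduction should be checked against the accuracy the application needs.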

Optimized Video Stabilization Algorithm

• Motion Vector Field only for 3x3 grid (pyramid downsampling)

• Only shift and rotation

• Inter-frame border reconstruction

• 1000x+ performance

• Real-time 4K UHD on mobile
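
Restricting the motion model to shift and rotation means that nine motion vectors are already more than enough data. The slides do not say how the model is fitted; the sketch below assumes a standard least-squares rigid fit (2D Procrustes without scaling) purely as an illustration. The names `grid_3x3` and `fit_shift_rotation` are hypothetical.

```python
import math


def grid_3x3(width, height):
    """Centers of a 3x3 grid of cells covering a width x height frame."""
    return [((i + 0.5) * width / 3, (j + 0.5) * height / 3)
            for j in range(3) for i in range(3)]


def fit_shift_rotation(points, vectors):
    """Least-squares angle (radians) and shift (tx, ty) such that
    R(angle) * p + shift approximates p + v for every point p and vector v."""
    n = len(points)
    targets = [(px + vx, py + vy) for (px, py), (vx, vy) in zip(points, vectors)]
    pcx = sum(p[0] for p in points) / n
    pcy = sum(p[1] for p in points) / n
    qcx = sum(q[0] for q in targets) / n
    qcy = sum(q[1] for q in targets) / n
    cos_sum = sin_sum = 0.0
    for (px, py), (qx, qy) in zip(points, targets):
        ax, ay = px - pcx, py - pcy   # source point relative to its centroid
        bx, by = qx - qcx, qy - qcy   # target point relative to its centroid
        cos_sum += ax * bx + ay * by  # dot product
        sin_sum += ax * by - ay * bx  # 2D cross product
    angle = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(angle), math.sin(angle)
    shift = (qcx - (c * pcx - s * pcy), qcy - (s * pcx + c * pcy))
    return angle, shift


if __name__ == "__main__":
    pts = grid_3x3(3840, 2160)
    true_angle, true_shift = 0.01, (3.0, -2.0)
    c, s = math.cos(true_angle), math.sin(true_angle)
    vecs = [(c * x - s * y + true_shift[0] - x, s * x + c * y + true_shift[1] - y)
            for x, y in pts]
    print(fit_shift_rotation(pts, vecs))
```

The fit is a handful of multiply-adds over nine points, independent of frame resolution, which is consistent with the idea that the expensive part of stabilization becomes estimating the nine vectors rather than solving for the camera motion.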

4. Use GPU and Other Aid (Properly)

Use GPU. Properly

• Good news: computer vision is very parallelizable

• Bad news: coordination between CPU and GPU (and other compute devices) is the tricky part

• GPU: What to do (beyond algorithm-to-platform mapping and reworked logic)

  • A few simple rules: memory types, datatypes, workgroup size, memory alignment

  • Master the art of kernel synchronization: load your cores

  • Use GPU pre-optimized libraries, like OpenCV on some platforms

  • Master OpenCL

• Also explore available ISP or DSP benefits.

GPU, Two Key Concepts

1. Memory Hierarchy

2. Task Synchronization

• Example of both: Large Matrix Transpose

ARM Mali Accelerated CAFFE

Original. All FPS measured on Galaxy S7:

• Run existing DNN framework: CAFFE

• = 0.7 FPS (EIGEN OpenCL library)

CPU Optimization (not a through road):

• Optimized version for Android: DNN-optimized OpenBLAS, OpenMP and NEON: +2 FPS

GPU Optimizations:

• Better OpenCL implementation on ViennaCL library: +0.5 FPS

• Found bottleneck: SGEMM functions

• Rewrite SGEMM (workgroup size, vectorization, etc.): +4.5 FPS

Final optimized performance: 5-6 FPS

Configuration                            FPS
Open Source CPU, 1 thread                0.7
Open Source GPU, OpenCL (ViennaCL)       1.2
Open Source CPU, multithreaded, NEON     2.5
LUXOFT                                   5.4
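
The deck does not show the rewritten SGEMM kernel. As a rough CPU-side illustration of one idea that typically sits behind "workgroup size" tuning, the sketch below computes a matrix product block by block, so that each small tile of the inputs is reused while it is still close to the compute units. The function name `matmul_blocked` and the block size are assumptions; a real OpenCL kernel would map blocks to work-groups and stage tiles in local memory, which plain Python cannot express.

```python
def matmul_blocked(a, b, block=4):
    """Multiply matrices a (n x k) and b (k x m), given as lists of rows,
    visiting the iteration space in block x block x block tiles."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Everything inside touches one tile each of a, b and c.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    print(matmul_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The result does not depend on the block size, only the memory access order does, which is what makes the block size a safe parameter to benchmark per platform.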

ARM Mali Accelerated CAFFE: Benchmarks

[Chart: FPS, CPU load, and battery charge (distinguished by color) for the CPU version and the optimized GPU version (distinguished by line).]

VP9 Video Decoder Optimization for GPU

Original CPU Algorithm

• CPU: Superblock-level parallelism

[Diagram: input frame to output frame through the stages Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop filtering.]

Reworked and Optimized GPU Algorithm

• GPU: Frame-level parallelism

• Uses more memory

[Diagram: the same stages from input frame to output frame, split between CPU processing and GPU processing.]

Optimization result: 2x-5x FPS depending on bitrate.

Platforms: AMD, Intel, NVidia SoCs

5. Code Optimization

Two Enemies: Code and Data

• Two enemies

  1. Computation

  2. Data transfers

• Waste of time = waste of energy

Don’t calculate:

- Use table/lookup functions

- Use polynomial approximations

Use classic techniques:

- Like loop unrolling

- Converting to native data types

Don’t move data:

- Use local and cache memory

- Partition/group DRAM access

Benchmark everything:

- Compiler computation options

- Memory transfers

Controversial example: ARM compiler does it automatically. Some others don’t.
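
"Don't calculate" is easiest to see with an 8-bit per-pixel function: there are only 256 possible inputs, so the function can be evaluated once per input value instead of once per pixel. The sketch below uses gamma correction as the example function; the choice of gamma correction and the names `build_gamma_lut` and `apply_lut` are assumptions made for illustration.

```python
def build_gamma_lut(gamma):
    """Precompute the gamma curve for every possible 8-bit input value."""
    return bytes(round(255 * (i / 255) ** gamma) for i in range(256))


def apply_lut(pixels, lut):
    """Map every byte of pixels through the 256-entry table."""
    return pixels.translate(lut)


if __name__ == "__main__":
    lut = build_gamma_lut(0.5)                   # 256 power evaluations, once
    frame = bytes(range(256)) * (1920 * 1080 // 256)
    corrected = apply_lut(frame, lut)            # one table lookup per pixel
    print(len(corrected), corrected[0], corrected[255])
```

For a 1080p frame this replaces about two million power evaluations with 256. The same trade applies on a GPU or DSP, with the caveat from the slide: benchmark it, because a table that falls out of fast memory can be slower than recomputing.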

OpenCV on ImgTec PowerVR GPU: Histogram Example

OpenCV local contrast for HD camera adjustment in real time

• Existing OpenCV histogram implementations don’t fit into the 1080p frame processing budget (need 16 ms/frame for the entire algorithm chain to obtain 60 FPS)

Things to do

• Experiment

• Benchmark

• Choose the best method

Optimization Results

Histogram Gathering Method                      Time, ms
OpenCV histogram (CPU)                          7.5
OpenCV histogram (GPU)                          4.4
Luxoft-PowerVR (atomic_add to global memory)    0.69
Luxoft-PowerVR (atomic_add to local memory)     7.51
Luxoft-PowerVR (increment at local memory)      3.28
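
The table is the outcome of the "experiment, benchmark, choose" loop: several implementations of the same histogram, all verified to agree, then timed on the target. The sketch below shows the shape of that loop in plain Python. It is a CPU-only analogy: `histogram_direct` stands in for accumulating straight into one shared table, `histogram_partial_merge` for accumulating into per-chunk tables that are merged afterwards, and `pick_fastest` is a hypothetical helper. Its timings say nothing about PowerVR; only the method carries over.

```python
import timeit


def histogram_direct(pixels):
    """Accumulate every pixel straight into one shared 256-bin table."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    return hist


def histogram_partial_merge(pixels, chunk=4096):
    """Accumulate each chunk into its own table, then merge the tables."""
    hist = [0] * 256
    for start in range(0, len(pixels), chunk):
        local = [0] * 256
        for p in pixels[start:start + chunk]:
            local[p] += 1
        for i, count in enumerate(local):
            hist[i] += count
    return hist


def pick_fastest(candidates, pixels, repeats=3):
    """Time every candidate on the same input; return (best name, timings)."""
    timings = {name: min(timeit.repeat(lambda f=fn: f(pixels),
                                       number=1, repeat=repeats))
               for name, fn in candidates.items()}
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    data = bytes(range(256)) * 1024
    methods = {"direct": histogram_direct, "partial+merge": histogram_partial_merge}
    assert histogram_direct(data) == histogram_partial_merge(data)
    best, times = pick_fastest(methods, data)
    print(best, {k: f"{v * 1e3:.2f} ms" for k, v in times.items()})
```

Checking that all candidates return the same histogram before timing them matters: the slide's numbers show that the variant expected to win (local memory) was not the fastest on this GPU, and such surprises are only useful if every variant is known to be correct.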

Example: Fighting Data Transfers

• Example: “memory tiling”

  Tiled memory layout may give 2x-3x performance gain for vision algorithms: 1 DRAM read vs. 4 DRAM reads in matrix transpose

• Reference you need to obtain or produce (will vary by CPU/GPU of your choice)
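
One plausible reading of "1 DRAM read vs. 4 DRAM reads" is the number of memory bursts needed to fetch a small square block of pixels. The sketch below counts them for a row-major layout and for a tiled layout. The 4x4 tile, the 16-element burst, the 64-pixel row, and the function names are all assumptions for illustration; real burst and tile sizes depend on the platform.

```python
def row_major_offset(x, y, width):
    """Element offset of pixel (x, y) in a conventional row-major image."""
    return y * width + x


def tiled_offset(x, y, width, tile=4):
    """Element offset of pixel (x, y) when each tile x tile block is stored
    contiguously (width is assumed to be a multiple of tile)."""
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)
    return tile_index * tile * tile + within_tile


def bursts_touched(offsets, burst=16):
    """Number of distinct burst-sized memory chunks covering the offsets."""
    return len({offset // burst for offset in offsets})


if __name__ == "__main__":
    block = [(x, y) for y in range(4) for x in range(4)]
    print("row-major:", bursts_touched([row_major_offset(x, y, 64) for x, y in block]))
    print("tiled:    ", bursts_touched([tiled_offset(x, y, 64) for x, y in block]))
```

With these numbers the row-major layout needs four bursts for the block (one per image row) and the tiled layout needs one. A transpose reads the image in exactly such blocks, so the layout decides how much of every burst is useful data.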

6. Know Your Platform Inside Out

Platform Specifics

• Things to do

  • Study documentation and optimization guides for your exact HW

  • Again, test/benchmark a feature before you critically rely on it

• What works for you

  • Modern GPUs and DSPs may implement the entire algorithm in 1 instruction

• What works against you

  • Don’t assume everything will work as documented

  • “Fast” memory … may be slow (like early versions of Snapdragon)

  • Great technology … but no documentation and no code examples (like iOS Metal for compute)

Platform Example: AMD GPU for Frame Interpolation

• Motion vector field upsampling, a common task for CV

• OpenCL supports bilinear interpolation of everything

• How to, AMD OpenCL implementation

  • AMD has a QSAD function, the fastest way to compute SAD for blocks

  • Keep MVF in Image2D

  • Use a sampler with CLK_FILTER_LINEAR (bilinear filtering on 2D images)

[Image captions: Basic, Optimized]
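
Keeping the motion vector field in an image object lets the GPU's texture sampler do the interpolation for free. For readers who want to see what that interpolation computes, here is a software sketch of bilinear upsampling of a coarse field of (vx, vy) vectors. It is an illustration, not the AMD implementation: the function names are hypothetical, and the coordinate mapping is corner-aligned for simplicity, whereas a hardware sampler addresses texel centers.

```python
def sample_bilinear(field, x, y):
    """Sample a 2D grid of (vx, vy) vectors at fractional (x, y), clamped to
    the grid edges, blending the four surrounding vectors."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    def lerp(a, b, t):
        return a + (b - a) * t

    top = [lerp(field[y0][x0][i], field[y0][x1][i], fx) for i in (0, 1)]
    bottom = [lerp(field[y1][x0][i], field[y1][x1][i], fx) for i in (0, 1)]
    return (lerp(top[0], bottom[0], fy), lerp(top[1], bottom[1], fy))


def upsample(field, out_w, out_h):
    """Resample a coarse vector field to out_w x out_h (corner-aligned)."""
    h, w = len(field), len(field[0])
    step_x = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    step_y = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    return [[sample_bilinear(field, x * step_x, y * step_y)
             for x in range(out_w)] for y in range(out_h)]


if __name__ == "__main__":
    coarse = [[(0, 0), (2, 4)],
              [(4, 8), (6, 12)]]
    for row in upsample(coarse, 3, 3):
        print(row)
```

Every output vector here costs four reads and three blends per component in software; on the GPU the sampler performs the same blend in fixed-function hardware as part of a single image read.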

Platform Example: Apple iOS Metal for GPU Compute

iOS Metal Compute Findings:

• No code examples for compute, weak documentation = black box

• Only 64 GitHub repos, no serious projects

• Xcode profiler does not work with Metal Compute → use workarounds: manual timer-based profiling

• Vector types are actually not fully supported by the compiler → test everything, then use a workaround: a combined approach with scalars and vectors

Encountered while working on GPU-optimized JPEG-HDR encoding on iPhone

We still achieved about 3x-4x faster JPEG Encode on iPhone … it just took a lot of extra work

Lessons Learned and Resources

Lessons Learned

1. Learn, test, profile, and benchmark every component of your system, including the compiler. Don’t assume.

2. Don’t port 1:1. Rework requirements and algorithm logic too.

3. GPU and other non-CPU compute architectures may give fantastic results.

4. Use parallelization and computer vision frameworks like OpenCL or OpenCV. Rewrite critical parts there as needed.

5. Modern HW platforms implement popular algorithms in one function call. Study platform-specific optimization guides.

6. Sometimes things won’t work as documented. This is normal.

7. Optimization is a mix of art and science. Think outside the box.

Resources

• Embedded Vision Alliance: http://www.embedded-vision.com/

• Platform optimization guides and blog posts from:

  • Altera (now Intel), AMD, ARM, Imagination Technologies, NVidia, Qualcomm, TI

• Luxoft Computer Vision team: [email protected]

Thank you!

LUXOFT Presentation R&D Team:

Aleksandr Bobrovnik

Aleksandr Volkov

Alexey Rybakov

Anton Veselov

Artem Galin

Dmitriy Marenkov

Dmitry Ivanov

Ekaterina Popova

Ihor Starepravo

Ildar Valiev

Marat Gilmutdinov

Nikolay Nemcev

Oleksandr Murovanyi

Sergey Fedorov

Valery Bobrov

Viktor Pasoshnikov

See demos at our booth. And email me too.

Alexey Rybakov

Senior Director, Embedded

LUXOFT, Menlo Park, CA

[email protected]