Embedded OpenCV Acceleration

Dario Pennisi

Introduction

Open-Source Computer Vision Library

• Over 2500 algorithms and functions
• Cross platform, portable API
  • Windows, Linux, OS X, Android, iOS
• Real Time performance
• BSD license
• Professionally developed and maintained

19/04/2023

History

• Launched in 1999 by Intel
  • Showcasing Intel Performance Library
• First Alpha released in 2000
• 1.0 version released in 2006
• Corporate support by Willow Garage in 2008
• 2.0 version released in 2009
  • Improved C++ interfaces
  • Releases every 6 months
• In 2014 taken over by Itseez
• 3.0 in beta now
  • Drop C API support


Building blocks to ease vision applications

OpenCV

Application structure


Image Retrieval, Pre Processing, Feature Extraction, Object Detection
(highgui, imgproc, features2d, objdetect)

Recognition, Reconstruction, Analysis, Decision Making
(calib3d, video, stitching, ml)

Environment

Application: C++, Java, Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, GCD, TBB, OpenMP, CStripes
Acceleration: SSE/AVX/NEON, OpenCL, CUDA
OS

Desktop vs Embedded


                   Desktop                    Industrial Embedded
Cores/Threads      8/16                       4/4
Core Frequency     >4GHz                      >1.4GHz
L1 Cache           32K+32K                    32K+32K
L2 Cache           256K per core              2M shared
L3 Cache           20M                        -
DDR Controllers    4x64 Bit DDR4 @ 1066 MHz   2x32 Bit DDR3 @ 800MHz
TDP                140W (CPU)                 10W (SoC)
GPU cores          2880                       1+4+16
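
To put the DDR row in perspective, the peak theoretical bandwidth of each memory system can be estimated. This is only a rough sketch: it assumes the quoted MHz figures are I/O clock rates with two transfers per clock (double data rate). If they are transfer rates instead, both results halve and the ratio stays the same. The function name is illustrative.

```python
def peak_bandwidth_gb_s(controllers, bus_bits, clock_mhz, transfers_per_clock=2):
    """Peak theoretical bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: clock_mhz is the I/O clock and data moves on both edges.
    """
    bytes_per_transfer = bus_bits // 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return controllers * bytes_per_transfer * transfers_per_second / 1e9


desktop = peak_bandwidth_gb_s(controllers=4, bus_bits=64, clock_mhz=1066)
embedded = peak_bandwidth_gb_s(controllers=2, bus_bits=32, clock_mhz=800)

print(f"Desktop:  {desktop:.1f} GB/s")
print(f"Embedded: {embedded:.1f} GB/s")
print(f"Ratio:    {desktop / embedded:.1f}x")
```

Under these assumptions the desktop system has roughly five times the peak memory bandwidth of the embedded one, before any sharing between CPU, GPU and DSP is taken into account.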

System Engineering

• Dimensioning system is fundamental
• Understand your algorithm
• Carefully choose your toolbox
• Embedded means no chance for “one size fits all”


Acceleration Strategies

• Optimize Algorithms
  • Profile
  • Optimize
  • Partition (CPU/GPU/DSP)
• FPGA acceleration
  • High level synthesis
  • Custom DSP
  • RTL coding
• Brute Force
  • Increase number of CPUs
  • Increase CPU Frequency
• Accelerated libraries
  • NEON
  • OpenCL/CUDA


Bottlenecks


Know your enemy

Memory

• Access to external memory is expensive
  • CPU load instructions are slow
  • Memory has latency
  • Memory bandwidth is shared among CPUs
• Cache
  • Prevents the CPU from accessing external memory
  • Data and instruction


Disordered accesses

What happens when we have a cache miss?

• Fetch data from same memory row: 13 clocks
• Fetch data from a different row: 23 clocks
• Cache line usually 32 bytes
  • 8 clocks to fill a line (32 bit data bus)
• Memory bandwidth efficiency
  • 38% on same row
  • 26% on different row
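
The efficiency figures above can be reproduced with a small calculation. The sketch assumes that the 13 and 23 clocks are access latency paid before the 8-clock line fill begins, so efficiency is the share of clocks in which data is actually transferred. That reading is an assumption, but it matches the quoted percentages.

```python
LINE_BYTES = 32   # cache line size from the slide
BUS_BYTES = 4     # 32 bit data bus
FILL_CLOCKS = LINE_BYTES // BUS_BYTES  # 8 clocks to transfer one line


def bandwidth_efficiency(latency_clocks):
    """Share of clocks that move data when each line fill pays the full latency."""
    return FILL_CLOCKS / (FILL_CLOCKS + latency_clocks)


same_row = bandwidth_efficiency(13)
different_row = bandwidth_efficiency(23)

print(f"Same row:      {same_row:.0%}")
print(f"Different row: {different_row:.0%}")
```

In other words, with disordered accesses roughly two thirds to three quarters of the available memory bandwidth is spent waiting rather than transferring pixels.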

Bottlenecks - Cache

• 1920x1080 YCbCr 4:2:2 (Full HD): 4 MB
  • Double the size of the biggest ARM L2 cache
• 1280x720 YCbCr 4:2:2 (HD): 1.8 MB
  • Just fits L2 cache… OK if reading and writing to the same frame
• 720x576 YCbCr 4:2:2 (SD): 800 KB
  • 2 images in L2 cache…
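
The frame sizes above follow from the pixel format. As a minimal sketch, assuming 8-bit samples (so YCbCr 4:2:2 averages 2 bytes per pixel) and a 2 MB L2 cache as in the "2M shared" figure of the embedded column earlier:

```python
L2_CACHE_BYTES = 2 * 1024 * 1024  # assumption: the 2M shared L2 quoted earlier


def frame_bytes(width, height, bytes_per_pixel=2):
    """Size of one frame; YCbCr 4:2:2 with 8-bit samples is 2 bytes per pixel."""
    return width * height * bytes_per_pixel


formats = {
    "Full HD": (1920, 1080),
    "HD": (1280, 720),
    "SD": (720, 576),
}

for name, (width, height) in formats.items():
    size = frame_bytes(width, height)
    whole_frames = L2_CACHE_BYTES // size
    print(f"{name:8s} {size / 1e6:5.2f} MB, whole frames that fit in L2: {whole_frames}")
```

A Full HD frame does not fit at all, a single HD frame fits with little room to spare, and only at SD resolution can an input and an output frame share the cache.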


OpenCV Algorithms

• Mostly designed for PCs
• Well structured
• General purpose
• Optimized functions for SSE/AVX
• Relatively optimized
• Small number of accelerated functions
  • NEON
  • CUDA (NVIDIA GPU/Tegra)
  • OpenCL (GPU, multicore processors)


Multicore ARM/NEON

• NEON SIMD instructions work on vectors of registers
• Load-process-store philosophy
  • Load/store costs 1 cycle only if in L1 cache
  • 4-12 cycles if in L2
  • 25 to 35 cycles on L2 cache miss
• SIMD instructions can take from 1 to 5 clocks
• Fast clock useless on big datasets/small computation
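
The last point can be illustrated with a deliberately simple cycle model: one vector load followed by one SIMD instruction, with no overlap between them. Real cores pipeline and prefetch, so this is an upper bound on the memory share rather than a measurement. The midpoints chosen for the quoted ranges are an assumption.

```python
# Load costs in cycles from the slide; midpoints are used for the quoted ranges.
LOAD_CYCLES = {"L1 hit": 1, "L2 hit": 8, "L2 miss": 30}


def memory_share(load_cycles, simd_cycles):
    """Fraction of cycles spent on the load rather than on the SIMD instruction."""
    return load_cycles / (load_cycles + simd_cycles)


for where, load in LOAD_CYCLES.items():
    for simd in (1, 5):
        share = memory_share(load, simd)
        print(f"{where:8s} SIMD = {simd} clk: {share:.0%} of cycles on memory")
```

When data streams from external memory and each vector needs only a cheap operation, almost all cycles go to the load, so raising the core clock changes very little.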


Generic DSP

• Very similar to ARM/NEON
• High speed pipeline impaired by inefficient memory access subsystem
• When smart DMA is available it is very complex to program
• When DSP is integrated in SoC it shares ARM’s bandwidth


OpenCL on GPU

• OpenCL on Vivante GC2000
  • Claimed capability up to 16 GFLOPS
• Real applications
  • Only on internal registers: 13.8 GFLOPS
  • Computing 1000x1000 matrix: 600 MFLOPS
• Bandwidth and inefficiencies
  • Only 1K local memory and 64 byte memory cache
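
Expressed as a share of the claimed peak, the two measurements above differ by more than an order of magnitude. The short calculation below uses only the figures quoted on the slide; the dictionary keys are illustrative labels.

```python
CLAIMED_GFLOPS = 16.0

measured_gflops = {
    "internal registers only": 13.8,
    "1000x1000 matrix": 0.6,  # 600 MFLOPS
}

# Fraction of the claimed peak that each workload actually reaches.
utilisation = {name: value / CLAIMED_GFLOPS for name, value in measured_gflops.items()}

for name, share in utilisation.items():
    print(f"{name:24s} {share:.1%} of claimed peak")
```

The compute units get close to the claimed figure only while data stays in registers; once the working set has to go through memory, throughput drops to a few percent of peak.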


OpenCL on FPGA

• Same code can run on FPGA and GPU
• Transform selected functions in hardware
• Automated memory access coalescing
• Each function requires dedicated logic
  • Large FPGAs required
  • Partial reconfiguration may solve this
• Significant compilation time


HLS on FPGA

• High Level Synthesis
  • Convert C to hardware
• HLS requires code to be heavily modified
  • Pragmas to instruct compiler
  • Code restructuring
  • Not portable anymore
• Each function requires dedicated logic
  • Large FPGAs required
  • Partial reconfiguration may solve this
• Significant compilation time

A different approach

Demanding algorithms on low cost/power HW


Algorithm Analysis

Memory Access Pattern, Data intensive processing, Decision Making

DMA, DSP, NEON, ARM program, Custom Instruction (RTL)

External co-processing

Diagram: ARM, GPU, Memory, PCIe, FPGA, FPGA Memory

Co-processor details

• FPGA Co-Processor
  • Separate memory
    • Adds bandwidth
    • Reduces access conflict
  • Algorithm aware DMA
    • Access memory in ordered way
    • Add caching through embedded RAM
  • Algorithm specific processors
    • HLS/OpenCL synthesized IP blocks
    • DSP with custom instructions
    • Hardcoded IP blocks

Diagram: Memory, DMA Processor, Block capture, DPRAM(s), DSP core(s), DSP core/IP Block, ARM

Co-processor details

• Flex DMA
  • Dedicated processor with DMA custom instruction
  • Software defined memory access pattern
• Block Capture
  • Extracts data for each tile
• DPRAM
  • Local, high speed cache
• DSP Core
  • Dedicated processor with algorithm specific custom instructions
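
The idea of a software defined access pattern feeding a tile into local memory can be sketched in software. This is not the Flex DMA programming model, which the slides do not describe. The functions `tile_bursts` and `capture_tile` are hypothetical names, and a Python `bytes` object stands in for both external memory and the DPRAM.

```python
def tile_bursts(base, stride, x0, y0, tile_w, tile_h, bytes_per_pixel=1):
    """Yield (start_address, length) for each row of a tile in a row-major frame.

    Each tuple is one ordered, contiguous burst, which is what keeps accesses
    within the same memory row as long as possible.
    """
    for row in range(y0, y0 + tile_h):
        start = base + row * stride + x0 * bytes_per_pixel
        yield start, tile_w * bytes_per_pixel


def capture_tile(memory, stride, x0, y0, tile_w, tile_h, bytes_per_pixel=1):
    """Copy one tile into a contiguous local buffer (a stand-in for the DPRAM)."""
    local = bytearray()
    for start, length in tile_bursts(0, stride, x0, y0, tile_w, tile_h, bytes_per_pixel):
        local += memory[start:start + length]
    return bytes(local)


# An 8x8 single-byte "frame" whose pixel values equal their linear address.
frame = bytes(range(64))
tile = capture_tile(frame, stride=8, x0=2, y0=1, tile_w=3, tile_h=2)
print(list(tile))
```

The processing core then only ever sees the small contiguous buffer, while the address generation that decides which bytes to fetch stays programmable.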

Diagram: Memory, Flex DMA, Block capture, DPRAM(s), DSP core(s), DSP core/IP Block, ARM

OpenVX

Environment

Application: C++, Java, Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, GCD, TBB, OpenMP, CStripes
Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA
OS

OpenVX


Diagram: Memory → Node1 → Memory → Node2 → Memory, compared with Memory → Node1 → Node2 → Memory

OpenVX Graph Manager

• Graph Construction
  • Allocates resources
  • Logical representation of algorithm
• Graph Execution
  • Concatenate nodes avoiding memory storage
• Tiling extensions
  • Single node execution can be split in multiple tiles
  • Multiple accelerators executing single task in parallel
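
The benefit of concatenating nodes and tiling can be shown with a toy model. This is not the OpenVX API: `run_unfused` and `run_fused` are hypothetical helpers, the image is a flat list, and the nodes are restricted to point-wise operations (a node that needs neighbouring pixels would also need overlapping tiles).

```python
def run_unfused(image, nodes):
    """Run each node over the whole frame, storing a full frame after every node."""
    frames_written = 0
    data = list(image)
    for node in nodes:
        data = [node(pixel) for pixel in data]
        frames_written += 1  # intermediate or final frame goes back to memory
    return data, frames_written


def run_fused(image, nodes, tile_size):
    """Chain all nodes on one tile at a time; only the final result is stored."""
    output = []
    for start in range(0, len(image), tile_size):
        tile = image[start:start + tile_size]
        for node in nodes:
            tile = [node(pixel) for pixel in tile]  # tile stays in local memory
        output.extend(tile)
    return output, 1  # a single output frame is written


image = list(range(10))
nodes = [lambda p: p + 1, lambda p: p * 2]

unfused_result, unfused_frames = run_unfused(image, nodes)
fused_result, fused_frames = run_fused(image, nodes, tile_size=4)

print(unfused_result == fused_result, unfused_frames, fused_frames)
```

Both schedules produce the same pixels, but the fused one writes a single frame to memory. Since the tiles are independent, they could also be handed to several accelerators at once.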


Summary

• OpenCV today is mainly PC oriented.

• ARM, CUDA, OpenCL support growing

• Existing acceleration only on selected functions

• Embedded CV requires good partitioning among resources

• When ASSPs are not enough, FPGAs are key

• OpenVX provides a consistent HW acceleration platform, not only for OpenCV


What we learnt

Questions


Thank you
