Embedded OpenCV Acceleration
Dario Pennisi
Introduction
Open-Source Computer Vision Library
Over 2500 algorithms and functions
Cross platform, portable API
• Windows, Linux, OS X, Android, iOS
Real Time performance
BSD license
Professionally developed and maintained
19/04/2023
History
Launched in 1999 by Intel
• Showcasing Intel Performance Library
First Alpha released in 2000
1.0 version released in 2006
Corporate support by Willow Garage in 2008
2.0 version released in 2009
• Improved C++ interfaces
• Releases every 6 months
In 2014 taken over by Itseez
3.0 in beta now
• Drops C API support
OpenCV
Building blocks to ease vision applications
Application structure
Stages: Image Retrieval, Pre Processing, Feature Extraction, Object Detection
OpenCV modules: highgui, imgproc, features2d, objdetect
Stages: Recognition, Reconstruction, Analysis, Decision Making
OpenCV modules: calib3d, video, stitching, ml
Environment
Application: C++, Java, Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, GCD, TBB, OpenMP, CStripes
OS
Acceleration: SSE/AVX/NEON, OpenCL, CUDA
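The cv::parallel_for_ layer in the stack above splits a range of work into stripes and hands each stripe to one of the threading backends. The sketch below only illustrates that partitioning idea in plain Python; `split_into_stripes`, `parallel_for_rows` and `invert_rows` are hypothetical names, not OpenCV calls, and Python threads are used here for clarity rather than speed.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_stripes(total_rows, n_stripes):
    """Split [0, total_rows) into contiguous (start, end) ranges of near-equal size."""
    base, extra = divmod(total_rows, n_stripes)
    stripes, start = [], 0
    for i in range(n_stripes):
        end = start + base + (1 if i < extra else 0)
        if end > start:  # skip empty stripes when there are more workers than rows
            stripes.append((start, end))
        start = end
    return stripes


def invert_rows(image, start, end):
    """Loop body: invert the 8-bit pixels of rows start..end-1 in place."""
    for y in range(start, end):
        image[y] = [255 - p for p in image[y]]


def parallel_for_rows(image, body, n_workers=4):
    """Run body(image, start, end) once per stripe, each stripe on a worker thread."""
    stripes = split_into_stripes(len(image), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(body, image, s, e) for s, e in stripes]
        for f in futures:
            f.result()  # re-raise any exception from a worker


# Small synthetic image: 10 rows of 8 pixels.
image = [[x + y for x in range(8)] for y in range(10)]
expected = [[255 - p for p in row] for row in image]
parallel_for_rows(image, invert_rows)
```

Each stripe touches a disjoint set of rows, so the workers need no locking; that independence is what lets the same loop body run on any of the listed backends.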
Desktop vs Embedded
                   Desktop                    Industrial Embedded
Cores/Threads      8/16                       4/4
Core Frequency     >4GHz                      >1.4GHz
L1 Cache           32K+32K                    32K+32K
L2 Cache           256K per core              2M shared
L3 Cache           20M                        -
DDR Controllers    4x64 Bit DDR4 @ 1066 MHz   2x32 Bit DDR3 @ 800 MHz
TDP                140W (CPU)                 10W (SoC)
GPU cores          2880                       1+4+16
System Engineering
Dimensioning system is fundamental
Understand your algorithm
Carefully choose your toolbox
Embedded means no chance for “one size fits all”
Acceleration Strategies
Optimize Algorithms
• Profile
• Optimize
• Partition (CPU/GPU/DSP)
FPGA acceleration
• High level synthesis
• Custom DSP
• RTL coding
Brute Force
• Increase number of CPUs
• Increase CPU Frequency
Accelerated libraries
• NEON
• OpenCL/CUDA
Bottlenecks
Know your enemy
Memory
Access to external memory is expensive
• CPU load instructions are slow
• Memory has latency
• Memory bandwidth is shared among CPUs
Cache
• Keeps the CPU from accessing external memory
• Data and instruction
Disordered accesses
What happens when we have a cache miss?
• Fetch data from same memory row: 13 clocks
• Fetch data from a different row: 23 clocks
Cache line usually 32 bytes
• 8 clocks to fill a line (32 bit data bus)
Memory bandwidth efficiency
• 38% on same row
• 26% on different row
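The efficiency figures above are consistent with a simple ratio: clocks spent transferring the cache line divided by transfer clocks plus miss latency. The sketch below reproduces that arithmetic; reading the slide's numbers this way is an assumption, and `bandwidth_efficiency` is a hypothetical helper name.

```python
def bandwidth_efficiency(fill_clocks, latency_clocks):
    """Fraction of clocks that actually move data during one cache line fill."""
    return fill_clocks / (fill_clocks + latency_clocks)


LINE_BYTES = 32  # cache line size from the slide
BUS_BYTES = 4    # 32 bit data bus
fill_clocks = LINE_BYTES // BUS_BYTES  # one bus word per clock

same_row = bandwidth_efficiency(fill_clocks, 13)       # 8 / (8 + 13)
different_row = bandwidth_efficiency(fill_clocks, 23)  # 8 / (8 + 23)

print(f"fill clocks:   {fill_clocks}")
print(f"same row:      {same_row:.0%}")
print(f"different row: {different_row:.0%}")
```

Under this reading, every disordered access pays the full latency for only 32 bytes of payload, which is why ordered access patterns matter so much on a narrow bus.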
Bottlenecks - Cache
1920x1080 YCbCr 4:2:2 (Full HD): 4MB
• Double the size of the biggest ARM L2 cache
1280x720 YCbCr 4:2:2 (HD): 1.8MB
• Just fits L2 cache… ok if reading and writing to the same frame
720x576 YCbCr 4:2:2 (SD): 800KB
• 2 images in L2 cache…
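The frame sizes above follow from 2 bytes per pixel for 8-bit YCbCr 4:2:2. The sketch below recomputes them and compares against a 2 MB L2 cache; the 8-bit sample depth and the choice of the 2M shared L2 from the Desktop vs Embedded comparison are assumptions, and `frame_bytes` is a hypothetical helper.

```python
BYTES_PER_PIXEL_422 = 2      # 8-bit 4:2:2: one luma byte plus one shared chroma byte per pixel
L2_BYTES = 2 * 1024 * 1024   # assumed 2M shared L2 cache


def frame_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL_422):
    """Size in bytes of one uncompressed frame."""
    return width * height * bytes_per_pixel


formats = {
    "Full HD": (1920, 1080),
    "HD": (1280, 720),
    "SD": (720, 576),
}

for name, (w, h) in formats.items():
    size = frame_bytes(w, h)
    whole_frames = L2_BYTES // size  # how many complete frames fit in L2
    print(f"{name}: {size} bytes, {size / L2_BYTES:.2f}x L2, {whole_frames} whole frame(s) fit")
```

With one HD frame nearly filling the cache, any algorithm that reads one frame and writes another evicts its own input, which is the point of the "same frame" remark.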
OpenCV Algorithms
Mostly designed for PCs
Well structured
General purpose
Optimized functions for SSE/AVX
Relatively optimized
Small number of accelerated functions
• NEON
• CUDA (NVIDIA GPU/Tegra)
• OpenCL (GPU, Multicore processors)
Multicore ARM/NEON
NEON SIMD instructions work on vectors of registers
Load-process-store philosophy
Load/store costs 1 cycle only if in L1 cache
• 4-12 cycles if in L2
• 25 to 35 cycles on L2 cache miss
SIMD instructions can take from 1 to 5 clocks
Fast clock useless on big datasets/small computation
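A rough way to see why a fast clock does not help on big datasets with little computation is to compare the cycles of one SIMD instruction against the load and store around it. The sketch below uses a deliberately naive serial model (no pipelining, prefetching or overlap, which real cores do have) and takes the upper bound of each cycle range quoted above; both choices are assumptions, and `compute_share` is a hypothetical helper.

```python
def compute_share(load_cycles, simd_cycles, store_cycles):
    """Share of cycles spent in the SIMD instruction for one load-process-store step."""
    return simd_cycles / (load_cycles + simd_cycles + store_cycles)


SIMD_CYCLES = 5  # upper bound of the 1 to 5 clock range

cases = {
    "L1 hit": compute_share(1, SIMD_CYCLES, 1),
    "L2 hit": compute_share(12, SIMD_CYCLES, 12),
    "L2 miss": compute_share(35, SIMD_CYCLES, 35),
}

for name, share in cases.items():
    print(f"{name}: {share:.0%} of cycles spent computing")
```

Once the working set falls out of cache, most cycles in this model are spent waiting on memory, so raising the core clock mainly increases the number of cycles spent waiting.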
Generic DSP
Very similar to ARM/NEON
High speed pipeline impaired by inefficient memory access subsystem
When smart DMA is available it is very complex to program
When DSP is integrated in SoC it shares ARM’s bandwidth
OpenCL on GPU
OpenCL on Vivante GC2000
• Claimed capability up to 16 GFLOPS
Real applications
• Only on internal registers: 13.8 GFLOPS
• Computing 1000x1000 matrix: 600 MFLOPS
Bandwidth and inefficiencies:
• Only 1K local memory and 64 byte memory cache
OpenCL on FPGA
Same code can run on FPGA and GPU
Transform selected functions in hardware
Automated memory access coalescing
Each function requires dedicated logic
• Large FPGAs required
• Partial reconfiguration may solve this
Significant compilation time
HLS on FPGA
High Level Synthesis
• Convert C to hardware
HLS requires code to be heavily modified
• Pragmas to instruct compiler
• Code restructuring
• Not portable anymore
Each function requires dedicated logic
• Large FPGAs required
• Partial reconfiguration may solve this
Significant compilation time
A different approach
Demanding algorithms on low cost/power HW
Algorithm Analysis
Memory Access Pattern
Data intensive processing
Decision Making
[Diagram: these three aspects mapped onto DMA, DSP, NEON, ARM program, Custom Instruction (RTL)]
External co-processing
[Diagram labels: ARM, GPU, Memory, FPGA, FPGA Memory, PCIe]
Co-processor details
FPGA Co-Processor
Separate memory
• Adds bandwidth
• Reduces access conflict
Algorithm aware DMA
• Access memory in ordered way
• Add caching through embedded RAM
Algorithm specific processors
• HLS/OpenCL synthesized IP blocks
• DSP with custom instructions
• Hardcoded IP blocks
[Block diagram labels: Memory, DMA Processor, Block capture, DPRAM(s), DSP core(s), DSP core/IP Block, ARM]
Co-processor details
Flex DMA
• Dedicated processor with DMA custom instruction
• Software defined memory access pattern
Block Capture
• Extracts data for each tile
DPRAM
• Local, high speed cache
DSP Core
• Dedicated processor with algorithm specific custom instructions
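The flow described above (ordered access, per-tile block capture, a small local buffer, then a processing core) can be mimicked in software to make the data movement concrete. The sketch below is only an analogy in plain Python: `iter_tiles`, `capture_block` and `process_tiled` are hypothetical names, the local list stands in for DPRAM, and the threshold kernel is an arbitrary example, not something taken from the slides.

```python
def iter_tiles(width, height, tile_w, tile_h):
    """Yield (x0, y0, x1, y1) rectangles covering the image in raster order."""
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            yield x0, y0, min(x0 + tile_w, width), min(y0 + tile_h, height)


def capture_block(image, rect):
    """Copy one tile into a small local buffer (software stand-in for DPRAM)."""
    x0, y0, x1, y1 = rect
    return [row[x0:x1] for row in image[y0:y1]]


def threshold(tile, level=128):
    """Example per-tile kernel: binarize 8-bit pixels."""
    return [[255 if p >= level else 0 for p in row] for row in tile]


def process_tiled(image, tile_w, tile_h, kernel):
    """Process the image tile by tile and write each result tile back."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for rect in iter_tiles(width, height, tile_w, tile_h):
        local = capture_block(image, rect)   # ordered fetch of one tile
        result = kernel(local)               # work only on local data
        x0, y0, x1, _ = rect
        for dy, row in enumerate(result):
            out[y0 + dy][x0:x1] = row        # write the tile back
    return out


image = [[(x * 16 + y * 8) % 256 for x in range(10)] for y in range(6)]
tiled = process_tiled(image, 4, 4, threshold)
```

For a pointwise kernel the tiled result equals the whole-image result; kernels with a neighbourhood (filters, for instance) would additionally need overlapping tile borders, which this sketch does not handle.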
[Block diagram labels: Memory, Flex DMA, Block capture, DPRAM(s), DSP core(s), DSP core/IP Block, ARM]
Environment
Application: C++, Java, Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, GCD, TBB, OpenMP, CStripes
OS
OpenVX
Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA
OpenVX
[Diagram: Memory -> Node1 -> Memory -> Node2 -> Memory, compared with Memory -> Node1 -> Node2 -> Memory]
OpenVX Graph Manager
Graph Construction
• Allocates resources
• Logical representation of algorithm
Graph Execution
• Concatenate nodes avoiding memory storage
Tiling extensions
• Single node execution can be split in multiple tiles
• Multiple accelerators executing single task in parallel
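The benefit of concatenating nodes can be shown with a toy graph that runs the same two nodes either one at a time over the whole image or chained row by row. This is a plain Python illustration, not the OpenVX API: `ToyGraph`, its methods and the two row-wise nodes are hypothetical, and the approach as written only works for nodes that need no neighbouring rows.

```python
class ToyGraph:
    """Toy stand-in for a graph manager; not the OpenVX API."""

    def __init__(self):
        self.nodes = []

    def add_node(self, fn):
        """Graph construction: record the node, do no work yet."""
        self.nodes.append(fn)
        return self

    def run_node_by_node(self, rows):
        """Each node processes the whole image and stores a full intermediate image."""
        images_written = 0
        for fn in self.nodes:
            rows = [fn(row) for row in rows]
            images_written += 1
        return rows, images_written

    def run_concatenated(self, rows):
        """All nodes are applied to one row (tile) at a time; only the output is stored."""
        out = []
        for row in rows:
            for fn in self.nodes:
                row = fn(row)
            out.append(row)
        return out, 1


def brighten(row):
    """Add a fixed offset, saturating at 255."""
    return [min(p + 40, 255) for p in row]


def threshold(row):
    """Binarize 8-bit pixels."""
    return [255 if p >= 128 else 0 for p in row]


image = [[(x * 20 + y * 30) % 256 for x in range(8)] for y in range(4)]
graph = ToyGraph().add_node(brighten).add_node(threshold)

separate, n_separate = graph.run_node_by_node(image)
fused, n_fused = graph.run_concatenated(image)
```

Both paths give the same output, but the concatenated path writes one full image instead of one per node, which is the memory traffic the graph manager is meant to avoid.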
Summary
• OpenCV today is mainly PC oriented.
• ARM, CUDA, OpenCL support is growing
• Existing acceleration only on selected functions
• Embedded CV requires good partitioning among resources
• When ASSPs are not enough, FPGAs are key
• OpenVX provides a consistent HW acceleration platform, not only for OpenCV
What we learnt
Questions
Thank you