lecture 15 ryuzo okada - vision processors for embedded computer vision

© 2014 Toshiba Corporation

Vision Processors for Embedded Computer Vision

Ryuzo Okada Corporate R&D Center, Toshiba Corporation

July 18, 2014


• Cameras are becoming ubiquitous

Surveillance Automobile Smartphone

Embedded computer vision

SUBARU Eyesight http://www.subaru.jp/about/technology/story/eyesight/eyesight01.html


• High performance for vision processing

– To provide valuable functions to users

– Real-time processing

• Low power consumption

– To reduce running cost

– At most a few watts, to allow fanless cooling

• Robustness

– Long term operation: 7-10 years

– Outdoor: -40℃ to 85℃

– Shock-proof

A general-purpose CPU is not feasible for embedded computer vision processing

Embedded computer vision: Requirements

High performance/W (e.g. GOPS/W)
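
As a rough worked example of the performance-per-watt figure of merit, the sketch below uses the two numbers quoted in the title of [Tanabe12] (464 GOPS and 620 GOPS/W). The function name is hypothetical and the calculation is only the obvious ratio, not a measurement method.

```python
def implied_power_watts(gops: float, gops_per_watt: float) -> float:
    """Power draw implied by a throughput figure and an efficiency figure."""
    return gops / gops_per_watt


# Figures quoted in the title of [Tanabe12]: 464 GOPS at 620 GOPS/W.
power = implied_power_watts(464.0, 620.0)
print(f"{power:.2f} W")  # roughly three quarters of a watt
```

A budget of a few watts for fanless cooling leaves little headroom, which is why efficiency rather than raw throughput is the quantity to compare.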


• Types of vision processors

• Vision processors for automobiles

– Toshiba’s image recognition LSI, TMPV75 series

• Cloud computing and vision processors

– Surveillance camera

• Future direction and summary

Contents


Logic-circuit-embedded image sensor

• Silicon retina [Mead89]

– simulated the neural layers in the retina using analog circuits

– Early vision processing, e.g. smoothing

• Optical Neurochip [Nitta92]

– implemented a neural network with optical circuits

– Alphabet recognition

Types of vision processors: (1) Vision chip


• Programmable Artificial Retina [Bernard93], Near Sensor Image Processing [Astrom96], Sensory Processing Element [Ishii96]

– Each consists of a photodiode (PD) with a digital processing element (PE)

– Massively parallel (pixel-parallel) processing realized 1 ms visual servo control

• IVP MAPP [Johansson03], Column-parallel vision chip [Nakabo02]

– A PE is assigned to each column of the PD array

Types of vision processors: (1) Vision chip

Vision chips can provide simple functions, e.g. smoothing and motion estimation.


Types of vision processors: (2) Discrete

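[Figure: chart of discrete vision processors. Axis labels: Type (special purpose, general purpose), Flexibility, and Efficiency (e.g. GOPS/W), with "High" and "Low" marks.]

Processors shown: Intel ATOM, NVIDIA Tegra K1, ST Microelectronics EyeQ2, TOSHIBA TMPV75, Texas Instruments DaVinci, RENESAS SH7766

Application areas shown: tablet, mobile PC, smartphone, network camera, automobile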


Architecture comparison

[Figure: block diagrams of six processors, with blocks classed as CPU, media processor or DSP, SIMD engine, or accelerator.]

Processors compared: TOSHIBA TMPV7506XBG [Tanabe12], [TMPV]; TOSHIBA TMPV7528XBG [TMPV]; ST Micro EyeQ2 [EyeQ]; RENESAS SH7766 [SH]; NVIDIA Tegra K1 [Tegra]; TI DaVinci TMS320DM814x [DaVinci]

Blocks labelled in the diagrams: MeP 266MHz; Control Processor; MPE 266MHz; MPE 150MHz; Affine Transform Accelerator; Filter Accelerator 180MHz 64 PEs; Filter Accelerator 64 PEs; Histogram Accelerator; HOG Accelerator; Matching Accelerator; SIMD Engine 133MHz (PE 1 to PE 64); ARM Cortex-A9 300MHz; MIPS34K 332MHz; VMP; Classification; Preprocess Window; Filter (Integral Image); Disparity Finder; Tracker; SH4A 534MHz; IMP-X2 266MHz (IntegralImage etc.); IMR-X 1ch (Affine); IMR-LSX 4ch (Affine); ARM Cortex-A15 2.3GHz; CUDA 192 cores; ISP; ARM Cortex-A8 1GHz; DSP (C674x+) 750MHz; Resizer Accelerator (x 1/16 to 8)

Trend: Heterogeneous multicore architecture










TMPV7506XBG Block Diagram

[Figure: block diagram of the TMPV7506XBG and its external connections.]

On-chip blocks: Media Processing Engine (MPE) #1 to #4; accelerators (Affine Transform, Filter 1, Filter 2, Histogram, HOG, Matching); 32-bit RISC CPU; main memory controller; on-chip 2MB RAM; video input I/F; video output I/F; PCI Express; MCU I/F; peripherals (CAN, GPIO, serial I/F, timer, PCM I/F, MediaLB/MOST, input capture/output compare/PWM)

External devices: 4 cameras; WVGA LCD panel; DDR2-533 SDRAM; NOR Flash; speaker; LED (7-seg); CAN MCU; other ECU

Interface labels: RGB888 / 565; RGB888 / 666 / 565, YCbCr422, BT.656, Y8 – Y12, 8-12bit Bayer; I2S; 16-bit x 2; 16-bit 2CS; UART / SPI / I2C; CAN

• Multi-core architecture for multiple (up to 4) applications: pedestrians, lanes & vehicles, traffic signs

• Accelerators for high performance image processing

• 4 camera inputs


Image Processing Accelerators

[Figure: internal block diagram of the TMPV7506XBG. Labelled blocks: MPE 0 to MPE 3 (MeP, IVC2, data cache, instruction cache, data RAM, instruction RAM, DMAC); L2 cache; crossbar switch; image processing accelerators (HOG, Histogram, Filter, Matching, Affine); DDR2 SDRAM controller; NOR Flash/SRAM controller; working RAM; system RAM; system ROM; serial I/F; video input I/F; video output I/F; PCI Express; MCU I/F; CAN.]

• Heterogeneous multi-core architecture

• Multi-level parallelism

– Data = SIMD / Instruction = VLIW / Module = Image Processing Accelerator (IPA) / Thread = Multiple cores

Architecture of TMPV7506XBG

• Thread-level parallelism with 4 MPEs, instruction-level with VLIW, and data-level with SIMD

• Fast image processing using 5 types of IPAs

• Wide-band bus with crossbar switch for parallel processing

• Flexible memory access optimization by internal memories and DMAs


Media Processing Engine (MPE)

[Figure: MPE block diagram. Labelled blocks: MeP core (instruction decoder, core registers, 32-bit ALU, instruction cache, data cache, data RAM); IVC2 (coprocessor instruction decoder, coprocessor registers, Pipe0 and Pipe1, each with an ALU); instruction buffer (16/32) issuing Instruction-1, Instruction-2, and Instruction-3.]

Media Processing Engine

• 3 instructions/cycle by VLIW (Very Long Instruction Word) technology

Media embedded Processor (MeP)

• Toshiba original 32-bit RISC CPU core

• Low power consumption

Coprocessor for media processing

• 2 instruction pipelines

• Each pipeline can execute a SIMD (Single Instruction Multiple Data) instruction

• A 64-bit register can handle eight 8-bit / four 16-bit / two 32-bit data simultaneously
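
A minimal Python sketch of the packed-register idea: eight 8-bit lanes held in one 64-bit word and processed by a single operation. The helper names and the choice of a saturating add are illustrative assumptions, not the MPE instruction set.

```python
LANES = 8                       # eight lanes per 64-bit word
LANE_BITS = 8                   # each lane is 8 bits wide
LANE_MAX = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight 8-bit values into one 64-bit integer, lane 0 in the low byte."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MAX) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MAX for i in range(LANES)]


def simd_add_saturate(a, b):
    """Lane-wise add of two packed words, clamping each lane at 255."""
    return pack([min(x + y, LANE_MAX) for x, y in zip(unpack(a), unpack(b))])


a = pack([10, 20, 30, 40, 50, 60, 70, 250])
b = pack([1, 2, 3, 4, 5, 6, 7, 10])
result = unpack(simd_add_saturate(a, b))
print(result)  # the last lane clamps at 255 instead of wrapping around
```

Treating the same 64 bits as four 16-bit or two 32-bit lanes only changes `LANES` and `LANE_BITS`; the hardware does the per-lane work in one instruction, whereas this sketch loops.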


IPA: HOG module

Function: Fast computation of the HOG/CoHOG [Watanabe10] image feature followed by linear SVM classification

Interface: In: gradient orientation image. Out: classification result / feature vector

Use case: Object (e.g. pedestrian) detection

[Figure: the HOG module takes a gradient orientation image, computes the HOG/CoHOG feature vector f, and applies a linear SVM, w^T f + b, with parameters w and b.]
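
The classification step is the linear SVM decision value w^T f + b. A minimal sketch, with made-up values for f, w, and b and a hypothetical function name:

```python
def linear_svm_score(w, f, b):
    """Decision value w^T f + b of a linear SVM for feature vector f."""
    return sum(wi * fi for wi, fi in zip(w, f)) + b


# Hypothetical 4-bin feature vector and trained parameters.
f = [3.0, 0.0, 1.0, 2.0]
w = [0.5, -1.0, 0.25, 0.0]
b = -1.0

score = linear_svm_score(w, f, b)
is_positive = score > 0  # positive score: window classified as the target object
print(score, is_positive)
```

Because the score is a single dot product, it maps naturally onto hardware that accumulates weighted histogram bins as they are produced.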


Image feature: HOG and CoHOG

[Figure: HOG is a histogram (frequency) of gradient orientations. CoHOG is a histogram (frequency) of combinations of gradient orientations, taken at an origin pixel and at a paired pixel.]


Flexibility of HOG module

The HOG module can compute different types of co-occurrence histograms depending on the input data.

[Figure: an encoded image (one integer code per pixel, e.g. gradient orientation codes or an LBP subset) is input to the HOG module, which performs block division, builds co-occurrence histograms for different pairs of pixel positions (pixel combinations), and outputs a feature vector.]

• CoHOG encodes shape

• CoHLBP [Watanabe13] encodes texture
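
A minimal sketch of a co-occurrence histogram over an encoded image, for one pixel-pair offset. The function name, the tiny 2x3 code image, and the single-offset simplification are assumptions for illustration; the features in [Watanabe10] use block division and several offsets.

```python
def cooccurrence_histogram(codes, offset, num_codes):
    """Count pairs (code at p, code at p + offset) over a 2D grid of integer codes."""
    dy, dx = offset
    height, width = len(codes), len(codes[0])
    hist = [[0] * num_codes for _ in range(num_codes)]
    for y in range(height):
        for x in range(width):
            yy, xx = y + dy, x + dx
            # Skip pairs whose second pixel falls outside the image.
            if 0 <= yy < height and 0 <= xx < width:
                hist[codes[y][x]][codes[yy][xx]] += 1
    return hist


# Hypothetical encoded image with three possible codes (0, 1, 2).
codes = [
    [0, 1, 1],
    [2, 1, 0],
]
hist = cooccurrence_histogram(codes, offset=(0, 1), num_codes=3)
feature = [count for row in hist for count in row]  # flatten into a feature vector
print(feature)
```

The same routine works whether the codes are quantized gradient orientations or local binary patterns, which is the flexibility the slide describes.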


IPA: Histogram module

Function: Fast histogram generation by parallel voting; data value conversion by a look-up table (LUT)

Interface: In: data array (e.g. image, 1D data array). Out: histogram / converted data array

Use case: Contrast enhancement by histogram equalization; vote counting for the Hough transform

[Figure: histogram of intensities; intensity conversion using a look-up table.]
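
The two functions of the module combine naturally in histogram equalization: build a histogram, derive a look-up table from its cumulative sum, then convert the data through the table. A minimal sketch using the plain cumulative-distribution mapping; the function names and sample pixels are hypothetical.

```python
def histogram(data, bins=256):
    """Vote each value into its bin."""
    hist = [0] * bins
    for v in data:
        hist[v] += 1
    return hist


def equalization_lut(hist):
    """Look-up table mapping each level to its scaled cumulative frequency."""
    total = sum(hist)
    levels = len(hist)
    lut, cumulative = [], 0
    for count in hist:
        cumulative += count
        lut.append((levels - 1) * cumulative // total)
    return lut


# Hypothetical low-contrast pixels clustered around level 50.
pixels = [50, 50, 51, 52, 52, 52, 53, 53]
hist = histogram(pixels)
lut = equalization_lut(hist)
equalized = [lut[v] for v in pixels]  # data value conversion by the LUT
print(equalized)
```

After conversion the four input levels are spread over most of the 0 to 255 range.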


IPA: Filter module

[Figure: a load/store unit feeding 64 processing elements (PE 1 to PE 64) @200MHz.]

Function: Load the local image around a reference pixel, execute user-defined operations, and replace the reference pixel value with the result

Interface: In: image data array. Out: converted image

Use case: Various local operations, e.g. Gaussian filter, Sobel filter, median filter, Harris feature point extraction
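
A sequential sketch of the load / operate / replace pattern described above, with the user-defined operation passed in as a function. The names and the copy-the-border policy are assumptions; the accelerator would run many reference pixels in parallel across its PEs.

```python
def apply_local_op(image, op, radius=1):
    """Replace each interior pixel with op(window); border pixels are copied."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            # Load the local window around the reference pixel (y, x).
            window = [row[x - radius:x + radius + 1]
                      for row in image[y - radius:y + radius + 1]]
            out[y][x] = op(window)
    return out


def sobel_x(window):
    """Horizontal Sobel response of a 3x3 window."""
    kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    return sum(k * p for krow, wrow in zip(kernel, window)
               for k, p in zip(krow, wrow))


def median(window):
    """Median of all values in the window."""
    values = sorted(p for row in window for p in row)
    return values[len(values) // 2]


# Hypothetical image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
edges = apply_local_op(image, sobel_x)
smoothed = apply_local_op(image, median)
print(edges[1], smoothed[1])
```

Swapping `op` is all it takes to change the filter, which mirrors the "user-defined operations" wording of the slide.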


IPA: Affine module

Arbitrary image deformation

Lens distortion correction

Affine Transformation

Arbitrary deformation

Conversion table

Affine trans. parameters

Lens distortion parameters
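
One common way to realize such transforms is inverse mapping: for every output pixel, compute where it comes from in the source image and copy that value. The sketch below does this for an affine transform with nearest-neighbour sampling; the function name and parameter layout are assumptions, and the slides do not state how the module itself is implemented.

```python
def affine_warp(image, inverse, fill=0):
    """Warp an image by inverse mapping with nearest-neighbour sampling.

    `inverse` = (a, b, c, d, e, f) maps an output pixel (x, y) to the
    source position (a*x + b*y + c, d*x + e*y + f).
    """
    height, width = len(image), len(image[0])
    a, b, c, d, e, f = inverse
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sx = round(a * x + b * y + c)
            sy = round(d * x + e * y + f)
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = image[sy][sx]
    return out


image = [
    [1, 2, 3],
    [4, 5, 6],
]
# Shift the content one pixel to the right: output (x, y) reads source (x - 1, y).
shifted = affine_warp(image, (1, 0, -1, 0, 1, 0))
print(shifted)
```

An arbitrary deformation such as lens distortion correction follows the same loop, except that the source position for each output pixel is read from a precomputed conversion table instead of being computed from six parameters.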


IPA: Matching module

• Template matching by SAD (sum of absolute differences)

– Finds the position that has the minimum SAD value

• Motion estimation: 2D search in a local rectangle, between the images at time t and time t+1

• Stereo disparity estimation: 1D search along an epipolar line, between the left image and the right image
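
A minimal sketch of the 1D case: for one pixel in the left image, search along the same row of the right image for the window with the minimum SAD. It assumes rectified images (so the epipolar line is a scanline) and the convention that a left pixel at x appears at x - d in the right image; the function names are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_disparity(left_row, right_row, x, half_window, max_disparity):
    """Return (disparity, cost) minimising SAD for the left window centred at x."""
    template = left_row[x - half_window:x + half_window + 1]
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        start = x - d - half_window
        if start < 0:  # candidate window would leave the image
            break
        candidate = right_row[start:start + 2 * half_window + 1]
        cost = sad(template, candidate)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Hypothetical scanlines: the pattern 9, 5, 9 is shifted by two pixels.
left_row = [0, 0, 0, 9, 5, 9, 0, 0]
right_row = [0, 9, 5, 9, 0, 0, 0, 0]
disparity, cost = best_disparity(left_row, right_row, x=4,
                                 half_window=1, max_disparity=3)
print(disparity, cost)
```

Motion estimation uses the same cost but searches a 2D rectangle of candidate positions in the next frame.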


• Back-over Prevention using stereo cameras

– Collision warning for backing up by obstacle/pedestrian detection

• Using commercial wide-angle camera for back-view monitor

– Large lens distortion

Example of optimization: Back-over Prevention

Left image Right image

[Okada13]


Processing flow of Back-over Prevention

Image input → Undistortion → Depth estimation → Obstacle detection → Pedestrian detection → Warning

Pedestrian detection: image feature CoHOG, classifier linear SVM

[Figure: stereo image input, rectified image, disparity map (blue = far), detection result.]


Example of obstacle/pedestrian detection

Crouching person is detected as an obstacle


• Each procedure is assigned to a suitable IPA

Implementation on TMPV7506XBG

[Table, flattened in extraction; entries are listed in source order.]

Procedure: Undistortion & Rectification, Luminance correction, Depth estimation, Depth→Color, Shrink, Obstacle detection, Pedestrian detection

IPA/MPE: Affine, Filter, Matching, Histogram, Affine, Filter, MPE, HOG

Processing groups: Image correction, Depth estimation, Obstacle detection, Pattern recognition


⓪ Before optimization: 1120 ms

① Use IPAs (sequential procedure): 45 ms (x25)

Optimization process (1)

[Figure: timing chart of the hardware units (HWs) on the TMPV7506XBG over time, with the video-rate period and display marked.]


② Run independent MPEs/IPAs in parallel: 42 ms (x1.1)

③ Optimize memory access: 33 ms (x1.3)

– Cache, DataRAM, WorkRAM, DMAC

④ Introduce pipeline procedure: 29 ms (x1.1)

– Perform "undistortion"@Affine for the upper half image

– When finished, start "luminance correction (zero mean)"@Filter for the upper half image while performing "undistortion"@Affine for the lower half image

Optimization process (2)

LSI power consumption is about 0.75 W

[Figure: timing chart of the hardware units on the TMPV7506XBG against the video-rate period.]
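
The speed-up factors can be checked from the quoted processing times, and the effect of pipelining two half images can be modelled with a simple two-stage formula. In the sketch below the per-frame times come from the slides; the per-half stage times in the second part are hypothetical and only illustrate why overlapping Affine and Filter work saves time.

```python
# Processing time per frame in milliseconds, as quoted on the slides.
steps = [
    ("before optimization", 1120.0),
    ("use IPAs (sequential)", 45.0),
    ("run MPEs/IPAs in parallel", 42.0),
    ("optimize memory access", 33.0),
    ("introduce pipelining", 29.0),
]


def step_speedups(steps):
    """Speed-up of each step relative to the step before it."""
    return [(name, prev / cur)
            for (_, prev), (name, cur) in zip(steps, steps[1:])]


for name, factor in step_speedups(steps):
    print(f"{name}: x{factor:.1f}")

total = steps[0][1] / steps[-1][1]
fps = 1000.0 / steps[-1][1]
print(f"overall: x{total:.1f}, about {fps:.1f} frames/s")


def sequential_time(stage_a, stage_b, chunks=2):
    """Both stages run one after another for every chunk."""
    return chunks * (stage_a + stage_b)


def pipelined_time(stage_a, stage_b, chunks=2):
    """Stage B of chunk i overlaps stage A of chunk i + 1."""
    return stage_a + (chunks - 1) * max(stage_a, stage_b) + stage_b


# Hypothetical per-half-image times (ms) for undistortion and luminance correction.
print(sequential_time(3.0, 2.0), pipelined_time(3.0, 2.0))
```

At 29 ms per frame the pipeline fits within a 30 frames/s video period of about 33 ms.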


• Improved pedestrian/vehicle detection

– Pattern recognition introducing color-based image feature

– Multi-class classification

• Obstacle detection using a single camera

– 3D reconstruction (SfM)

• Realized by image processing accelerators

Future direction of TMPV family

[Figure: next-generation TMPV: enhanced pattern recognition (vehicles, pedestrians) and new 3D reconstruction (SfM).]


3D shape (depth) estimation using a camera

3D reconstruction (Structure from Motion)

Multiple images taken from different view angles

3D shape (Depth)

3D reconstruction

Camera motion

Single camera


Obstacle detection 3D position estimation (Point cloud)

Camera motion estimation

Obstacle detection based on SfM

Feature point

Feature vector

Motion estimation

Multi-view stereo matching Obstacle detection

(Every few frames)

Camera motion

R, t 3D point cloud

Obstacle position

Refine

Feature matching

3D position estimation using image frames captured at different moments


Accurate depth information

• Finding point correspondences using multiple images

⇒ Accurate disparity estimation

• Point correspondences are represented by a parametric probability distribution [Vogiatzis11]

⇒ Saving memory consumption
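
To illustrate why a parametric distribution saves memory, the sketch below keeps only a mean and a variance per point and updates them as each new frame contributes a depth measurement. It uses a plain Gaussian as a stand-in; the actual model of [Vogiatzis11] is not reproduced here, and all numbers are hypothetical.

```python
def fuse(mean, var, z, z_var):
    """Fuse a Gaussian estimate (mean, var) with a measurement z of variance z_var."""
    new_var = (var * z_var) / (var + z_var)
    new_mean = (mean * z_var + z * var) / (var + z_var)
    return new_mean, new_var


mean, var = 10.0, 4.0            # hypothetical prior depth estimate (m, m^2)
for z in [12.0, 11.0, 11.5]:     # hypothetical per-frame depth measurements (m)
    mean, var = fuse(mean, var, z, z_var=1.0)

print(f"depth {mean:.2f} m, variance {var:.2f}")
```

Only two numbers per point are stored regardless of how many frames have been seen, instead of a full cost curve over candidate depths for every frame.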


Example of obstacle detection

Distance 32m Height 30cm


Pattern recognition

• Improved recognition accuracy using a new image feature,

Heterogeneous co-occurrence feature [Ito10]

– Extension to CoHOG feature

– Combination of 4 types of color-based image features to describe shape and texture

Example:

color information can tell us the boundary of the pedestrian










• Another frontier of embedded computer vision

• Current camera system

– records video streams from cameras, and human observers look them over after something has happened

– detects changes and motions

Surveillance camera system

Network cameras

Hub

Recorder

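• Simply connecting to an existing surveillance camera system adds intelligent functions through image analysis; sending only the necessary information to the cloud greatly reduces communication volume

• Incorporating the Visconti™2 image recognition processor, developed for automotive use, achieves low power consumption and high reliability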

[Figure: Surveillance camera sales (World), number of cameras (k). Source: 2011-2012 Surveillance camera market and business - CMOS, CCD camera series VOL.1 -]


• What is a suitable system configuration for video analysis using thousands of cameras?

• Cloud?

Making camera system intelligent

Network camera

Hub

Current camera system

Recorder

Image transfer

Data center

Processing load

Comm. load

[Pham14]


Embedded vision processing can solve the problems

Intelligent surveillance camera system

Network camera

Hub

Recorder

Meta data Data center

Image

Video analysis set-top-box

Vision processor

Face DB

Face Recognition

Human Identification

Vision processor

Data size Processing load
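
A back-of-the-envelope sketch of the data-size argument: compare the bytes per second needed to stream compressed video with the bytes per second needed to send only detection metadata. Every parameter value below is hypothetical and chosen purely to illustrate the comparison.

```python
def stream_bytes_per_second(width, height, bytes_per_pixel, fps, compression_ratio):
    """Approximate data rate of a compressed video stream."""
    return width * height * bytes_per_pixel * fps / compression_ratio


def metadata_bytes_per_second(objects_per_frame, bytes_per_object, fps):
    """Data rate when only per-object detection records are sent."""
    return objects_per_frame * bytes_per_object * fps


# All parameters are hypothetical.
video = stream_bytes_per_second(640, 480, 2, 30, compression_ratio=50)
meta = metadata_bytes_per_second(objects_per_frame=4, bytes_per_object=32, fps=30)
ratio = video / meta
print(f"video {video:.0f} B/s, metadata {meta} B/s, ratio x{ratio:.0f}")
```

With thousands of cameras the gap multiplies accordingly, which is the case for analysing images on site and sending metadata to the data center.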


• TMPV7506XBG analyzes captured images in the camera

• Example of application: Multiple object detection

– Four different types of objects are detected simultaneously

Intelligent camera using TMPV7506XBG

Total power consumption is 5-6 W

TMPV7506XBG


Video analysis set-top-box using TMPV7506XBG

The set-top-box can analyze up to 4 camera images

Video Analysis STB

Camera images Application on cloud

People Count

Trajectory










• Accelerators are often used to realize specific applications

• Some of these technologies are later introduced into general-purpose processors to achieve higher efficiency

General trend of processor LSIs

[Figure: efficiency (e.g. GOPS/W) versus time. Labels: General purpose processors; 3D graphics, image compression, super computer; GPU, Codec, SIMD; Automotive; Wearable?]


• Heterogeneous multicore architecture stays dominant

– CPU cores + GPGPU (+ Accelerators)

• Many functions will be realized by software after 2020

Future direction of vision processors

[Figure: efficiency (e.g. GOPS/W) versus time (2005, 2010, 2015, 2020) for Intel ATOM, Tegra K1, EyeQ2, TMPV7506XBG, DaVinci, and SH7766, with a line marking the minimum performance required for practical apps. Labels: Share; Limited users; Wider application range of CV will open up.]


• Type of vision processors

– Vision chip: Logic-circuit-embedded image sensor

– Discrete LSI : Heterogeneous multi-core architecture


– Toshiba’s TMPV family:

• 5 types of image processing accelerators

– Future direction

• Color-based image feature, multi-class classifier, SfM

• Vision processors will make surveillance cameras intelligent efficiently

– Efficiency is achieved by good balance between on-site processing and cloud processing

• Future direction

– Progress of LSI technology will widen CV application range

Summary


[Mead89] Carver Mead, Analog VLSI and Neural Systems, Addison-Wesley Pub

[Nitta92] Y. Nitta, et al., Proposal of an Optical Neurochip with Internal Analogue Memory and Its Fundamental Characteristics, Japanese Journal of Applied Physics, Pt. 2, Letters 31(8B), L1182-L1184, 1992.

[Bernard93] T. M. Bernard, Y. Zavidovique and F. J. Devos, A Programmable Artificial Retina, IEEE J. Solid-State Circuits, vol. 28, no. 7, pp. 789-798, 1993.

[Astrom96] A. Astrom, J.-E. Eklund and R. Forchheimer, Global Feature Extraction Operations for Near-Sensor Image Processing, IEEE Trans. Image Processing, vol. 5, no. 1, pp. 102-110, 1996.

[Ishii96] I. Ishii, et al., Target Tracking Algorithm for 1ms Visual Feedback System Using Massively Parallel Processing, Proc. IEEE Int. Conf. Robotics and Automation, pp. 2309-2314, 1996.

[Nakabo02] Y. Nakabo, et al., 3D Tracking Using Two High-Speed Vision Systems, Proc. of IEEE/RSJ Int. Conf. Intelligent Robots and Systems, pp. 360-365, 2002.

[Johansson03] R. Johansson, L. Lindgren, J. Melander and B. Moller, A Multi-Resolution 1000 GOPS 4 Gpixels/s Programmable CMOS Image Sensor for Machine Vision, Proc. IEEE Workshop on Charge-Coupled Devices and Advanced Image Sensors, 2003.

[Tanabe12] Y. Tanabe, et al., A 464GOPS 620GOPS/W Heterogeneous Multi-Core SoC for Image-Recognition Applications, ISSCC Dig. Tech. Papers, pp. 15-16, 2012.

[Watanabe10] T. Watanabe, et al., Co-occurrence Histogram of Oriented Gradients for Human Detection, IPSJ Trans. on Computer Vision and Applications, vol. 2, pp. 39-47, 2010.

[Watanabe13] T. Watanabe and S. Ito, Two co-occurrence histogram features using gradient orientations and local binary patterns for pedestrian detection, Proc. of ACPR, pp. 415-419, 2013.

[Okada13] R. Okada, T. Watanabe, M. Nishiyama, A. Seki, T. Kozakaya, M. Banno, Multiple Object Detection using Image Recognition LSI for Automobiles, Proc. of 20th ITS World Congress, No. 4185, 2013.

[Vogiatzis11] G. Vogiatzis, et al., Video-based, real-time multi-view stereo, Image and Vision Computing, vol. 29, no. 7, pp. 434-441, 2011.

[Pham14] Pham, et al., DIET: Dynamic Integration of Extended Tracklets for Tracking Multiple Persons, Proc. of ICPR, 2014 (to appear).

[Ito10] S. Ito and S. Kubota, Object Classification Using Heterogeneous Co-occurrence Features, Proc. of ECCV, 2010.

[TMPV] http://www.semicon.toshiba.co.jp/eng/product/assp/automotive/infotain/tmpv7500/index.html

[EyeQ] http://www.mobileye.com/technology/processing-platforms/eyeq2/

[SH] http://hk.renesas.com/applications/automotive/adas/surround/sh7766/index.jsp

[Tegra] http://www.nvidia.com/object/tegra-k1-processor.html

[DaVinci] http://www.tij.co.jp/jp/lit/ds/symlink/tms320dm8148.pdf

References


Product names (mentioned herein) may be trademarks of their respective companies.
