"tailoring convolutional neural networks for low-cost, low-power implementation," a...

18
Copyright © 2015 Synopsys Inc. 1 Bruno Lavigueur 12 May 2015 Tailoring CNNs for Low-cost, Low-power Implementations

Upload: embedded-vision-alliance

Post on 17-Aug-2015

53 views

Category:

Technology


7 download

TRANSCRIPT

Page 1: "Tailoring Convolutional Neural Networks for Low-Cost, Low-Power Implementation," a Presentation From Synopsys

Copyright © 2015 Synopsys Inc.

Bruno Lavigueur

12 May 2015

Tailoring CNNs for Low-cost, Low-power Implementations

Page 2: Synopsys at a Glance

• Embedded vision subsystem, built from many silicon-proven IP blocks

• DesignWare: ARC HS processor, AXI, DMA, Memory Compiler, …

• HAPS FPGA-based rapid prototyping system

• >9,300 employees

• >5,300 Masters/PhD degrees

• >2,300 IP designers

• >1,500 applications engineers

• >$2.2B FY14 revenue

• 32% of revenue spent on R&D

Page 3: CNN on Embedded Devices

• Convolutional Neural Networks (CNNs)

• A wide range of detection and classification tasks is possible

• The majority of published CNN graphs are not tailored for embedded use

• Memory requirements

• Number of floating-point operations (# of MACs)

• Yet CNNs have nice properties for parallelization on embedded devices

• Regular processing, feed-forward dataflow, no data-dependent computation

• Key questions

• Can the size and complexity of the graph be reduced with minimal impact on detection rates?

• Number of layers, connectivity, size of convolutions

• What is the impact of moving from floating point to fixed point?


Page 4: How CNNs Work (Once Trained)

• Multiple feature extraction layers

• Progressive refinement process

• Each successive layer extracts more complex features (higher level)

• Last layer performs classification

• Same computation (neuron) replicated multiple times

[Figure: input image → Layer 1 (low-level feature extraction, pooling & down-sampling) → Layer 2 (mid-level features, partially connected) → Layer 3 (high-level features) → fully connected classification]

Page 5: Visualising a CNN

• Each layer of convolutions extracts progressively higher-level features

• Subsampling / max pooling "zooms out" so that bigger objects can be detected with smaller convolutions (a small sketch follows below)

• A non-linear function on each neuron activates it

[Figure: sample output feature maps from layers 1 to 4]
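As a small illustration of the subsampling step above, here is a sketch of non-overlapping 2x2 max pooling, which halves each spatial dimension so that later small kernels cover a larger effective region of the original image (the function name and the example values are mine, not from the presentation):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling of a single feature map."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2  # trim odd edges
    f = fmap[:h, :w]
    # Group pixels into 2x2 blocks, then take the max of each block.
    return f.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 0, 1],
                 [5, 1, 2, 2],
                 [0, 1, 3, 4]])
print(max_pool_2x2(fmap))  # [[4 2]
                           #  [5 4]]
```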

Page 6: CNN Computation

• Convolution of multiple inputs together

• Fixed kernel size

• Optional subsampling: 1, 2, 4x

• Optional max-pooling

• Very regular, repetitive computation

• Dominated by MACs

• Deterministic

• Non-linear activation function (sigmoid, hyperbolic tangent, rectifier)

[Figure: M input maps I_0 … I_(M-1), each X_I × Y_I, are convolved with Z kernels (K × K) with associated weights to produce N output maps O_0 … O_(N-1), each X_O × Y_O]

Each output map is a biased sum of convolutions passed through an activation (tanh, ReLU, …):

O_j = act(B_j + Σ_i I_i ⊛ K_(i,j))
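A minimal NumPy sketch of this per-layer computation, assuming stride-1 "valid" convolutions and a ReLU activation (function names are mine, not from the presentation). As in most CNN frameworks, the kernel is applied unflipped, i.e. as cross-correlation:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D 'valid' convolution (no padding, stride 1, kernel unflipped)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + k, x:x + k] * kernel)
    return out

def cnn_layer(inputs, kernels, biases, act=lambda x: np.maximum(x, 0)):
    """inputs: M maps, shape (M, YI, XI); kernels: (M, N, K, K); biases: (N,).
    Returns N output maps: O_j = act(B_j + sum_i conv(I_i, K[i, j]))."""
    m, n, k = kernels.shape[0], kernels.shape[1], kernels.shape[2]
    out_h = inputs.shape[1] - k + 1
    out_w = inputs.shape[2] - k + 1
    outputs = np.zeros((n, out_h, out_w))
    for j in range(n):
        acc = np.full((out_h, out_w), float(biases[j]))  # start from bias B_j
        for i in range(m):
            acc += conv2d_valid(inputs[i], kernels[i, j])
        outputs[j] = act(acc)  # e.g. ReLU or tanh
    return outputs
```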

Page 7: Moving Towards Embedded CNN

• Given the nature of the algorithm, there are many ways to accelerate CNNs, including:

• Vector / SIMD unit

• Systolic array / streaming

• GPU

• Performance / power / area trade-offs will vary depending on the architecture

• In all cases the main limitations will be

• Amount of closely coupled memory available

• Maximum number of Giga-MAC/s that can be sustained

• I/O bandwidth required & available

• Optimized data movement, efficient streaming

[Figure: EV Processor block diagram: a quad-core 32-bit RISC CPU and a CNN engine of processing elements (PEs), connected to shared memory and DMA over an interconnect]

Page 8: Moving CNNs to Embedded Systems

• Graph complexity

• Number of layers (depth)

• Size of the convolution filters

• Number of connections between the layers

[Figure: Input → Layer 1 → Layer 2 → Layer 3 → Layer 4, with a small numeric example of an image convolved with a filter into a feature map followed by an activation; annotations tie graph complexity to compute requirements, ALU width/cost, memory size, data precision, and number of coefficients]

Page 9: Example of a Big & Small CNN Application

• Starting point:

• Multicoreware generated ~10 million faces/non-faces from over 200 Hollywood and Bollywood full-length movies

• Trained a CNN to detect faces in those movies

Example of a Big& Small CNN Application

Metric Alexnet like Embedded

version

Weight Space 400 MB 0.5 MB

Layers 10

(7Cv+3 FC)

5

(3 Cv+2 FC)

Compute 200x 1x

Bandwidth 400x 1x

F1-Score .963 .905

Accuracy .993 .981

VGA 30 FPS 4800 GOPS 24 GOPS

• Cv: Convolution layers

(partially connected)

• FC: Fully connected

layers

Page 10: Reducing Complexity of the Graph

• Used standard open-source projects to train networks with floating point and GPU acceleration to explore the network topology

• Cuda-convnet, Caffe, Theano

• Didn't initially worry about numerical precision, as the literature has shown that CNNs are robust to reduced precision

• From scratch: small networks can be trained very fast

• Enables lots of shots on goal (see the sketch after this list):

• Using scripting and many GPUs

• Varying the number of network layers, convolutions, subsampling & pooling

• Explored a huge space and quickly converged on a graph with good learning

• From an existing graph: also worked backwards from a high-accuracy large graph

• Iteratively reduced it and retrained the best candidates

• Ended up with similar networks in both cases

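A hedged sketch of what such a scripted exploration could look like: enumerate candidate topologies, discard those whose weight or MAC count exceeds an embedded budget, and train only the survivors. The budget numbers, the training-patch size, and the train_and_score() hook are hypothetical placeholders, not values from the presentation:

```python
import itertools

def conv_cost(in_maps, out_maps, k, h, w):
    """Weights and MACs of one conv layer on an h x w map:
    k*k MACs per input/output map pair per output pixel (edges ignored)."""
    weights = in_maps * out_maps * k * k + out_maps  # kernels + biases
    macs = in_maps * out_maps * k * k * h * w
    return weights, macs

candidates = itertools.product(
    [2, 3, 4],      # number of convolution layers
    [3, 5, 7],      # kernel size K
    [8, 16, 32],    # feature maps per layer
)

for depth, k, maps in candidates:
    h = w = 32      # assumed training-patch size (hypothetical)
    total_w = total_m = 0
    in_maps = 1     # greyscale input
    for _ in range(depth):
        wts, macs = conv_cost(in_maps, maps, k, h, w)
        total_w += wts
        total_m += macs
        in_maps = maps
        h //= 2; w //= 2                      # 2x subsampling per layer
    if total_w < 100_000 and total_m < 10_000_000:   # hypothetical budget
        print(f"train: depth={depth} k={k} maps={maps} "
              f"weights={total_w} macs={total_m}")
        # train_and_score(depth, k, maps)  # hypothetical GPU training hook
```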

Page 11: Training Optimizations

• Improve the F-1 score with classic techniques such as

• Data normalization

• Hard negative mining (boosting)

• Annealing the learning rate

• Data augmentation: flips, random cropping, color-space changes, …

• These moved the initial system from an F-1 of ~0.74 to ~0.90

• Once the graph topology and training are satisfactory, look at the impact of moving to fixed point

• The tests below were done with 31,437 positive and 263,145 negative samples

Metric            Initial   Optimized
True positives    19706     27093
False positives   1769      1335
False negatives   11731     4344
F-1 score         0.7449    0.9051
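As a quick sanity check, the F-1 scores in the table follow directly from precision and recall over these counts; a minimal sketch:

```python
# Recompute the table's F-1 scores from the true/false positive/negative counts.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(19706, 1769, 11731), 4))  # 0.7449 (initial)
print(round(f1_score(27093, 1335, 4344), 4))   # 0.9051 (optimized)
```

Note that TP + FN = 31,437 in both columns, matching the positive-sample count quoted above.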

Page 12: Moving to Fixed Point: Empirical Approach

• Compare the output of every layer with the reference floating-point version

• Differences may grow after each layer

• The detection threshold might need to be tweaked to achieve similar results

[Figure: fixed-point convolution pipeline, worked example]

• Greyscale image, 8-bit pixels; convert to fixed point based on the value range, e.g. 16-bit (Q2S13)

• Image [[200, 64, 1], [150, 50, 1], [1, 10, 220]] convolved with filter [[4, 0], [0, -1]] gives accumulator values [[750, 255], [590, -20]]; make sure the accumulator is wide enough, e.g. 32-bit signed

• The non-linear function (ReLU) zeroes the negatives: [[750, 255], [590, 0]]

• Shift right and saturate to avoid overflow: x = max(0, x) >> N, here giving the feature map [[255, 127], [255, 0]] with N = 1 and 8-bit saturation

• Choose N according to the dynamic range of the x values
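A small Python/NumPy sketch of this flow using the worked numbers above (the variable names and the use of NumPy are mine; the real implementation targets the fixed-point hardware):

```python
import numpy as np

image = np.array([[200, 64, 1],
                  [150, 50, 1],
                  [1, 10, 220]], dtype=np.int32)
filt = np.array([[4, 0],
                 [0, -1]], dtype=np.int32)

# Convolve into a wide (32-bit) accumulator so products cannot overflow.
acc = np.zeros((2, 2), dtype=np.int32)
for y in range(2):
    for x in range(2):
        acc[y, x] = np.sum(image[y:y + 2, x:x + 2] * filt)
print(acc)                         # [[750 255] [590 -20]]

relu = np.maximum(acc, 0)          # non-linear activation
print(relu)                        # [[750 255] [590   0]]

N = 1                              # chosen from the dynamic range of x
shifted = relu >> N                # x = max(0, x) >> N
fmap = np.minimum(shifted, 255)    # saturate back to 8 bits
print(fmap)                        # [[255 127] [255   0]]
```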

Page 13: Results for the Face Detection Application

• FDDB: Face Detection Data Set and Benchmark

• Results shown for the embedded small & fixed-point graph

• Localization can be improved with pre/post-processing

• This impacts scores

• Not done here

Type                F-1
Best (CascadeCNN)   0.91
Middle 10 average   0.85
Embedded – 40%      0.84
Embedded – 50%      0.82

• The embedded rows use the 8-bit fixed-point graph

Page 14: Low-cost, Low-power, Flexible CNN

• Design-time configurable

• Number of CNN processing elements (2 to 8)

• Streaming interconnection network configured for the number of cores

• Runtime reconfigurable

• Flexible point-to-point connections between all cores

• CNN-optimized instruction set

• Convolutions, MAC, LUT, …

• Micro-DMA & stream interface for data movement

• Programmable

• Using the generated C compiler

• Each CNN PE has local data & program memory

[Figure: CNN engine block diagram: up to 8 PEs on a reconfigurable streaming interconnect, alongside a quad-core 32-bit RISC cluster with sync support, DMA, and shared data memory on the subsystem interconnect]

Page 15: Mapping Example and Performance

L1&4 FIFO L2

L3a

L3b

Subsystem Interconnect

L1 L2 L3 L4

• Input image read only once

• 30 cycles on average to do 8 convolutions of 5x5 in parallel

• Including all data movement & contention

• Over 85% MAC resource utilization (8 MACs / CNN PE)

• ~15 mW per PE @ 28nm HPM

• Including memory & interconnect

• Mapping on 4 processing elements (4 PE, 5 FIFO configuration)

• Smaller layers merged together
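One way to read the cycle and utilization figures above (an assumed derivation; the slide does not spell it out):

```latex
8 \times (5 \times 5) = 200\ \text{MACs}, \qquad
\frac{200\ \text{MACs}}{8\ \text{MACs/cycle}} = 25\ \text{cycles (ideal)}, \qquad
\frac{25\ \text{cycles}}{30\ \text{cycles measured}} \approx 83\%
```

That per-kernel estimate lands in the same range as the quoted >85% MAC utilization, which presumably averages over the full graph mapping.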

Page 16: Demonstrator

[Figure: demonstrator block diagram: ARC EV52 processor (RISC multi-core plus CNN engine with PE 1 … PE 8, each with local memory), shared data memory and DMA on the AXI subsystem interconnect, and a second AXI interconnect to DDR and an ARC HS core]

• The ARC HS core: reads in the frame, builds the image pyramid (scaling), runs non-max suppression and softmax, and displays the result (see the sketch at the end of this page)

• A host application on a workstation streams webcam video frames over the UMRBus to DDR and back

• The CNN graph runs on a HAPS-70 S12 FPGA prototyping system clocked at 50 MHz (10% of real-time)
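A rough sketch of two of the host-side post-processing steps named above, softmax and greedy non-max suppression (NMS). The box format, the 0.5 overlap threshold, and the function names are illustrative assumptions, not taken from the demonstrator code:

```python
import numpy as np

def softmax(scores):
    """Turn raw classifier outputs into probabilities."""
    e = np.exp(scores - np.max(scores))   # shift for numerical stability
    return e / e.sum()

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop heavily overlapping ones.
    boxes: (n, 4) array of [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # drop heavy overlaps
    return keep
```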

Page 17: Lessons Learned

• CNN compute requirements can be dramatically reduced with only a small impact on detection rates

• Works well when the number of object classes to detect is kept small

• Offline training is the critical step for obtaining good performance

• Specialized and programmable hardware can be used to efficiently implement many different CNN graphs

• Low power and area

• Some pre- and post-processing is needed to make a complete and useful application

• CNN accelerator coupled with a quad-core RISC cluster

• It is useful to couple the CNN with other processing steps to improve performance

• Shrinking the image when it doesn't impact detection rates

• Sliding a detection window over an image

• Region of interest


Page 18: Resources

• Selected CNN papers

• Embedded facial image processing with Convolutional Neural Networks

• http://liris.cnrs.fr/Documents/Liris-6072.pdf

• Memory-Centric Accelerator Design for Convolutional Neural Networks

• http://parse.ele.tue.nl/system/attachments/58/original/iccdMP17.pdf?1381908921

• CNN tutorials & courses

• Stanford CNN course: http://cs231n.github.io/

• Neural network intro and visualization: http://colah.github.io/

• Synopsys DesignWare Embedded Vision Processors

• http://www.synopsys.com/ev

• More information and a demo available at the Technology Showcase (Mission City Ballroom, Tables 3 & 4)
