r-cnn minus r - university of oxfordvgg/publications/2015/lenc15a/presentation.pdf · r-cnn minus...

27
R-CNN minus R Karel Lenc, Andrea Vedaldi Visual Geometry Group, Department of Engineering Science

Upload: others

Post on 28-Jan-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

R-CNN minus R

Karel Lenc, Andrea Vedaldi

Visual Geometry Group, Department of Engineering Science

Page 2: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Object detection 2

Goal: tightly enclose objects of a certain type in a bounding box

“bikes” “planes”

“birds”“horses”

Page 3: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Top performer: Region proposals + CNN 3

WHERE

WHAT

proposal

generation

CNN chair

CNN

CNN

background

potted

plant

[Girshick et al. 2013, He et al. 2014, 2015]

Page 4: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

WHAT

Convolutional neural networks

E.g. region classification with AlexNet, VGG VD

[Krizhevsky et al. 2012, Simonyan Zisserman 2014]

WHERE

Segmentation algorithm

E.g. region proposal from selective search

[Uijlings et al. 2013]

Top performer: Region proposals + CNN 4

Page 5: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Can CNN understand where as well as what? 5

proposal

generationWhere? WHAT

WHERE

Convolutional

neural network

?

Page 6: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Approaches to object detection

Scanning windows

▶ Sliding windows

▶ HOG detector [Dalal Triggs 2005]

▶ DPM [Felzenszwalb et al. 2008]

▶ Cascaded windows

▶ AdaBoost [Viola Jones 2004]

▶ MKL [Vedaldi et al. 2009]

▶ B and Bound [Lampert et al. 2009]

▶ Jumping windows

▶ [Sivic et al. 2008]

▶ Selective windows

▶ [Endres and Hoeim 2010, Uijlings

et. al 2011, Alexe et al. 2012, Gu et

al. 2012]

Hough voting

▶ Implicit shape models [Amit Geman

1997, Leibe et al. 2003]

▶ Max margin [Maji Berg 2009], Random

Forests [Gall Lempitsky 2009]

Classifiers & features

▶ linear SVMs, kernel SVM, Fisher

Vectors, … [Cinbis et al. 2013, …]

▶ convolutional neural networks

[Sermanet et al. 2014, Girshick et al.

2014, …]

▶ HOG, SIFT, C-SIFT, …

[van de Sande et al. 2010, …]

▶ Segmentation cues, … [Shotton et al.

2008, Cinbis et al. 2013, …]

6

Page 7: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Evolution of object detection 7

PASCAL VOC 2007 data

DPM [Felzenszwalb et

al.]

MKL [Vedaldi et al.]

DPMv5 [Girshick et al.]

Regionlet [Wang et al.]

RCNN-Alex [Girshick et al.]

RCNN-VGG [Girshick et al.]

0

10

20

30

40

50

60

70

2008 2009 2010 2011 2012 2013 2014 2015

mA

P [

%]

Year

Page 8: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Pros: simple and effective

Cons: slow as the CNN is re-evaluated for each tested region

R-CNN 8

[Girshick et al. 2013]

CNN chair

CNN

CNN

background

potted

plant

c5c1 c2 c3 c4 f6 f7 f8(SVM)

label

Page 9: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

SPP R-CNN

Convolutional features = local features

Region descriptor = pooled local features

▶ Spatial pyramid + max pooling [He et al. 2014]

▶ Bag of words, Fisher vector, VLAD, …. [Cimpoi et. al. 2015]

Order of magnitudes speedup

9

[He et al. 2014]

chair

bowl

potted

plant

c5c1 c2 c3 c4

f6 f7 f8(SVM)

f6 f7 f8(SVM)

f6 f7 f8(SVM)

local features pooling encoder

Page 10: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Computational cost

SPP-CNN results in a significant test-time speedup

However, region proposal extraction is the new bottleneck

R-CNN minus R: can we get rid of region proposal extraction?

10

Detection time

0 2000 4000 6000 8000 10000 12000

SPP-CNN

R-CNN

Avg. Time per Image [ms]

Sel. Search CNN evaluation

Page 11: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Streamlining R-CNN and SPP-CNN

Dropping proposal generation

Page 12: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Streamlining R-CNN and SPP-CNN

Dropping proposal generation

Page 13: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

A complex learning pipeline

1. Pre-train a large CNN (on ImageNet)

2. Extract region proposals (on PASCAL VOC)

3. Use pre-processed regions to:

1. Fine-tune the CNN

2. Learn an SVM to rank regions

3. Learn a bounding-box regressor to refine localization

13

(SPP) R-CNN training comprises many steps

label (fine tuning)c5c1 c2 c3 c4 f6 f7 f8

SVM label (ranking)

linear

regress. b. box

Page 14: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

A complex learning pipeline

With SPP R-CNN of [He et al. 2014]

fine-tuning is limited to the fully connected layers

14

(SPP) R-CNN training comprises many steps

labelc5c1 c2 c3 c4 f6 f7 f8

SVM label

linear

regress. b. box

frozen

Page 15: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Streamlining R-CNN

Up to a simple transformation, softmax is just as good as hinge loss for box ranking.

15

Removing the SVM

phase score(s) learning loss mAP

fine tuning 𝑆𝑐 = exp( 𝑤𝑐 , 𝜙 𝒙 + 𝑏𝑐) − log

𝑆𝑐0𝑆0 + 𝑆1 + 𝑆2 + …+ 𝑆𝐶

38.1

region ranking

𝑄1 = 𝑤1, 𝜙 𝒙 + 𝑏1⋮

𝑄𝐶 = 𝑤𝐶 , 𝜙 𝒙 + 𝑏𝐶

max 0, 1 − 𝑦 𝑄1⋮

max{0, 1 − 𝑦 𝑄𝐶}59.8

region raking 𝑄𝑐 = log𝑆𝑐𝑆0

from fine-tuning 58.4

Page 16: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Streamlining R-CNN and SPP-CNN

SPP and bounding box regressions can be easily implemented in a CNN (with a DAG

topology) and trained jointly in one step

16

See also [Fast R-CNN and Faster R-CNN]

labelc5c1 c2 c3 c4 f6 f7 f8

lin.

regress.b. box

frozen

labelc5c1 c2 c3 c4 f6 f7 f8

fregb. box

SPP

Page 17: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Streamlining R-CNN and SPP-CNN

Dropping proposal generation

Page 18: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

A constant-time region proposal generator

Proposals are now very fast but very inaccurate

We let the CNN compensate with the bounding box regressor

18

Algorithm

Preprocessing

Collect all the training bounding boxes (x1,y1,x2,y2)

Use K-means to extract K clusters in (x1,y1,x2,y2) space

Proposal generation

Regardless of the image, return the same K cluster centers

Page 19: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Proposal statistics on PASCAL VOC 19

ground truthselective search

2K

sliding windows

7K

clustering

3K

Page 20: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Information pathways 20

[See also Lenc Vedaldi CPVR 2015]

labelc5c1 c2 c3 c4 f6 f7 f8

linear

regress.bounding box

shared local

features

where path

what path

equivariant

representation

invariant

representation

Page 21: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

CNN-based bounding box regression

Dashed line: proposals Solid line: corrected by the CNN

21

Page 22: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Performance

Observations

▶ Selective search is much better than fixed generators

▶ However, bounding box regression almost eliminates the difference

▶ Clustering allows to use significantly less boxes than sliding windows

22

0.42

0.44

0.46

0.48

0.5

0.52

0.54

0.56

0.58

0.6

Sel. Search (2Kboxes)

Slid. Win. (7KBoxes)

Clusters (2KBoxes)

Clusters (7KBoxes)

mA

P(V

OC

07)

Baseline BBR

Page 23: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Timings 23

Finding (1) Streamlining accelerates SPP

0 50 100 150 200 250 300 350 400 450

SPP

Streamlined SPP

Avg. Time per Image [ms]

Im. Prep. GPU↔CPU CONV Layers Spat. Pooling FC Layers Bbox Regr.

Page 24: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Timings 24

Finding (2) Dropping selective search is a huge benefit

0 500 1000 1500 2000 2500 3000

SPP

Streamlined SPP

Minus R

Avg. Time per Image [ms]

Sel. Search Im. Prep. GPU↔CPU CONV Layers

Spat. Pooling FC Layers Bbox Regr.

Page 25: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Timings 25

Finding (2) Dropping selective search is a huge benefit

0 2000 4000 6000 8000 10000 12000

RCNN

SPP

Streamlined SPP

Minus R

Avg. Time per Image [ms]

Sel. Search Im. Prep. GPU↔CPU CONV Layers

Spat. Pooling FC Layers Bbox Regr.

Page 26: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Timings 26

Test-time speedups

1.0

4.5

5.0

67.5

0 10 20 30 40 50 60 70 80

RCNN

SPP

StreamlinedSPP

Minus R

Times faster than R-CNN

Page 27: R-CNN minus R - University of Oxfordvgg/publications/2015/Lenc15a/presentation.pdf · R-CNN minus R: can we get rid of region proposal extraction? 10 Detection time 0 2000 4000 6000

Conclusions

Current CNNs can localize objects well

▶ External segmentation cues bring only a minor benefit at a great expense

Benefits of CNN-only solutions

▶ Much faster, particularly at test time

▶ Much simpler and streamlined implementations

Future steps

▶ Eliminate the remaining accuracy gap

▶ Essentially achieved in

[Faster R-CNN, Ren et al. 2015]

▶ Beyond bounding boxes

▶ Beyond detection

27