TRANSCRIPT
Module 5
Deep ConvNets for Local Recognition
Joost van de Weijer, 4 April 2016
Previously, end-to-end..
Slide credit: Jose M Àlvarez
[Figure: an input image passes through a learned representation that outputs a label, e.g. "Dog".]

Part I: End-to-end learning (E2E): a representation is learned for Task A (e.g. image classification) on Domain A.
Part I': End-to-end fine-tuning (FT): the learned representation is transferred to Domain B and fine-tuned.
Previously, fine-tuning..
Slide credit: X. Giro
Fine-tuning a pre-trained network
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
Fine-tuning: high learning rate in the new layer, and low learning rate in all other layers.
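To make the learning-rate rule concrete, here is a minimal sketch (assuming PyTorch and torchvision, which the lecture does not prescribe): the last layer of a pre-trained AlexNet is replaced by a new one and given a higher learning rate than the pre-trained layers.

```python
# A minimal sketch (assumed PyTorch/torchvision setup) of fine-tuning with a
# high learning rate on the new layer and a low one on the pre-trained layers.
import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True)
num_new_classes = 20                                           # e.g. the PASCAL classes
model.classifier[6] = torch.nn.Linear(4096, num_new_classes)   # replace the last layer

new_params = list(model.classifier[6].parameters())
old_params = [p for p in model.parameters() if all(p is not q for q in new_params)]

optimizer = torch.optim.SGD([
    {"params": old_params, "lr": 1e-4},   # low learning rate: pre-trained layers
    {"params": new_params, "lr": 1e-2},   # high learning rate: new layer
], momentum=0.9)
```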
Previously, off-the-shelf features..
Slide credit: X. Giro
Part I: End-to-end learning (E2E): a representation is learned for Task A (e.g. image classification).
Part II: Off-the-shelf features: the learned representation is reused for Task B (e.g. image retrieval).
Image classification: image as input, label (e.g. "Orange") as output.
• Spatially coded image representations (like spatial pyramids): d_x × d_y × d_F
• Orderless image representations (like BOW): 1 × 1 × d_F
Two deep lectures in M5
Deep ConvNets for Recognition at...
• Global Scale (today’s lecture)
• Local Scale (next lecture)

Image classification: image as an input, label (e.g. "Orange") as output.
How to process non-square images? Resize, zero padding, or largest centred square.
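As an illustration of the three options, here is a minimal sketch (assuming torchvision transforms; the image is a stand-in) for turning a non-square image into the square input the network expects.

```python
# A minimal sketch (assumed torchvision transforms) of the three options above.
from PIL import Image
from torchvision import transforms

img = Image.new("RGB", (640, 480))   # stand-in non-square image
size = 227

# 1) plain resize: distorts the aspect ratio
resized = transforms.Resize((size, size))(img)
# 2) zero padding: pad the short side to a square (CenterCrop pads with zeros
#    when the target is larger than the image), then resize
padded = transforms.Resize((size, size))(transforms.CenterCrop(max(img.size))(img))
# 3) largest centred square: crop the biggest central square, then resize
cropped = transforms.Resize((size, size))(transforms.CenterCrop(min(img.size))(img))
```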
Local object recognition
• object localization (single object)
• object detection
• semantic segmentation
Classification + Localization
slide credit: Li, Karpathy, Johnson
Localization as regression
slide credit: Li, Karpathy, Johnson
Add two heads to the network: a classification head and a regression head.
• Classification head: C class scores.
• Regression head: C×4 numbers (4 bounding-box coordinates per class).
Problem: multiple classes.
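A minimal sketch of the two-head idea (assuming PyTorch; the backbone and sizes are illustrative, not the network used in the slides): one shared feature extractor feeds both a classification head with C scores and a regression head with C×4 box numbers.

```python
# A minimal sketch (assumed PyTorch, toy backbone) of localization as regression.
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for the shared conv + FC trunk
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(16, num_classes)       # C class scores
        self.reg_head = nn.Linear(16, num_classes * 4)   # C x 4 box coordinates

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.reg_head(feat)

scores, boxes = ClassifyAndLocalize(num_classes=20)(torch.randn(1, 3, 224, 224))
print(scores.shape, boxes.shape)   # (1, 20) and (1, 80)
```

In training, the classification head gets a softmax/cross-entropy loss and the regression head an L2-type loss on the box coordinates.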
Localization as regression (example)
Esteve Cervantes, Evaluating deep features for Fashion Recognition (master project 2015): example of clothing localization. Regression is done in two steps: first the person bounding box, then the clothing bounding boxes.
Local object recognition
• object localization (single object)
• object detection
• semantic segmentation
Any ideas?
Sliding window
[Figure: a 227×227 window is slid over the image; at each position classification + regression produce a score (e.g. 0.03, 0.83) and a regressed bounding box.]
Compute a new regressed bounding box and classification score for all sliding-window positions.
Repeat for different scales and combine all results (e.g. with non-maximum suppression).
[Figure: detections at multiple scales with scores 0.83 and 0.99.]
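A minimal sketch of the combination step (assuming torchvision; the boxes and scores are made up): non-maximum suppression keeps the highest-scoring box among heavily overlapping detections.

```python
# A minimal sketch (assumed torchvision) of non-maximum suppression over
# regressed boxes and classification scores from several window positions.
import torch
from torchvision.ops import nms

# boxes in (x1, y1, x2, y2) format, e.g. regressed from three window positions
boxes = torch.tensor([[10., 10., 110., 110.],
                      [15., 12., 118., 115.],
                      [300., 40., 380., 120.]])
scores = torch.tensor([0.83, 0.99, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)   # indices of the surviving boxes
print(keep)   # tensor([1, 2]): the 0.99 box suppresses the overlapping 0.83 box
```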
Sliding window (efficient computation)

Let us for simplicity consider a simple three-layer network for a 10×10×3 input:
• conv1: 5 filters of 5×5×3. What are the spatial dimensions of the conv1 output? 6×6 (×5 channels).
• fc1: 10 units, each looking at the full 6×6×5 conv1 output → 1×1×10.
• fc2: 2 outputs (car / not car) → 1×1×2.

Now slide this 10×10 network over a larger 12×17 image. Part of the convolutional features are the same for overlapping windows and do not need recomputation!

How many 10×10 windows are there in this 12×17 image? (12 − 10 + 1) × (17 − 10 + 1) = 3 × 8 = 24.

The convolutions can be computed in a single pass: applying conv1 (5 filters of 5×5×3) to the full 12×17 image gives an 8×13×5 feature map.

The fully connected layers can also be rewritten as convolutions:
• fc1 = conv2: 10 filters of 6×6×5, applied to the 8×13×5 map → 3×8×10.
• fc2 = conv3: 2 filters of 1×1×10 → 3×8×2.

We obtain the 3×8 = 24 classification scores while sharing the computation of the convolutional features.
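To make the rewrite concrete, here is a minimal sketch (assuming PyTorch; the lecture does not name a framework) of the toy network above written fully convolutionally and applied to the 12×17 image in one pass.

```python
# A minimal sketch (assumed PyTorch) of the fully-convolutional trick: the toy
# network conv1 -> fc1 -> fc2 is rewritten so the fully connected layers become
# 6x6 and 1x1 convolutions, yielding one score per 10x10 window in a single pass.
import torch
import torch.nn as nn

fully_conv = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5),    # conv1: 5 filters of 5x5x3
    nn.ReLU(),
    nn.Conv2d(5, 10, kernel_size=6),   # fc1 rewritten as conv2: 10 filters of 6x6x5
    nn.ReLU(),
    nn.Conv2d(10, 2, kernel_size=1),   # fc2 rewritten as conv3: 2 filters of 1x1x10
)

x = torch.randn(1, 3, 12, 17)          # a 12x17 RGB image
scores = fully_conv(x)                 # shape (1, 2, 3, 8): 3x8 = 24 window scores
print(scores.shape)
```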
Sliding window (efficient computation)
Networks can be rewritten as fully convolutional networks to speed up computation at test time.
Example of bear and fish detection at multiple scales.
Sermanet et al., ‘Integrated Recognition, Localization and Detection using Convolutional Networks’, ICLR 2014
object proposals
selective search
K. Van de Sande et al. Segmentation as selective search for object recognition. ICCV 2011.
• Object proposal methods compute boxes which potentially contain an object.
• Features are extracted for each box and a classifier is applied.
• Typically thousands of boxes (but far fewer than sliding window).
• Many different approaches: selective search, edge boxes, GOP, etc.
object proposals (R-CNN)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
1. Compute object proposals (~2k).
2. Warp the dilated bounding box.
3. Compute CNN features.
4. Classify regions (car: yes, person: no) and perform bounding box regression.
object proposals (R-CNN)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
• Start from AlexNet; remove the last layer and fine-tune for the 20 PASCAL classes.
• Use the fc7 4096-d vector as the description of the bounding box.
• Train an SVM on this representation for classification.
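A minimal sketch of this recipe (assuming torchvision and scikit-learn; the warped regions and labels are toy data, and the feature extractor is my own slicing of torchvision's AlexNet, not R-CNN's exact code): warped proposal crops are mapped to fc7 descriptors, and a linear SVM is trained on them.

```python
# A minimal sketch (assumed torchvision + scikit-learn) of warped boxes ->
# AlexNet fc7 (4096-d) -> linear SVM, following the R-CNN recipe above.
import torch
import torchvision
from sklearn.svm import LinearSVC

alexnet = torchvision.models.alexnet(pretrained=True).eval()
# keep everything up to (and including) the fc7 activation, drop the final classifier
fc7_extractor = torch.nn.Sequential(alexnet.features, alexnet.avgpool,
                                    torch.nn.Flatten(),
                                    *list(alexnet.classifier.children())[:-1])

with torch.no_grad():
    warped_regions = torch.randn(8, 3, 227, 227)     # 8 warped proposal crops (toy data)
    fc7 = fc7_extractor(warped_regions)              # (8, 4096) descriptors

labels = [1, 0, 1, 0, 0, 1, 0, 1]                    # toy car / not-car labels
svm = LinearSVC().fit(fc7.numpy(), labels)           # per-class SVM on fc7 features
```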
object proposals (R-CNN)
Slide credit: Girshick; Li, Karpathy, Johnson
[Figure: R-CNN detection examples.]
object proposals (R-CNN)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
1. Compute object proposals (~2k).
2. Warp the dilated bounding box.
3. Compute CNN features.
4. Classify regions (car: yes, person: no) and regress an improved bounding box.
Drawbacks:
• not end-to-end
• warping of boxes
• lots of double computation (overlap of bounding boxes)
object proposals (Fast R-CNN)
He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015
• Compute the convolutional features ('conv1'-'conv5') once per image; this computation is shared.
object proposals (Fast R-CNN)
This was first proposed by He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015
• Compute the convolutional features once (shared computation).
• Extract features from 'conv5' for all bounding boxes.
object proposals (Fast R-CNN)
• For all bounding boxes: pool the features into a spatial grid, i.e. Region of Interest pooling (ROI pooling), on top of the shared computation.
object proposals (Fast R-CNN)
• Pool the features in a spatial grid (ROI pooling).
• Fully connected layers (FCs) feed a classification head (log loss) and a regression head (smooth L1 loss).
• The whole network is trained end-to-end, with the convolutional computation shared across boxes.
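A minimal sketch of ROI pooling (assuming torchvision's `roi_pool`; the feature map, stride and box are made up): conv5 features are computed once, and each proposal is pooled into a fixed spatial grid before the FC layers.

```python
# A minimal sketch (assumed torchvision) of Region of Interest pooling.
import torch
from torchvision.ops import roi_pool

conv5 = torch.randn(1, 256, 40, 60)            # shared convolutional features
# one proposal: (batch_index, x1, y1, x2, y2) in input-image coordinates
boxes = torch.tensor([[0., 32., 48., 352., 240.]])

# spatial_scale maps image coordinates to conv5 coordinates (e.g. stride 16)
pooled = roi_pool(conv5, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # (1, 256, 7, 7): a fixed-size descriptor per box
```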
object proposals (Fast R-CNN)

                     Fast R-CNN   R-CNN
Train time (hours)   9.5          84
Train speedup        8.8x         -
Test time / image    0.32s        47s
Test speedup         146x         -
mAP                  66.9%        66.0%
Multi-task training also improves classification performance; end-to-end training improves results.
Test time does not include object proposal computation (which is now the bottleneck)
object proposals (Faster R-CNN)
• Compute the object proposals directly in the network: a Region Proposal Network (RPN) operates on the shared 'conv5' features, followed by ROI pooling and the fully connected layers (FCs).
object proposals (Faster R-CNN)
Slide credit: Kaiming He
• Slide a window over the feature map.
• Add a small network which classifies and regresses the bounding boxes.
• The classification score provides the confidence that an object is present.
• Use N anchors per position for proposals of varying aspect ratios.
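A minimal sketch of the anchor idea (the strides, sizes and aspect ratios are my own toy numbers, not Faster R-CNN's exact configuration): at every position of the conv feature map, N reference boxes of different aspect ratios are placed, and the RPN then classifies and regresses each one.

```python
# A minimal sketch (toy numbers) of generating N anchors per feature-map position.
import torch

stride = 16                          # conv5 positions are ~16 px apart in the image
aspect_ratios = [0.5, 1.0, 2.0]      # N = 3 anchors per position
base_size = 128.0                    # anchor area ~ base_size**2

anchors = []
for y in range(3):                   # a tiny 3x4 feature map, for brevity
    for x in range(4):
        cx, cy = x * stride + stride / 2, y * stride + stride / 2
        for ar in aspect_ratios:     # same area, different width/height ratio
            w, h = base_size * ar ** 0.5, base_size / ar ** 0.5
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

anchors = torch.tensor(anchors)      # (3*4*3, 4) boxes for the RPN to score and regress
print(anchors.shape)
```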
object proposals (Faster R-CNN)
Slide credit: Kaiming He
Computation for 1000 boxes:

Model                      Time
Edge boxes + R-CNN         0.25 sec + 1000*ConvTime + 1000*FcTime
Edge boxes + Fast R-CNN    0.25 sec + 1*ConvTime + 1000*FcTime
Faster R-CNN               1*ConvTime + 1000*FcTime
object proposals (Faster R-CNN)
slide credit: Li, Karpathy, Johnson
object localization
Winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015: residual networks combined with Faster R-CNN.
summary object detection
slide credit: Li, Karpathy, Johnson
• Object localization: when there is one (or a known number) of objects/classes, you can do object localization by adding a ‘regression head’ to your network.
• Sliding window + CNN can be computed efficiently by rewriting the network as a fully convolutional network.
• Object proposal methods are straightforwardly combined with CNNs, but for fast/good results consider:
  • adding a regression head to improve bounding box estimation;
  • sharing the computation of the convolutional features (SPP);
  • end-to-end training of the network (Fast R-CNN);
  • including a Region Proposal Network for fast object proposals within the network (Faster R-CNN).
Local object recognition
• object localization (single object)
• object detection
• semantic segmentation
semantic segmentation
• Semantic segmentation: assign a class to all pixels.
• Instance segmentation: assign pixels to a particular instance of a class (chair 1, chair 2, etc.).

A ConvNet applied to a patch predicts the class of its center pixel. Because of the convolutions, the output resolution is smaller than the input and upsampling is required. Write the network as a fully convolutional network and apply it to the whole image.
semantic segmentation
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Training uses a pixelwise loss.
semantic segmentation
[Figure: a 3×3 convolution with padding [1 1 1 1] and stride [1 1] preserves the spatial resolution of the input; with stride [2 2] the output resolution is halved.]
[Figure: a 3×3 deconvolution with padding [1 1 1 1] and stride [2 2] doubles the resolution of the input.]
• Deconvolutions are also called fractionally strided convolutions or transposed convolutions.
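A minimal sketch contrasting the two operations (assuming PyTorch; the sizes are illustrative): a strided convolution halves the spatial resolution, and a transposed convolution with the same stride brings it back.

```python
# A minimal sketch (assumed PyTorch) of strided convolution (downsampling) vs
# transposed ("de-")convolution (upsampling).
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

down = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(8, 8, kernel_size=3, stride=2, padding=1, output_padding=1)

y = down(x)   # (1, 8, 16, 16): stride 2 halves the resolution
z = up(y)     # (1, 8, 32, 32): the transposed conv restores the resolution
print(y.shape, z.shape)
```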
semantic segmentation
Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
semantic segmentation
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Combine the ‘where’ (local, shallow layers) with the ‘what’ (global, deep layers) using ‘skip layers’: predictions from intermediate layers are interpolated and summed (interp + sum) to produce a dense output.
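A minimal sketch of the skip connection (assuming PyTorch; the layer names and sizes are illustrative, not FCN's exact ones): a coarse score map from a deep layer is upsampled and summed with scores from a shallower, higher-resolution layer.

```python
# A minimal sketch (assumed PyTorch, toy shapes) of the FCN "interp + sum" skip.
import torch
import torch.nn.functional as F

num_classes = 21
scores_deep = torch.randn(1, num_classes, 8, 8)       # coarse 'what' (e.g. from a deep layer)
scores_shallow = torch.randn(1, num_classes, 16, 16)  # finer 'where' (e.g. from a shallower layer)

# interp + sum: upsample the coarse predictions and fuse them with the finer ones
fused = scores_shallow + F.interpolate(scores_deep, size=scores_shallow.shape[-2:],
                                       mode="bilinear", align_corners=False)
print(fused.shape)  # (1, 21, 16, 16): a denser output than the deep scores alone
```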
semantic segmentation
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
[Figure: segmentations refine from stride 32 (no skips) to stride 16 (1 skip) to stride 8 (2 skips), shown next to the input image and ground truth.]
semantic segmentation
Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015
[Figure: surface normal prediction results.]
instance segmentation
Dai et al., ‘Instance-aware Semantic Segmentation via Multi-task Network Cascades’, arXiv 2015.
[Figure: results compared with ground truth.]
Generative Adversarial Networks
Fractionally strided convolutions (deconvolutions) can be used to generate images from noise.
Generative Adversarial Networks
Suppose I would like to generate images of horses. My generated horse images G(z) are produced from noise z.
I can train a discriminative network D to distinguish real horse images x from generated horse images G(z):
$\max_D \; \log D(x) + \log\bigl(1 - D(G(z))\bigr)$
[Figure: real horses x and generated horses G(z) are both fed to the discriminator D.]
Generative Adversarial Networks
I can then optimize my generative network G to fool the discriminative network:
$\min_G \max_D \; \log D(x) + \log\bigl(1 - D(G(z))\bigr)$
Generative Adversarial Networks
You can then re-optimize the discriminator network D, and so on, alternating the two updates:
$\min_G \max_D \; \log D(x) + \log\bigl(1 - D(G(z))\bigr)$
...until D gives in and can no longer tell real horses x from generated horses G(z).
Goodfellow et al., Generative Adversarial Nets, NIPS 2014
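A minimal sketch of this alternating scheme (assuming PyTorch; the generator, discriminator and data are toy stand-ins, not an image GAN): D is trained to separate real samples x from fakes G(z), then G is trained to fool D, and the two steps repeat.

```python
# A minimal sketch (assumed PyTorch, toy 2-D data) of alternating GAN updates.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # toy generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # toy discriminator
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x = torch.randn(128, 2) + 3.0          # "real" data (a shifted Gaussian)
    z = torch.randn(128, 16)               # noise fed to the generator

    # max_D  log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    loss_d = bce(D(x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
    loss_d.backward()
    opt_d.step()

    # min_G: train G to fool D (the usual non-saturating form of the G update)
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(128, 1))
    loss_g.backward()
    opt_g.step()
```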
Generative Adversarial Networks
Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016
Examples of generated bedrooms.
Interpolation between points in z.
summary semantic segmentation
slide credit: Li, Karpathy, Johnson
• Fully convolutional networks can be applied for efficient classification of all pixels.
• To get high-quality segmentations, deep features at multiple scales need to be combined (e.g. with skip layers).
• Upsampling can be done with deconvolution and unpooling operations.
• Instance segmentation can be performed by combining object detection and semantic segmentation pipelines.