TRANSCRIPT
Module 5
Deep ConvNets for Local Recognition
Joost van de Weijer, 4 April 2016
Previously, end-to-end..
Slide credit: Jose M Àlvarez
[Figure: an input image passes through a learned representation that outputs a label, e.g. "Dog".]

Part I: End-to-end learning (E2E): a representation is learned for Task A (e.g. image classification) on Domain A.
Part I': End-to-end fine-tuning (FT): the learned representation is transferred to Domain B and fine-tuned.
Previously, fine-tuning..
Slide credit: X. Giro
Fine-tuning a pre-trained network
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
Fine-tuning: high learning rate in the new layer, and low learning rate in all other layers.
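To make the learning-rate rule concrete, here is a minimal sketch (assuming PyTorch and torchvision, which the lecture does not prescribe): the last layer of a pre-trained AlexNet is replaced by a new one and given a higher learning rate than the pre-trained layers.

```python
# A minimal sketch (assumed PyTorch/torchvision setup) of fine-tuning with a
# high learning rate on the new layer and a low one on the pre-trained layers.
import torch
import torchvision

model = torchvision.models.alexnet(pretrained=True)
num_new_classes = 20                                           # e.g. the PASCAL classes
model.classifier[6] = torch.nn.Linear(4096, num_new_classes)   # replace the last layer

new_params = list(model.classifier[6].parameters())
old_params = [p for p in model.parameters() if all(p is not q for q in new_params)]

optimizer = torch.optim.SGD([
    {"params": old_params, "lr": 1e-4},   # low learning rate: pre-trained layers
    {"params": new_params, "lr": 1e-2},   # high learning rate: new layer
], momentum=0.9)
```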
Previously, off-the-shelf features..
Slide credit: X. Giro
Part I: End-to-end learning (E2E): a representation is learned for Task A (e.g. image classification).
Part II: Off-the-shelf features: the learned representation is reused for Task B (e.g. image retrieval).
Image classification: image as input, label (e.g. "Orange") as output.
• Spatially coded image representations (like spatial pyramids): d_x × d_y × d_F
• Orderless image representations (like BOW): 1 × 1 × d_F
Two deep lectures in M5
Deep ConvNets for Recognition at...
• Global Scale (today’s lecture)
• Local Scale (next lecture)

Image classification: image as an input, label (e.g. "Orange") as output.
How to process non-square images? Resize, zero padding, or largest centred square.
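As an illustration of the three options, here is a minimal sketch (assuming torchvision transforms; the image is a stand-in) for turning a non-square image into the square input the network expects.

```python
# A minimal sketch (assumed torchvision transforms) of the three options above.
from PIL import Image
from torchvision import transforms

img = Image.new("RGB", (640, 480))   # stand-in non-square image
size = 227

# 1) plain resize: distorts the aspect ratio
resized = transforms.Resize((size, size))(img)
# 2) zero padding: pad the short side to a square (CenterCrop pads with zeros
#    when the target is larger than the image), then resize
padded = transforms.Resize((size, size))(transforms.CenterCrop(max(img.size))(img))
# 3) largest centred square: crop the biggest central square, then resize
cropped = transforms.Resize((size, size))(transforms.CenterCrop(min(img.size))(img))
```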
Local object recognition
• object localization (single object)
• object detection
• semantic segmentation
Classification + Localization
slide credit: Li, Karpathy, Johnson
Localization as regression
slide credit: Li, Karpathy, Johnson
Add two heads to the network: a classification head and a regression head.
• Classification head: C class scores.
• Regression head: C×4 numbers (4 bounding-box coordinates per class).
Problem: multiple classes.
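A minimal sketch of the two-head idea (assuming PyTorch; the backbone and sizes are illustrative, not the network used in the slides): one shared feature extractor feeds both a classification head with C scores and a regression head with C×4 box numbers.

```python
# A minimal sketch (assumed PyTorch, toy backbone) of localization as regression.
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for the shared conv + FC trunk
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(16, num_classes)       # C class scores
        self.reg_head = nn.Linear(16, num_classes * 4)   # C x 4 box coordinates

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.reg_head(feat)

scores, boxes = ClassifyAndLocalize(num_classes=20)(torch.randn(1, 3, 224, 224))
print(scores.shape, boxes.shape)   # (1, 20) and (1, 80)
```

In training, the classification head gets a softmax/cross-entropy loss and the regression head an L2-type loss on the box coordinates.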
Localization as regression (example)
Esteve Cervantes, Evaluating deep features for Fashion Recognition (master project 2015): example of clothing localization. Regression is done in two steps: first the person bounding box, then the clothing bounding boxes.
Local object recognition
• object localization (single object)
• object detection
• semantic segmentation
Any ideas?
Sliding window
[Figure: a 227×227 window is slid over the image; at each position classification + regression produce a score (e.g. 0.03, 0.83) and a regressed bounding box.]
Compute a new regressed bounding box and classification score for all sliding-window positions.
Repeat for different scales and combine all results (e.g. with non-maximum suppression).
[Figure: detections at multiple scales with scores 0.83 and 0.99.]
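A minimal sketch of the combination step (assuming torchvision; the boxes and scores are made up): non-maximum suppression keeps the highest-scoring box among heavily overlapping detections.

```python
# A minimal sketch (assumed torchvision) of non-maximum suppression over
# regressed boxes and classification scores from several window positions.
import torch
from torchvision.ops import nms

# boxes in (x1, y1, x2, y2) format, e.g. regressed from three window positions
boxes = torch.tensor([[10., 10., 110., 110.],
                      [15., 12., 118., 115.],
                      [300., 40., 380., 120.]])
scores = torch.tensor([0.83, 0.99, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)   # indices of the surviving boxes
print(keep)   # tensor([1, 2]): the 0.99 box suppresses the overlapping 0.83 box
```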
Sliding window (efficient computation)

Let us for simplicity consider a simple three-layer network for a 10×10×3 input:
• conv1: 5 filters of 5×5×3. What are the spatial dimensions of the conv1 output? 6×6 (×5 channels).
• fc1: 10 units, each looking at the full 6×6×5 conv1 output → 1×1×10.
• fc2: 2 outputs (car / not car) → 1×1×2.

Now slide this 10×10 network over a larger 12×17 image. Part of the convolutional features are the same for overlapping windows and do not need recomputation!

How many 10×10 windows are there in this 12×17 image? (12 − 10 + 1) × (17 − 10 + 1) = 3 × 8 = 24.

The convolutions can be computed in a single pass: applying conv1 (5 filters of 5×5×3) to the full 12×17 image gives an 8×13×5 feature map.

The fully connected layers can also be rewritten as convolutions:
• fc1 = conv2: 10 filters of 6×6×5, applied to the 8×13×5 map → 3×8×10.
• fc2 = conv3: 2 filters of 1×1×10 → 3×8×2.

We obtain the 3×8 = 24 classification scores while sharing the computation of the convolutional features.
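To make the rewrite concrete, here is a minimal sketch (assuming PyTorch; the lecture does not name a framework) of the toy network above written fully convolutionally and applied to the 12×17 image in one pass.

```python
# A minimal sketch (assumed PyTorch) of the fully-convolutional trick: the toy
# network conv1 -> fc1 -> fc2 is rewritten so the fully connected layers become
# 6x6 and 1x1 convolutions, yielding one score per 10x10 window in a single pass.
import torch
import torch.nn as nn

fully_conv = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5),    # conv1: 5 filters of 5x5x3
    nn.ReLU(),
    nn.Conv2d(5, 10, kernel_size=6),   # fc1 rewritten as conv2: 10 filters of 6x6x5
    nn.ReLU(),
    nn.Conv2d(10, 2, kernel_size=1),   # fc2 rewritten as conv3: 2 filters of 1x1x10
)

x = torch.randn(1, 3, 12, 17)          # a 12x17 RGB image
scores = fully_conv(x)                 # shape (1, 2, 3, 8): 3x8 = 24 window scores
print(scores.shape)
```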
Sliding window (efficient computation)
Networks can be rewritten as fully convolutional networks to speed up computation at test time.
Example of bear and fish detection at multiple scales.
Sermanet et al., ‘Integrated Recognition, Localization and Detection using Convolutional Networks’, ICLR 2014
object proposals
selective search
K. Van de Sande et al. Segmentation as selective search for object recognition. ICCV 2011.
• Object proposal methods compute boxes which potentially contain an object.
• Features are extracted for each box and a classifier is applied.
• Typically thousands of boxes (but far fewer than sliding window).
• Many different approaches: selective search, edge boxes, GOP, etc.
object proposals (R-CNN)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
1. Compute object proposals (~2k).
2. Warp the dilated bounding box.
3. Compute CNN features.
4. Classify regions (car: yes, person: no) and perform bounding box regression.
object proposals (R-CNN)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
• Start from AlexNet; remove the last layer and fine-tune for the 20 PASCAL classes.
• Use the fc7 4096-d vector as the description of the bounding box.
• Train an SVM on this representation for classification.
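A minimal sketch of this recipe (assuming torchvision and scikit-learn; the warped regions and labels are toy data, and the feature extractor is my own slicing of torchvision's AlexNet, not R-CNN's exact code): warped proposal crops are mapped to fc7 descriptors, and a linear SVM is trained on them.

```python
# A minimal sketch (assumed torchvision + scikit-learn) of warped boxes ->
# AlexNet fc7 (4096-d) -> linear SVM, following the R-CNN recipe above.
import torch
import torchvision
from sklearn.svm import LinearSVC

alexnet = torchvision.models.alexnet(pretrained=True).eval()
# keep everything up to (and including) the fc7 activation, drop the final classifier
fc7_extractor = torch.nn.Sequential(alexnet.features, alexnet.avgpool,
                                    torch.nn.Flatten(),
                                    *list(alexnet.classifier.children())[:-1])

with torch.no_grad():
    warped_regions = torch.randn(8, 3, 227, 227)     # 8 warped proposal crops (toy data)
    fc7 = fc7_extractor(warped_regions)              # (8, 4096) descriptors

labels = [1, 0, 1, 0, 0, 1, 0, 1]                    # toy car / not-car labels
svm = LinearSVC().fit(fc7.numpy(), labels)           # per-class SVM on fc7 features
```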
object proposals (R-CNN)
Slide credit: Girshick; Li, Karpathy, Johnson
[Figure: R-CNN detection examples.]
object proposals (R-CNN)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
1. Compute object proposals (~2k).
2. Warp the dilated bounding box.
3. Compute CNN features.
4. Classify regions (car: yes, person: no) and regress an improved bounding box.
Drawbacks:
• not end-to-end
• warping of boxes
• lots of double computation (overlap of bounding boxes)
object proposals (Fast R-CNN)
He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015
• Compute the convolutional features ('conv1'-'conv5') once per image; this computation is shared.
object proposals (Fast R-CNN)
This was first proposed by He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015
• Compute the convolutional features once (shared computation).
• Extract features from 'conv5' for all bounding boxes.
object proposals (Fast R-CNN)
• For all bounding boxes: pool the features into a spatial grid, i.e. Region of Interest pooling (ROI pooling), on top of the shared computation.
object proposals (Fast R-CNN)
• Pool the features in a spatial grid (ROI pooling).
• Fully connected layers (FCs) feed a classification head (log loss) and a regression head (smooth L1 loss).
• The whole network is trained end-to-end, with the convolutional computation shared across boxes.
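A minimal sketch of ROI pooling (assuming torchvision's `roi_pool`; the feature map, stride and box are made up): conv5 features are computed once, and each proposal is pooled into a fixed spatial grid before the FC layers.

```python
# A minimal sketch (assumed torchvision) of Region of Interest pooling.
import torch
from torchvision.ops import roi_pool

conv5 = torch.randn(1, 256, 40, 60)            # shared convolutional features
# one proposal: (batch_index, x1, y1, x2, y2) in input-image coordinates
boxes = torch.tensor([[0., 32., 48., 352., 240.]])

# spatial_scale maps image coordinates to conv5 coordinates (e.g. stride 16)
pooled = roi_pool(conv5, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # (1, 256, 7, 7): a fixed-size descriptor per box
```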
object proposals (Fast R-CNN)

                     Fast R-CNN   R-CNN
Train time (hours)   9.5          84
Train speedup        8.8x         -
Test time / image    0.32s        47s
Test speedup         146x         -
mAP                  66.9%        66.0%
Multi-task training also improves classification performance; end-to-end training improves results.
Test time does not include object proposal computation (which is now the bottleneck)
object proposals (Faster R-CNN)
• Compute the object proposals directly in the network: a Region Proposal Network (RPN) operates on the shared 'conv5' features, followed by ROI pooling and the fully connected layers (FCs).
object proposals (Faster R-CNN)
Slide credit: Kaiming He
• Slide a window over the feature map.
• Add a small network which classifies and regresses the bounding boxes.
• The classification score provides the confidence that an object is present.
• Use N anchors per position for proposals of varying aspect ratios.
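A minimal sketch of the anchor idea (the strides, sizes and aspect ratios are my own toy numbers, not Faster R-CNN's exact configuration): at every position of the conv feature map, N reference boxes of different aspect ratios are placed, and the RPN then classifies and regresses each one.

```python
# A minimal sketch (toy numbers) of generating N anchors per feature-map position.
import torch

stride = 16                          # conv5 positions are ~16 px apart in the image
aspect_ratios = [0.5, 1.0, 2.0]      # N = 3 anchors per position
base_size = 128.0                    # anchor area ~ base_size**2

anchors = []
for y in range(3):                   # a tiny 3x4 feature map, for brevity
    for x in range(4):
        cx, cy = x * stride + stride / 2, y * stride + stride / 2
        for ar in aspect_ratios:     # same area, different width/height ratio
            w, h = base_size * ar ** 0.5, base_size / ar ** 0.5
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

anchors = torch.tensor(anchors)      # (3*4*3, 4) boxes for the RPN to score and regress
print(anchors.shape)
```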
object proposals (Faster R-CNN)
Slide credit: Kaiming He
Computation for 1000 boxes:

Model                      Time
Edge boxes + R-CNN         0.25 sec + 1000*ConvTime + 1000*FcTime
Edge boxes + Fast R-CNN    0.25 sec + 1*ConvTime + 1000*FcTime
Faster R-CNN               1*ConvTime + 1000*FcTime
object proposals (Faster R-CNN)
slide credit: Li, Karpathy, Johnson
object localization
Winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015: residual networks combined with Faster R-CNN.
summary object detection
slide credit: Li, Karpathy, Johnson
• Object localization: when there is one (or a known number) of objects/classes, you can do object localization by adding a ‘regression head’ to your network.
• Sliding window + CNN can be computed efficiently by rewriting the network as a fully convolutional network.
• Object proposal methods are straightforwardly combined with CNNs, but for fast/good results consider:
  • adding a regression head to improve bounding box estimation;
  • sharing the computation of the convolutional features (SPP);
  • end-to-end training of the network (Fast R-CNN);
  • including a Region Proposal Network for fast object proposals within the network (Faster R-CNN).
Local object recognition
• object localization (single object)
• object detection
• semantic segmentation
semantic segmentation
• Semantic segmentation: assign a class to all pixels.
• Instance segmentation: assign pixels to a particular instance of a class (chair 1, chair 2, etc.).

A ConvNet applied to a patch predicts the class of its center pixel. Because of the convolutions, the output resolution is smaller than the input and upsampling is required. Write the network as a fully convolutional network and apply it to the whole image.
semantic segmentation
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Training uses a pixelwise loss.
semantic segmentation
[Figure: a 3×3 convolution with padding [1 1 1 1] and stride [1 1] preserves the spatial resolution of the input; with stride [2 2] the output resolution is halved.]
[Figure: a 3×3 deconvolution with padding [1 1 1 1] and stride [2 2] doubles the resolution of the input.]
• Deconvolutions are also called fractionally strided convolutions or transposed convolutions.
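A minimal sketch contrasting the two operations (assuming PyTorch; the sizes are illustrative): a strided convolution halves the spatial resolution, and a transposed convolution with the same stride brings it back.

```python
# A minimal sketch (assumed PyTorch) of strided convolution (downsampling) vs
# transposed ("de-")convolution (upsampling).
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

down = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(8, 8, kernel_size=3, stride=2, padding=1, output_padding=1)

y = down(x)   # (1, 8, 16, 16): stride 2 halves the resolution
z = up(y)     # (1, 8, 32, 32): the transposed conv restores the resolution
print(y.shape, z.shape)
```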
semantic segmentation
Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
semantic segmentation
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Combine the ‘where’ (local, shallow layers) with the ‘what’ (global, deep layers) using ‘skip layers’: predictions from intermediate layers are interpolated and summed (interp + sum) to produce a dense output.
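A minimal sketch of the skip connection (assuming PyTorch; the layer names and sizes are illustrative, not FCN's exact ones): a coarse score map from a deep layer is upsampled and summed with scores from a shallower, higher-resolution layer.

```python
# A minimal sketch (assumed PyTorch, toy shapes) of the FCN "interp + sum" skip.
import torch
import torch.nn.functional as F

num_classes = 21
scores_deep = torch.randn(1, num_classes, 8, 8)       # coarse 'what' (e.g. from a deep layer)
scores_shallow = torch.randn(1, num_classes, 16, 16)  # finer 'where' (e.g. from a shallower layer)

# interp + sum: upsample the coarse predictions and fuse them with the finer ones
fused = scores_shallow + F.interpolate(scores_deep, size=scores_shallow.shape[-2:],
                                       mode="bilinear", align_corners=False)
print(fused.shape)  # (1, 21, 16, 16): a denser output than the deep scores alone
```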
semantic segmentation
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
[Figure: segmentations refine from stride 32 (no skips) to stride 16 (1 skip) to stride 8 (2 skips), shown next to the input image and ground truth.]
semantic segmentation
Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015
[Figure: surface normal prediction results.]
instance segmentation
Dai et al., ‘Instance-aware Semantic Segmentation via Multi-task Network Cascades’, arXiv 2015.
[Figure: results compared with ground truth.]
Generative Adversarial Networks
Fractionally strided convolutions (deconvolutions) can be used to generate images from noise.
Generative Adversarial Networks
Suppose I would like to generate images of horses. My generated horse images G(z) are produced from noise z.
I can train a discriminative network D to distinguish real horse images x from generated horse images G(z):
$\max_D \; \log D(x) + \log\bigl(1 - D(G(z))\bigr)$
[Figure: real horses x and generated horses G(z) are both fed to the discriminator D.]
Generative Adversarial Networks
I can then optimize my generative network G to fool the discriminative network:
$\min_G \max_D \; \log D(x) + \log\bigl(1 - D(G(z))\bigr)$
Generative Adversarial Networks
You can then re-optimize the discriminator network D, and so on, alternating the two updates:
$\min_G \max_D \; \log D(x) + \log\bigl(1 - D(G(z))\bigr)$
...until D gives in and can no longer tell real horses x from generated horses G(z).
Goodfellow et al., Generative Adversarial Nets, NIPS 2014
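A minimal sketch of this alternating scheme (assuming PyTorch; the generator, discriminator and data are toy stand-ins, not an image GAN): D is trained to separate real samples x from fakes G(z), then G is trained to fool D, and the two steps repeat.

```python
# A minimal sketch (assumed PyTorch, toy 2-D data) of alternating GAN updates.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # toy generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # toy discriminator
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x = torch.randn(128, 2) + 3.0          # "real" data (a shifted Gaussian)
    z = torch.randn(128, 16)               # noise fed to the generator

    # max_D  log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    loss_d = bce(D(x), torch.ones(128, 1)) + bce(D(G(z).detach()), torch.zeros(128, 1))
    loss_d.backward()
    opt_d.step()

    # min_G: train G to fool D (the usual non-saturating form of the G update)
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(128, 1))
    loss_g.backward()
    opt_g.step()
```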
Generative Adversarial Networks
Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016
Examples of generated bedrooms.
Interpolation between points in z.
summary semantic segmentation
slide credit: Li, Karpathy, Johnson
• Fully convolutional networks can be applied for efficient classification of all pixels.
• To get high-quality segmentations, deep features at multiple scales need to be combined (e.g. with skip layers).
• Upsampling can be done with deconvolution and unpooling operations.
• Instance segmentation can be performed by combining object detection and semantic segmentation pipelines.