Deep Learning for Computer Vision (2/4): Object Analytics @ La Salle 2016
TRANSCRIPT
Xavier Giró-i-Nieto, "Deep learning for vision: Objects", Master in Multimedia, La Salle URL (May 2016) @DocXavi
Deep Learning for Computer Vision: Object Analytics, 5 May 2016
Xavier Giró-i-Nieto
Master en Creació Multimedia
One lecture organized in three parts
Deep ConvNets for Recognition for: Images (global), Objects (local), Video (2D+T)
One lecture organized in four parts
Local analysis for: Proposals, Detection (person, bag), Recognition (me, my bag), Segmentation (person, bag)
Proposals Hand-crafted
Slides credit Marc Bolaños
Object proposals used to be generated by hand-crafted, bottom-up methods:
Selective Search (SS) and Multiscale Combinatorial Grouping (MCG)
[SS] Uijlings, Jasper R. R., Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. "Selective search for object recognition." International Journal of Computer Vision 104, no. 2 (2013): 154-171.
[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. "Multiscale combinatorial grouping." CVPR 2014.
Proposals DeepBox
Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. "DeepBox: Learning objectness with convolutional networks." ICCV 2015. [software]
Proposals DeepBox
Slides credit Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard some) using DeepBox
Proposals DeepBox Architecture
Slides credit Marc Bolaños
AlexNet architecture (heavier): PASCAL VOC AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): PASCAL VOC AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7
Small drop in AUC with the lighter architecture
Proposals DeepBox Training
Slides credit Marc Bolaños
1) Initialize layers with AlexNet weights
2) Train on sliding windows
Negative samples: extract windows by raster scanning
Positive samples: given the GT bounding boxes, generate perturbed samples per instance
3) Train on hard negatives, using bottom-up proposals from Edge Boxes
If GT overlap <= 0.3 → negative sample
If GT overlap >= 0.7 → positive sample
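The overlap rule above can be sketched in a few lines. This is only an illustration: the function names `iou` and `label_window`, the (x1, y1, x2, y2) box format, and the "ignored" label for windows that fall between the two thresholds are assumptions, not details taken from the slides.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a candidate window by its best overlap with any ground-truth box."""
    best = max((iou(window, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between windows are not used for training
```

With these defaults, a window that coincides with a ground-truth box is labelled positive, and a window that overlaps nothing is labelled negative.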
Proposals DeepBox Results
[Figure: side-by-side proposals from DeepBox and Edge Boxes]
Slides credit Marc Bolaños
Proposals DeepBox Results
With a rather simple approach, ConvNets obtain much better results than previous techniques for object proposals
Slides credit Marc Bolaños
Proposals DeepBox Results
DeepBox improves detection not only for known classes but also for unknown ones (suitable for object discovery)
Slides credit Marc Bolaños
Detection Objects
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
+60
Slide credit Amaia Salvador
Detection Objects
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.
ConvNets (CNNs) actually learn detectors similar to those learned by Deformable Part-based Models (DPMs)
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
Detection Objects R-CNN
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015.
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
Same as SPP [3] but single scale
Detection Objects Fast R-CNN
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
RoI pooling over the CONV5 feature map: a region of interest of size h x w is max pooled into an H' x W' grid
Size of pooling bins: h/H' x w/W'
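A minimal NumPy sketch of the pooling described above, assuming a (C, H, W) feature map and an RoI already expressed in feature-map cells. The function name `roi_max_pool`, the argument layout and the way fractional bin edges are rounded are illustrative choices, not the Fast R-CNN implementation.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max pool an RoI of a (C, H, W) feature map into a (C, out_h, out_w) grid.

    roi is (y0, x0, y1, x1) in feature-map cells, end-exclusive.
    Each bin covers roughly h/out_h rows and w/out_w columns.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    _, h, w = region.shape
    ys = np.linspace(0, h, out_h + 1)  # fractional bin edges along rows
    xs = np.linspace(0, w, out_w + 1)  # fractional bin edges along columns
    out = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)  # every bin has at least one row
        for j in range(out_w):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Because the output grid has a fixed size whatever the RoI size, the fully-connected layers that follow can be applied to every proposal.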
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
Multi-task loss
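The slide only names the loss. As a reminder, and as best recalled from the Fast R-CNN paper rather than from this transcript, the multi-task loss combines a classification term and a box-regression term that is active only for non-background RoIs. The symbols below (p for the predicted class distribution, u for the true class, t^u for the predicted box offsets of class u, v for the regression target, lambda for the balancing weight) follow that paper's notation.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_{u}
```

Here [u >= 1] equals 1 for object classes and 0 for the background class u = 0, so background RoIs contribute only to the classification term.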
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Object proposal computation is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Replace external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Diagram components: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice k = 9 anchors per location (3 different scales and 3 aspect ratios)
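To make the k = 9 count concrete, the sketch below enumerates 3 scales x 3 aspect ratios around one location. The specific scale values (128, 256, 512 pixels) and ratios (0.5, 1, 2) are assumptions recalled from the Faster R-CNN paper, not stated on the slide, and `make_anchors` is a hypothetical helper name.

```python
import itertools
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor boxes centred at (cx, cy).

    Each anchor is (x1, y1, x2, y2); ratio is height / width and the
    anchor area stays close to scale ** 2 for every ratio.
    """
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale / math.sqrt(ratio)
        h = scale * math.sqrt(ratio)
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

For every anchor, the RPN predicts one objectness score pair and one set of box-regression offsets.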
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
4-step training to share features between the RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model
ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1
ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
Weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
Weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]
Detection Objects Reinforcement L
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement L
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of states S
State s = (o, h)
o = feature vector from the fc6 layer of a pre-trained CNN (4096 dimensions)
h = history of taken actions, a binary vector (90 dimensions)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R, computed against the ground-truth bounding box
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Reward magnitude: 3
Minimum IoU: 0.6
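The two reward functions can be sketched as follows. The exact form is an assumption based on a reading of the Caicedo and Lazebnik paper (a sign-of-improvement reward for transformation actions, and a reward of plus or minus 3 for the trigger depending on whether the final IoU reaches 0.6); the function and parameter names are hypothetical.

```python
def movement_reward(iou_before, iou_after):
    """Reward for a transformation action.

    +1 if the new box overlaps the ground truth more than the old one,
    -1 otherwise (assumed binary form of the sign-of-improvement reward).
    """
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou_final, eta=3.0, tau=0.6):
    """Reward for the trigger action.

    +eta if the final box reaches the minimum IoU tau with the ground truth,
    -eta otherwise. eta = 3 and tau = 0.6 match the values on the slide.
    """
    return eta if iou_final >= tau else -eta
```

The larger magnitude of the trigger reward encourages the agent to stop only once the box is well aligned with the object.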
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
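The greedy policy amounts to an argmax over the network outputs. In the sketch below a single linear layer stands in for the Q-network purely for illustration; the real network, its training with replay memory, and the names `q_values` and `greedy_action` are not taken from the slides.

```python
import numpy as np


def q_values(state, weights, bias):
    """Stand-in Q-network: one linear layer with one output unit per action."""
    return weights @ state + bias


def greedy_action(state, weights, bias):
    """Select the action with the maximum estimated action value."""
    return int(np.argmax(q_values(state, weights, bias)))
```

In the setting described above, `state` would be the concatenation of the 4096-dimensional feature vector o and the 90-dimensional history vector h.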
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
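"3 times per octave" means consecutive scale factors differ by a ratio of 2^(1/3), so three steps double (or halve) the image size. A small sketch, where the function name and the choice of scale range are assumptions:

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2 ** (k / steps_per_octave) that fall in [min_scale, max_scale]."""
    # Small epsilons guard against floating-point error at the range ends.
    k_min = math.ceil(steps_per_octave * math.log2(min_scale) - 1e-9)
    k_max = math.floor(steps_per_octave * math.log2(max_scale) + 1e-9)
    return [2.0 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]
```

Calling `pyramid_scales(0.5, 2.0)` gives seven factors spanning two octaves, from half size to double size, with the original size in the middle.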
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the ConvNet on images of any size
- Obtain a heat-map of the face detector
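The conversion can be illustrated with plain NumPy: a fully-connected layer trained on k x k feature patches is reshaped into convolution kernels and slid over a larger feature map, which produces one score per location, that is, a heat-map. The function name `fc_as_conv` and the shapes are assumptions for illustration; a real detector would use a deep-learning framework's convolution instead of Python loops.

```python
import numpy as np


def fc_as_conv(feature_map, fc_weights, k):
    """Apply an FC layer as a convolution.

    feature_map: (C, H, W) array with H >= k and W >= k.
    fc_weights:  (n_out, C * k * k) weights of an FC layer trained on
                 flattened (C, k, k) inputs.
    Returns a (n_out, H - k + 1, W - k + 1) heat-map.
    """
    c, h, w = feature_map.shape
    kernels = fc_weights.reshape(-1, c, k, k)  # one k x k kernel per output unit
    out = np.empty((kernels.shape[0], h - k + 1, w - k + 1))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            patch = feature_map[:, y:y + k, x:x + k]
            out[:, y, x] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out
```

On an input of exactly k x k the output is a single location equal to the original FC output; on larger inputs each location equals the FC output for the corresponding window, without recomputing overlapping convolutions in the layers below.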
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
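A common greedy form of NMS is sketched below: keep the highest-scoring box, discard the remaining boxes that overlap it too much, and repeat. This is a generic version, not the exact code of the cited tutorial or of DDFD; the IoU threshold of 0.3 is an arbitrary illustrative default.

```python
import numpy as np


def nms(boxes, scores, iou_thr=0.3):
    """Greedy non-maximum suppression.

    boxes:  non-empty sequence of (x1, y1, x2, y2) with positive area.
    scores: one detection score per box.
    Returns the indices of the kept boxes, highest score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]  # drop boxes that overlap the kept one
    return keep
```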
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
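The triplets mentioned above feed a triplet loss, which pulls an anchor face towards a positive (same identity) and pushes it away from a negative (different identity) by at least a margin. The sketch below assumes the embeddings are already computed; the margin of 0.2 is recalled from the FaceNet paper rather than stated in these slides, and the selection of "well-chosen" triplets is not shown.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0), computed per triplet.

    Inputs are arrays of embeddings with the embedding dimension last.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by more than the margin, which is why easy triplets contribute nothing to training.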
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query
"A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query
"This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word   Image ID
1      1, 12
2      1, 30, 102
3      10, 12
4      23, 6, 10
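Because bag-of-visual-words vectors are highly sparse, retrieval systems typically store them in an inverted file that maps each visual word to the images containing it. The sketch below is a bare-bones version with simple vote counting; the function names are hypothetical, and real systems usually add tf-idf weighting and normalisation.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it.

    image_words: dict {image_id: iterable of visual-word IDs}.
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a deterministic order.
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))
```

Only the posting lists of the words present in the query are visited, which is what makes the search scale to large databases.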
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks
sum/max pooled conv features as global representation
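Sum or max pooling collapses the spatial dimensions of a convolutional feature map into one value per channel. A minimal sketch follows; the final L2 normalisation is a common practice assumed here, not something stated on the slide, and the cited papers add further steps (weighting, whitening, regional pooling) that are not shown.

```python
import numpy as np


def global_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) conv feature map into an L2-normalised C-dimensional vector."""
    if mode == "sum":
        vector = conv_features.sum(axis=(1, 2))
    elif mode == "max":
        vector = conv_features.max(axis=(1, 2))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
```

Two images can then be compared with a dot product between their descriptors.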
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Input resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are shared by the three ConvNets: theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
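A pixel-wise soft-max simply applies the soft-max over the class dimension independently at every pixel. A minimal NumPy sketch, assuming the class scores are stored as a (K, H, W) array (a layout chosen here for illustration):

```python
import numpy as np


def pixelwise_softmax(scores):
    """Turn (K, H, W) class scores into per-pixel class probabilities."""
    # Subtracting the per-pixel maximum keeps the exponentials numerically stable.
    shifted = scores - scores.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)
```

Taking the argmax over the first axis of the result gives one label per pixel. Because each pixel is classified independently, nothing enforces that neighbouring pixels agree.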
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
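The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost. The tree encoding (a `parent` dictionary), the cost values and the function name below are hypothetical; how the cost S is computed for each component is not covered here.

```python
def optimal_cover(parent, cost, label, leaves):
    """For each leaf, pick the minimal-cost component on its path to the root.

    parent: dict {node: parent node}; the root has no entry.
    cost:   dict {node: cost S of that component}.
    label:  dict {node: predicted class of that component}.
    Returns {leaf: (best component, its label)}.
    """
    result = {}
    for leaf in leaves:
        best = leaf
        node = leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent.get(node)  # None once the root has been visited
        result[leaf] = (best, label[best])
    return result
```

For example, if a pixel lies inside C2, C2 lies inside C5, and C5 has the lowest cost on the path, the pixel takes the class of C5.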
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector A: the BBOX CNN is applied to the bounding box (feature vector 1) and to the region with the background masked out with the mean image (feature vector 2); both are concatenated as [1 2]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector B: BBOX CNN (feature vector 1) and REGION CNN (feature vector 2), concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Penultimate fully connected layer → SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
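The pixel IU (Jaccard index) used above is the number of pixels in the intersection of the predicted and ground-truth regions divided by the number of pixels in their union. A minimal sketch in plain Python, assuming binary masks stored as nested lists of 0/1; the name `pixel_iu` and the toy masks are illustrative and do not come from the SDS code.

```python
def pixel_iu(pred_mask, gt_mask):
    """Intersection over union (Jaccard index) of two binary masks."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Two empty masks have no union; report 0.0 by convention (an assumption).
    return intersection / union if union else 0.0


# Toy example: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(pixel_iu(pred, gt))  # 0.5
```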
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Proposals DeepBox
Slides credit: Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox
Proposals DeepBox Architecture
Slides credit: Marc Bolaños
AlexNet architecture (heavier): PASCAL VOC, AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): PASCAL VOC, AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7
Small drop
Proposals DeepBox Training
Slides credit: Marc Bolaños
1) Initialize layers with AlexNet weights
2) Train on sliding windows
Negative samples: extract windows by raster scanning
Positive samples: having GT bounding boxes, they generate samples per instance with a perturbation of
3) Train on hard negatives, using bottom-up proposals from Edge Boxes
If overlap with GT <= 0.3 → negative samples
If overlap with GT >= 0.7 → positive samples
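The overlap rule of the hard-negative stage can be sketched as follows. This is a minimal illustration, assuming boxes are `(x1, y1, x2, y2)` tuples and that overlap is measured as IoU against the best-matching ground-truth box; `box_iou` and `label_proposal` are hypothetical helper names, not functions from the DeepBox software.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal from its best overlap with any ground-truth box."""
    best = max((box_iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return None  # ambiguous overlap: not used for training in this sketch


gt_boxes = [(0, 0, 10, 10)]
print(label_proposal((0, 0, 10, 9), gt_boxes))    # positive (IoU 0.9)
print(label_proposal((20, 20, 30, 30), gt_boxes))  # negative (IoU 0.0)
```

Proposals whose overlap falls between the two thresholds are simply skipped here; how the original training handles them is not stated on the slide.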
Proposals DeepBox Results
Slides credit: Marc Bolaños
[Figure: proposals from DeepBox vs. Edge Boxes]
With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals
Increasing not only detection capabilities for known classes but also for unknown ones (suitable for object discovery)
Detection Objects
Slide credit: Amaia Salvador
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
+60
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
ConvNets (CNNs) actually learn similar detectors to the ones learned by Deformable Part-based Models (DPMs)
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015.
Slide credit: Amaia Salvador
Same as SPP [3], but single scale
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
RoI pooling on CONV5: each h x w region of interest is divided into H' x W' pooling bins
Size of pooling bins: h/H' x w/W'
Max pooling inside each bin
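The bin arithmetic of RoI pooling can be illustrated with a small single-channel sketch. It assumes the feature map is a nested list, the RoI is given as `(top, left, h, w)` in feature-map cells, and h >= H' and w >= W' so that no bin is empty; `roi_max_pool` is a hypothetical name, and real implementations work on multi-channel tensors.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region into a fixed out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers RoI rows [floor(i*h/out_h), ceil((i+1)*h/out_h)).
        r0 = top + (i * h) // out_h
        r1 = top + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
# A 4 x 4 RoI pooled to 2 x 2: each bin is 2 x 2 cells.
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the region, the output grid has a fixed size, which is what lets the fully connected layers that follow accept any proposal.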
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: Conv layers → Conv Layer 5 → RPN → RPN proposals → RoI pooling layer → FC layers → class scores / class probabilities]
For each of the k anchors, the RPN predicts objectness scores (object / no object) and a bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
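The k = 9 anchors at one location can be enumerated as below. This is a sketch under assumptions: the scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the values reported in the Faster R-CNN paper rather than on the slide, each anchor keeps an area of scale squared, and `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred at (cx, cy).

    Each anchor is (x1, y1, x2, y2); ratio is height / width (an assumption).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the area equal to scale**2 while changing the aspect ratio.
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


anchors = make_anchors(0, 0)
print(len(anchors))  # 9
print(anchors[1])    # (-64.0, -64.0, 64.0, 64.0), the square 128 x 128 anchor
```

The RPN then scores and regresses each of these boxes at every position of the Conv Layer 5 feature map.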
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Slide credit: Míriam Bellver
Object is localized based on visual features from AlexNet FC6
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R: computed from the current bounding box and the ground-truth bounding box
Reward function R for the trigger action: reward of magnitude 3, with a minimum IoU of 0.6
The reward function considers the number of steps as a cost
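The two rewards can be sketched as follows. The slide only preserves the numbers 3 and 0.6, so the exact form is an assumption: here a transformation action earns +1 if it increases the IoU with the ground truth and -1 otherwise, and the trigger earns +3 if the final IoU reaches 0.6 and -3 otherwise. `box_iou`, `move_reward` and `trigger_reward` are hypothetical names.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the transformation improved the IoU, else -1."""
    return 1 if box_iou(new_box, gt_box) > box_iou(old_box, gt_box) else -1


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """Assumed form: +eta if the final box overlaps enough, else -eta."""
    return eta if box_iou(box, gt_box) >= min_iou else -eta


gt = (0, 0, 10, 10)
print(move_reward((0, 0, 10, 5), (0, 0, 10, 8), gt))  # 1
print(trigger_reward((0, 0, 10, 5), gt))               # -3.0
```

Because every unproductive move is penalised, an agent that maximises the accumulated reward is pushed towards short search sequences.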
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
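Two of the ingredients named above, the greedy policy and the replay memory, can be sketched without any deep learning library. This is an illustration only: the Q-values are assumed to be a plain list with one entry per action, as the network's output layer would provide, and `greedy_action` and `ReplayMemory` are hypothetical names.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        # Once full, the oldest experience is dropped automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch used to update the Q-network."""
        pool = list(self.buffer)
        return random.sample(pool, min(batch_size, len(pool)))

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(("state", step), step % 2, 1, ("state", step + 1))
print(len(memory))                     # 2
print(greedy_action([0.1, 0.9, 0.3]))  # 1
```

Sampling past experiences at random, instead of learning only from the latest step, reduces the correlation between consecutive training samples.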
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
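Rescaling 3 times per octave means that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch that lists the scale factors; the scale range and the name `pyramid_scales` are assumptions for illustration.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) inside [min_scale, max_scale]."""
    scales = []
    # Smallest integer k whose scale is not below min_scale.
    k = math.ceil(steps_per_octave * math.log2(min_scale) - 1e-9)
    while True:
        scale = 2.0 ** (k / steps_per_octave)
        if scale > max_scale + 1e-9:
            break
        scales.append(scale)
        k += 1
    return scales


# From half size to double size: two octaves, hence 7 scales.
for scale in pyramid_scales(0.5, 2.0):
    print(round(scale, 3))
```

The detector is run once per rescaled image, so a fixed-size window matches larger faces in the shrunken images and smaller faces in the enlarged ones.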
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
This makes it possible to:
- Efficiently run the ConvNet on images of any size
- Obtain a heat-map of the face detector
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
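Non-Maximum Suppression can be sketched as a greedy loop: take detections in decreasing score order and keep one only if it does not overlap too much with a detection already kept. The version below is a plain-Python illustration, not the code from the cited tutorial; the 0.3 overlap threshold and the names `box_iou` and `nms` are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thr=0.3):
    """Greedy NMS over a list of (score, box) pairs."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep the box only if it overlaps little with every kept box.
        if all(box_iou(box, kept_box) <= iou_thr for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [(0.9, (0, 0, 10, 10)),
              (0.8, (1, 1, 11, 11)),    # overlaps the first box, suppressed
              (0.7, (50, 50, 60, 60))]  # far away, kept
print(nms(detections))
# [(0.9, (0, 0, 10, 10)), (0.7, (50, 50, 60, 60))]
```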
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces → FaceNet → Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
...by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
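The triplets mentioned above drive a triplet loss: the embedding of an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. A minimal sketch on plain lists; the margin of 0.2 is the value reported in the FaceNet paper rather than on the slide, and the function names are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
# Easy triplet: the negative is already far away, so the loss is zero.
print(triplet_loss(anchor, positive, [3.0, 0.0]))  # 0.0
# Hard triplet: the negative is as close as the positive, so the loss is positive.
print(triplet_loss(anchor, positive, [1.0, 0.0]) > 0)  # True
```

Easy triplets contribute nothing to the gradient, which is why the choice of triplets matters for training.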
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
Visual query over an image database: "A dog" → expected outcome
Visual query over an image database: "This dog" → expected outcome
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: [table mapping each visual word to the IDs of the images that contain it]
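Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the images in which it occurs. The sketch below builds such an index and ranks images by the number of visual words shared with a query. The image IDs, word IDs, vote-counting score and function names are made up for illustration; real systems typically weight the votes (for example with tf-idf).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words quantised from its local features.
database = {1: [1, 2], 10: [3, 6], 12: [1, 3]}
index = build_inverted_file(database)
print(sorted(index[1]))       # [1, 12]
print(search(index, [1, 3]))  # [(12, 2), (1, 1), (10, 1)]
```

Only the images that share at least one word with the query are ever touched, which is what makes the search scalable.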
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
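Sum or max pooling turns the C feature maps of a convolutional layer into a single C-dimensional image descriptor by collapsing the spatial dimensions. A sketch on nested lists; the L2 normalisation step and the name `pool_conv_features` are assumptions for illustration, and the cited papers add their own weighting and post-processing.

```python
import math


def pool_conv_features(feature_maps, mode="sum", normalize=True):
    """Collapse C feature maps (each a 2-D list) into a C-dim descriptor."""
    reduce_fn = sum if mode == "sum" else max
    descriptor = [reduce_fn(value for row in channel for value in row)
                  for channel in feature_maps]
    if normalize:
        norm = math.sqrt(sum(d * d for d in descriptor))
        if norm > 0:
            descriptor = [d / norm for d in descriptor]
    return descriptor


# Two channels of a toy 2 x 2 convolutional layer.
maps = [[[1, 2], [3, 4]],
        [[0, 0], [0, 5]]]
print(pool_conv_features(maps, "sum", normalize=False))  # [10, 5]
print(pool_conv_features(maps, "max", normalize=False))  # [4, 5]
```

With normalised descriptors, images can be compared with a dot product, whatever the input resolution was.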
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
Objects Recognition Retrieval Query Representation
Global Search (GS)
Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three ConvNets: theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network
Solution 2: Superpixels + CRF. Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Example: C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
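The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost S. The tree, the cost values and the name `optimal_cover` are hypothetical; the costs were chosen so that the result reproduces the slide's example, in which C2 takes the class of C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost component on its path to the root.

    parent maps each node to its parent (the root maps to None);
    cost maps each node to its cost S.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best


# Hypothetical tree: C5 merges C1 and C2; the root C6 merges C5 and C3.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C5": 0.3, "C6": 0.8}
print(optimal_cover(parent, cost, ["C1", "C2", "C3"]))
# {'C1': 'C1', 'C2': 'C5', 'C3': 'C3'}
```

Each leaf then receives the class predicted for its selected component, so different pixels can be labelled at different levels of the tree.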
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
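As a reminder of what the pixel IU (Jaccard index) measures, here is a minimal sketch on two toy label maps. The function name `pixel_iu` and the example maps are assumptions for illustration; the official benchmark code accumulates the counts over a whole dataset.

```python
def pixel_iu(pred, gt, category):
    """Jaccard index (intersection over union) of one category over two label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt   # pixel labelled `category` in both maps
            union += in_pred or in_gt    # pixel labelled `category` in at least one map
    return inter / union if union else 0.0


# Toy 2x3 label maps: 1 = object category, 0 = background.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
score = pixel_iu(pred, gt, category=1)
```

Here two pixels are shared and four are covered by either map, so the score is 2/4.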
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
One lecture organized in four parts
118
Local analysis for: Proposals, Detection, Recognition, Segmentation
119
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Proposals DeepBox
7
Slides credit Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox
Proposals DeepBox Architecture
8
Slides credit Marc Bolaños
AlexNet architecture (heavier): PASCAL VOC, AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): PASCAL VOC, AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7
Small drop
Proposals DeepBox Training
9
Slides credit Marc Bolaños
1) Initialize layers with AlexNet weights
2) Train on Sliding Windows
Negative Samples: extract windows by raster scanning
Positive Samples: having GT bounding boxes, they generate samples per instance with a perturbation of …
3) Train on Hard Negatives
By using bottom-up proposals from Edge Boxes
If GT overlap <= 0.3 → Negative Samples
If GT overlap >= 0.7 → Positive Samples
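The overlap rule for the hard-negative stage can be sketched as below. This assumes boxes in (x1, y1, x2, y2) format and that "GT overlap" means intersection over union with the best-matching ground-truth box; the helper names and the "ignored" label for in-between proposals are assumptions for illustration, not part of the slides.

```python
def iou(box_a, box_b):
    """Intersection over union of two non-degenerate boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal by its best overlap with any ground-truth box."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: proposals between the two thresholds are not used


gt_boxes = [(0, 0, 10, 10)]
labels = [label_proposal(p, gt_boxes)
          for p in [(0, 0, 10, 9), (5, 5, 15, 15), (0, 0, 10, 5)]]
```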
Proposals DeepBox Results
10
DeepBox Edge Boxes DeepBox Edge Boxes
Slides credit Marc Bolaños
Proposals DeepBox Results
11
With a rather simple approach ConvNets can obtain much better results than previous techniques for Object Proposals
Slides credit Marc Bolaños
Proposals DeepBox Results
12
Slides credit Marc Bolaños
Proposals DeepBox Results
13
Increasing not only Detection capabilities of known classes but also of unknown ones (suitable for Object Discovery)
Slides credit Marc Bolaños
One lecture organized in four parts
14
Local analysis for: Proposals, Detection, Recognition, Segmentation
Detection Objects
15
Detection Objects
16
DPM (HOG features)[1] R-CNN [2] SPPnet [3]
Hand-crafted features Deep features
+60
Slide credit Amaia Salvador
Detection Objects
17
Girshick Ross Forrest Iandola Trevor Darrell and Jitendra Malik Deformable Part Models are Convolutional Neural Networks CVPR 2015
Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)
Detection Objects R-CNN
18
Girshick R, Donahue J, Darrell T & Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014
Detection Objects R-CNN
19
Slide credit Joost van de Weijer
Detection Objects R-CNN
20
Slide credit Joost van de Weijer
Detection Objects R-CNN
21
Detection Objects Fast R-CNN
22
Girshick Ross Fast R-CNN ICCV 2015
Detection Objects Fast R-CNN
23
Slide credit Amaia Salvador
Detection Objects Fast R-CNN
24
Slide credit Amaia Salvador
Same as SPP[3] but single scale
Detection Objects Fast R-CNN
25
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
CONV5 feature map with an RoI of size h x w
Max pooling with pooling bins of size h/H' x w/W'
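The RoI pooling idea (an h x w region max-pooled into a fixed H' x W' grid) can be sketched as follows. This is a simplified illustration on a single-channel feature map stored as nested lists; the function name `roi_max_pool` and the way bin borders are rounded are assumptions, and real implementations work on multi-channel tensors.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the RoI (top, left, h, w) of a 2-D feature map into an out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i spans rows floor(i*h/out_h) .. ceil((i+1)*h/out_h) inside the RoI.
        r0 = top + (i * h) // out_h
        r1 = top + -(-(i + 1) * h // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -(-(j + 1) * w // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x4 "CONV5" map; the whole map is used as the RoI and pooled to 2x2.
conv5 = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
pooled = roi_max_pool(conv5, roi=(0, 0, 4, 4), out_h=2, out_w=2)
```

Whatever the RoI size, the output always has H' x W' values, which is what lets the fully connected layers accept regions of any shape.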
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
Detection Objects Faster R-CNN
29
Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search: Van de Sande K E, Uijlings J R, Gevers T & Smeulders A W (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp 1879-1886). IEEE
CPMC: Carreira J & Sminchisescu C (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp 3241-3248). IEEE
MCG: Arbeláez P, Pont-Tuset J, Barron J, Marques F & Malik J (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding Box Regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
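A minimal sketch of how k = 9 anchor boxes can be generated at one sliding-window position. The concrete scale values (128, 256, 512 pixels) and ratios (0.5, 1, 2) are assumptions for illustration, since the slide only states the counts; `make_anchors` is a hypothetical helper.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # r is height / width; the anchor area stays s * s for every ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=300, cy=200)
```

For each of these k anchors the RPN predicts an objectness score and a box regression.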
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
4-step training to share features for RPN and Fast R-CNN
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model
Diagram: Conv layers (ImageNet weights, fine-tuned), Conv Layer 5, RPN, RPN Proposals
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Step 2: Train Fast R-CNN with learned RPN proposals
Diagram: Conv layers (ImageNet weights, fine-tuned), Conv Layer 5, RPN Proposals (learned in 1), Class probabilities
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Step 3: The model trained in 2 is used to initialize RPN and train again
Diagram: Conv layers (weights from Step 2, fixed), Conv Layer 5, RPN, RPN Proposals
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in 3
Diagram: Conv layers (weights from Steps 2 & 3, fixed), Conv Layer 5, RPN Proposals (learned in 3), Class probabilities
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Detection Objects Reinforcement L
46
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Detection Objects Reinforcement L
47
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
48
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
49
Slide credit Míriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Detection Objects Reinforcement
50
Slide credit Míriam Bellver
Set of states S
(o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
Detection Objects Reinforcement
51
Slide credit Míriam Bellver
Reward Function R
ground-truth bounding box
Detection Objects Reinforcement
52
Slide credit Míriam Bellver
Reward Function R for trigger action
The reward function considers the number of steps as a cost
3
minimum IoU 0.6
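A minimal sketch of the two reward functions. It reads the "3" on the slide as the magnitude of the trigger reward and the 0.6 as the minimum IoU with the ground-truth box, which is an interpretation of the slide rather than something it states; the function names and the +1/-1 reward for transformation actions are likewise assumptions.

```python
def move_reward(iou_before, iou_after):
    """Reward for a transformation action: +1 if the IoU improved, -1 otherwise."""
    return 1 if iou_after > iou_before else -1


def trigger_reward(iou_final, eta=3.0, min_iou=0.6):
    """Reward for the trigger action: +eta if the final box is good enough, else -eta."""
    return eta if iou_final >= min_iou else -eta


# A toy episode: two moves that improve the box, one that hurts, then the trigger.
ious = [0.20, 0.35, 0.55, 0.50]
rewards = [move_reward(a, b) for a, b in zip(ious, ious[1:])]
rewards.append(trigger_reward(ious[-1]))
```

In this toy episode the agent triggers too early (IoU 0.50 is below the minimum), so the last reward is negative.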
Detection Objects Reinforcement
53
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
54
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
Detection Objects Reinforcement
55
Slide credit Míriam Bellver
Detection Objects Reinforcement
56
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
57
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement
58
Slide credit Míriam Bellver
Detection Objects Reinforcement
59
Slide credit Míriam Bellver
Detection Faces
60
Detection Faces DDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Detection Faces DDFD Train
62
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
25k annotated faces on images downloaded from Flickr
380k manually annotated facial landmarks
Detection Faces DDFD Train
63
Randomly sampled sub-windows (blocks)
Positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative example otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
64
Test images are rescaled up/down 3 times per octave to find faces of different sizes
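"3 times per octave" means that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch, where the scale range 0.5 to 2.0 and the helper `pyramid_scales` are assumptions for illustration:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, `steps_per_octave` per factor of 2."""
    factor = 2 ** (1 / steps_per_octave)
    scales = []
    k = 0
    # Small tolerance so that floating-point rounding does not drop the last scale.
    while min_scale * factor ** k <= max_scale * (1 + 1e-9):
        scales.append(min_scale * factor ** k)
        k += 1
    return scales


scales = pyramid_scales(0.5, 2.0)
```

The detector is then run on each rescaled copy, so a fixed-size window can match faces of different sizes.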
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Detection Faces DDFD Test
67
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
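A minimal sketch of greedy non-maximum suppression, assuming boxes in (x1, y1, x2, y2) format with one score each. The 0.5 overlap threshold and the function names are assumptions for illustration; this is not the implementation from the cited post.

```python
def iou(box_a, box_b):
    """Intersection over union of two non-degenerate boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the boxes kept, visiting them from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with an already kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first one heavily and has a lower score, so only the first and third detections survive.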
Detection Faces DDFD Results
69
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
71
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
72
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Faces Recognition FaceNet
73
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
74
End-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Faces Recognition FaceNet
75
by means of well chosen triplets using curriculum learning
Bengio Yoshua Jérôme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
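To make the role of triplets concrete, here is a minimal sketch of a triplet loss on toy 2-D embeddings: the anchor should be closer to a face of the same person (positive) than to a face of another person (negative) by some margin. The margin value 0.2, the toy vectors and the function names are assumptions for illustration; the slides do not give the formula.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least `margin`."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


anchor, positive = (0.0, 0.0), (0.1, 0.0)
easy_negative = (1.0, 0.0)   # far from the anchor: contributes no loss
hard_negative = (0.3, 0.0)   # close to the anchor: contributes a positive loss
easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Easy triplets give zero loss and no gradient, which is why the choice of triplets matters during training.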
Faces Recognition FaceNet
76
Faces Recognition FaceNet
77
Architecture 1 (NN1): ZF
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer Vision - ECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
78
Architecture 2 (NN2): GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet
79
Faces Recognition FaceNet Test
80
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
81
Software implementation: OpenFace
Faces Recognition VGG Face
82
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Objects Recognition Retrieval
83
Mohedano E, Salvador A, McGuinness K, Giró-i-Nieto X, O'Connor N and Marqués F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016
Objects Recognition Retrieval
84
Image Database
Visual Query
“A dog”
Expected outcome
Objects Recognition Retrieval
85
Image Database
Visual Query
“This dog”
Expected outcome
Objects Recognition Retrieval
86
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
87
Local hand-crafted features (e.g. SIFT)
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word Image ID 1 1 12 2 1 30 1023 10 124 23 6 10
Objects Recognition Retrieval
88
Krizhevsky A, Sutskever I & Hinton G E (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Objects Recognition Retrieval
89
Babenko A, Slesarev A, Chigorin A & Lempitsky V (2014). Neural codes for image retrieval. In ECCV 2014
Razavian A, Azizpour H, Sullivan J & Carlsson S (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Objects Recognition Retrieval
90
Babenko A & Lempitsky V (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias G, Sicre R & Jégou H (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis Y, Mellina C & Osindero S (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks
sum/max pooled conv features as global representation
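A minimal sketch of sum and max pooling of convolutional features into a global descriptor: each of the C feature maps is collapsed to a single number, giving a C-dimensional vector per image. The toy feature maps and the function name `pool_conv_features` are made up for illustration; the cited methods add normalisation and weighting steps that are not shown.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps (each H x W, as nested lists) into a C-dim descriptor."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in fmap for value in row)
            for fmap in feature_maps]


# Two toy 2x2 feature maps (C = 2).
feature_maps = [[[1, 2], [3, 4]],
                [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
```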
Objects Recognition Retrieval
91
Ng J, Yang F & Davis L (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Objects Recognition Retrieval
92
Objects Recognition Retrieval
93
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids → 25K-D vector
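A minimal sketch of the bag-of-words encoding of local convolutional features: every spatial location of the feature map gives one local feature, each feature is assigned to its nearest centroid (visual word), and the image is described by the histogram of words. The 3 toy centroids stand in for the 25K of the slide, and all names and values are assumptions for illustration.

```python
def assign_words(local_features, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each feature."""
    words = []
    for feature in local_features:
        dists = [sum((a - b) ** 2 for a, b in zip(feature, c)) for c in centroids]
        words.append(dists.index(min(dists)))
    return words


def bow_histogram(words, vocab_size):
    """Count how many local features fall on each visual word."""
    hist = [0] * vocab_size
    for w in words:
        hist[w] += 1
    return hist


centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                 # toy vocabulary
local_features = [(0.1, 0.0), (0.9, 0.1), (0.8, 0.0), (0.1, 0.9)]  # one per location
words = assign_words(local_features, centroids)
histogram = bow_histogram(words, vocab_size=len(centroids))
```

With a large vocabulary the histogram is very sparse, which is what makes an inverted file practical.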
Objects Recognition Retrieval
94
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
95
One lecture organized in four parts
96
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
101
Upsampling and concatenation
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
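A minimal sketch of a pixel-wise soft-max classifier: each pixel has one score per class, the soft-max turns the scores into probabilities, and the pixel takes the most probable class. The toy score map and the function names are assumptions for illustration; in the paper the scores come from the multiscale convnet features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Assign to every pixel the class with the highest soft-max probability."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


# Toy 1x3 image with 3 classes per pixel.
score_map = [[(2.0, 0.5, 0.1), (0.0, 3.0, 1.0), (0.2, 0.1, 0.9)]]
labels = label_pixels(score_map)
```

Each pixel is labelled independently of its neighbours, so nothing in this step enforces spatial consistency.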
Objects Segmentation Farabet
103
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Objects Segmentation Farabet
106
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
107
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Objects Segmentation SDS
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2. The two are concatenated as [1 2].
Training: the 2 networks are trained in isolation.
Testing: the results are combined.
Objects Segmentation SDS
Vector C:
Training: as a whole (using segmentation overlap).
Testing: the results are combined (using the output of the penultimate layer).
The output of the penultimate fully connected layer is fed to an SVM.
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation.
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
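As a reminder of the metric, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch, assuming label maps stored as nested lists of integer category IDs (the function name and the toy maps are illustrative):

```python
def pixel_iu(pred, gt, category):
    """Intersection over union of the pixels labelled `category` in pred and gt."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt
            union += in_pred or in_gt
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")


# Toy 2x3 label maps (0 = background, 1 = object).
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
iu_object = pixel_iu(pred, gt, category=1)
```

Here two pixels are shared and four are covered by either map, so the IU of category 1 is 0.5.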
Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Proposals Hand-crafted
Slide credit: Marc Bolaños
Object proposals used to be hand-crafted, bottom-up methods such as Selective Search (SS) and Multiscale Combinatorial Grouping (MCG).
[SS] Uijlings, Jasper R. R., Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision 104, no. 2 (2013): 154-171.
[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. CVPR 2014.
Proposals DeepBox
Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. DeepBox: Learning objectness with convolutional networks. ICCV 2015.
Slide credit: Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals.
2) Rerank them (and possibly discard them) by using DeepBox.
Proposals DeepBox Architecture
Slide credit: Marc Bolaños
AlexNet architecture (heavier): PASCAL VOC AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): PASCAL VOC AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7
The lighter architecture causes only a small drop.
Proposals DeepBox Training
Slide credit: Marc Bolaños
1) Initialize layers with AlexNet weights.
2) Train on sliding windows.
- Negative samples: extract windows by raster scanning.
- Positive samples: given the GT bounding boxes, generate samples per instance by perturbing them.
3) Train on hard negatives, using bottom-up proposals from Edge Boxes.
- If GT overlap <= 0.3 → negative sample
- If GT overlap >= 0.7 → positive sample
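The overlap rule in the training recipe can be sketched as follows. The 0.3 and 0.7 thresholds come from the slide. The box format (x1, y1, x2, y2), the function names and the "ignored" label for proposals that fall between the two thresholds are assumptions made for this example.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposal(box, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal by its best overlap with any ground-truth box."""
    best = max((box_iou(box, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between proposals are not used for training


gt_boxes = [(0, 0, 10, 10)]
labels = [label_proposal(b, gt_boxes)
          for b in [(0, 0, 10, 9), (8, 8, 18, 18), (0, 0, 10, 5)]]
```

The three example proposals overlap the ground truth by 0.9, about 0.02 and 0.5 respectively.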
Proposals DeepBox Results
Slide credit: Marc Bolaños
[Figure: proposals from DeepBox compared with proposals from Edge Boxes]
With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.
DeepBox improves the detection not only of known classes but also of unknown ones, which makes it suitable for object discovery.
Detection Objects
Slide credit: Amaia Salvador
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
[Figure: comparison of the three methods, annotated "+60"]
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015.
Slide credit: Amaia Salvador
Same as SPP [3], but single scale.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
[Figure: an h x w region on the CONV5 feature map is max pooled into a fixed grid of H' x W' bins]
Size of pooling bins: h/H' x w/W'
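A minimal sketch of the pooling step above, for a single-channel feature map stored as nested lists. The bin boundaries use floor and ceil so that every bin covers roughly h/H' x w/W' cells and is never empty. The function name, the (top, left, h, w) region format and the toy feature map are assumptions of this example.

```python
import math


def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the region roi = (top, left, h, w) of fmap into out_h x out_w bins."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            # Each bin keeps only its strongest activation.
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map whose value at (r, c) is r * 6 + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
pooled = roi_max_pool(fmap, roi=(0, 0, 4, 6), out_h=2, out_w=2)
```

Whatever the size of the input region, the output always has out_h x out_w values, which is what lets the fully connected layers that follow take regions of any size.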
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Architectures: AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99).
Slide credit: Amaia Salvador
Object proposal computation is the bottleneck in current state-of-the-art object detection systems.
External object proposal methods: Selective Search, CPMC, MCG.
[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In IEEE International Conference on Computer Vision (ICCV) 2011 (pp. 1879-1886).
[CPMC] Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2010 (pp. 3241-3248).
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Faster R-CNN replaces the external object proposals with a Region Proposal Network (RPN).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: the conv layers end in Conv Layer 5; the RPN generates RPN proposals from it; an RoI pooling layer and FC layers turn each proposal into class scores and class probabilities]
The RPN outputs, for each of k reference boxes:
- Objectness scores (object / no object)
- Bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios).
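To make k = 9 concrete, the sketch below builds the 3 x 3 reference boxes around one location. The scale values (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumptions taken from commonly used Faster R-CNN settings, not from the slide, and the function name is illustrative.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width
            # Keep the area close to scale**2 while changing the shape.
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=300, cy=200)
```

Each location of the feature map then gets k objectness predictions and k sets of box-regression offsets, one per reference box.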
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
The stage that classifies the RPN proposals (RoI pooling layer, FC layers, class scores) is Fast R-CNN.
4-step training to share features between the RPN and Fast R-CNN:
Step 1: Train the RPN, initialized with an ImageNet pre-trained model (ImageNet weights, fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1 (ImageNet weights, fine-tuned).
Step 3: Use the model trained in Step 2 to initialize the RPN and train it again (weights from Step 2, fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3 (weights from Steps 2 & 3, fixed).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6.
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Set of states S: each state is a pair (o, h)
- o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)
- h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward function R: defined with respect to the ground-truth bounding box.
Reward function R for the trigger action: reward of magnitude 3, with a minimum IoU of 0.6.
The reward function considers the number of steps as a cost.
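A minimal sketch of the two rewards. The magnitude 3 and the minimum IoU of 0.6 for the trigger come from the slide. The movement reward (+1 if the IoU with the ground truth improved, -1 otherwise) is an assumption based on the usual description of the cited paper, and the function names are illustrative.

```python
def move_reward(iou_before, iou_after):
    """Reward of a transformation action: did the box get closer to the ground truth?"""
    return 1 if iou_after > iou_before else -1


def trigger_reward(iou, eta=3.0, tau=0.6):
    """Reward of the trigger action: +eta if the final IoU reaches tau, -eta otherwise."""
    return eta if iou >= tau else -eta


# A short, made-up episode: two moves that improve the IoU, then the trigger.
episode = [move_reward(0.20, 0.45), move_reward(0.45, 0.65), trigger_reward(0.65)]
total = sum(episode)
```

Since every wasted move earns -1, an agent that maximizes the total reward is pushed towards short search sequences.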
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
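The pieces named above can be sketched without any deep learning library: a greedy policy over the Q-network outputs, the standard Q-learning target, and a bounded replay memory. The discount factor, the memory size and the tuple layout are assumptions of this example, and the Q-values are plain lists standing in for the network outputs.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Policy: index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)


# Replay memory: keeps the most recent experiences, sampled at random for training.
memory = deque(maxlen=3)
for step in range(5):
    # (state, action, reward, next_state, terminal) with placeholder states
    memory.append((f"s{step}", 0, 1.0, f"s{step + 1}", False))
batch = random.sample(list(memory), 2)

action = greedy_action([0.1, 0.7, 0.3])
target = q_target(1.0, [0.2, 0.5])
```

During training, each sampled experience would be used to move the network output for the taken action towards its target.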
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Dataset for training and testing: PASCAL VOC.
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015.
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Sub-windows (blocks) are sampled randomly. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.
Total samples: 200K positive and 20M negative.
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
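Three rescalings per octave means that consecutive scales differ by a factor of 2^(1/3), so the image size doubles every three steps. A minimal sketch that lists the scale factors, assuming a hypothetical range from half to twice the original size:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    factor = 2 ** (1 / steps_per_octave)
    scales, k = [], 0
    scale = min_scale
    while scale <= max_scale * (1 + 1e-9):  # small tolerance for rounding errors
        scales.append(scale)
        k += 1
        scale = min_scale * factor ** k
    return scales


scales = pyramid_scales(0.5, 2.0)
```

The fixed-size detector is then run on every rescaled copy, so small faces are found in the enlarged copies and large faces in the reduced ones.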
Detection Faces DDFD Test
Sliding window of 227x227 over the test image.
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
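The conversion can be illustrated on a toy case: a fully connected unit defined on a k x k input is the same computation as a k x k convolution filter, so sliding it over a larger image gives one score per window, that is, a heat-map. The single-channel setting, the weights and the image below are made up for the example.

```python
def fc(patch, weights, bias):
    """Fully connected unit applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, k):
    """The same unit applied as a k x k filter (stride 1) over an image of any size."""
    height, width = len(image), len(image[0])
    return [[fc([row[x:x + k] for row in image[y:y + k]], weights, bias)
             for x in range(width - k + 1)]
            for y in range(height - k + 1)]


weights, bias, k = [1.0, -1.0, 0.5, 2.0], 0.1, 2
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9],
         [0, 1, 0]]
heat_map = fc_as_conv(image, weights, bias, k)
```

Each entry of the heat-map equals what the original unit would return on the corresponding crop. A real implementation shares the computation of the earlier layers between overlapping windows, which is where the efficiency comes from.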
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
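A minimal sketch of greedy NMS: detections are visited from the highest to the lowest score, and a detection is kept only if it does not overlap too much with one that was already kept. The 0.5 overlap threshold, the (x1, y1, x2, y2) box format and the toy detections are assumptions of this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the detections kept, in decreasing order of score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Drop box i if it overlaps a better-scored box that was already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first one with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives.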
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training.
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
by means of well chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
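A common way to write a loss over triplets (anchor, positive, negative) asks the anchor to be closer to a face of the same identity than to a face of a different identity, by at least a margin, in squared Euclidean distance. The sketch below follows that formulation. The margin value of 0.2 and the two-dimensional toy embeddings are assumptions of this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Easy triplet: the negative is already far away, so it gives no training signal.
easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 0.0))
# Harder triplet: positive and negative are equally far, so the margin is violated.
hard = triplet_loss((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

Easy triplets contribute zero loss, which is why the choice of triplets matters for training.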
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision, ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6.
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
[Figure: the visual query "A dog" against an image database, with its expected outcome]
[Figure: the visual query "This dog" against an image database, with its expected outcome]
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v_1 = (v_11, …, v_1n), …, v_k = (v_k1, …, v_kn), in an N-dimensional feature space.
Bag of Visual Words: high-dimensional, highly sparse.
Inverted file: [Table: for each visual word, the IDs of the images that contain it]
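Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images in which it occurs. A minimal sketch with made-up image IDs and word IDs (the ranking by number of shared words is a simplification chosen for the example):

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidates(index, query_words):
    """Rank the images that share words with the query by the number of shared words."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2, 3], 12: [1, 3], 30: [2], 6: [4]}
index = build_inverted_file(image_words)
ranked = candidates(index, query_words=[1, 3])
```

Only the images that share at least one word with the query are touched, which is what makes the search scale to large databases.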
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
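Sum or max pooling reduces each channel of a convolutional feature map to a single number, so an image is described by a vector with one entry per channel. A minimal sketch on a toy two-channel map. The fmap[channel][y][x] layout and the L2 normalization are assumptions of this example.

```python
def global_pool(fmap, mode="sum"):
    """Collapse fmap[channel][y][x] to one value per channel by sum or max pooling."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in channel for v in row) for channel in fmap]


def l2_normalize(vec):
    """Scale a vector to unit length so that images can be compared by dot product."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else list(vec)


# Toy feature map with 2 channels of 2x2 activations.
fmap = [[[1, 2], [3, 4]],
        [[0, 5], [1, 0]]]
sum_descriptor = global_pool(fmap, "sum")
max_descriptor = global_pool(fmap, "max")
unit = l2_normalize([3.0, 4.0])
```

The descriptor length depends only on the number of channels, not on the image resolution.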
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
[Figure: an image at a resolution of 336x256 is passed through VGG16 [1] up to conv5_1, which gives a 42x32 map of local features; with 25K centroids, the image is encoded as a 25K-D vector]
Query representation:
- Global Search (GS)
- Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales.
The same parameters are used in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l).
Non-linearity: tanh. Pooling: max.
Upsampling and concatenation.
Pixel-wise soft-max classifier.
Objects Segmentation Farabet
Problem: no spatial consistency among labels.
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level.
BPT [Garrido, Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Network A (vector A): a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2].
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.
Network B (vector B): a BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2; both are concatenated as [1 2].
Training: the 2 networks are trained in isolation.
Testing: results are combined.
Network C (vector C):
Training: as a whole (using segmentation overlap).
Testing: results are combined (using the output of the penultimate layer).
The output of the penultimate fully connected layer is fed to an SVM.
Results on pixel IU (Jaccard index) to evaluate semantic segmentation:
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
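As a reminder of what the pixel IU (Jaccard index) measures, here is a minimal sketch for one class on small label maps. The function name `pixel_iu` and the toy label maps are illustrative assumptions, not taken from the SDS evaluation code.

```python
def pixel_iu(pred, gt, cls):
    """Intersection over union of the pixels labelled `cls` in two label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == cls, g == cls
            inter += in_pred and in_gt   # pixel counted in both maps
            union += in_pred or in_gt    # pixel counted in at least one map
    return inter / union if union else float("nan")

# Toy 2x3 label maps with classes 0 (background) and 1 (object).
pred = [[0, 1, 1],
        [0, 1, 1]]
gt = [[0, 0, 1],
      [0, 1, 1]]

iu_object = pixel_iu(pred, gt, 1)       # 3 shared pixels / 4 pixels in the union
iu_background = pixel_iu(pred, gt, 0)   # 2 shared pixels / 3 pixels in the union
```

A benchmark score would typically average this value over all classes.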
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Proposals Hand-crafted
Slides credit: Marc Bolaños
Hand-crafted proposals used to be based on bottom-up grouping: Selective Search (SS) and Multiscale Combinatorial Grouping (MCG).
[SS] Uijlings, Jasper R. R., Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision 104, no. 2 (2013): 154-171.
[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. CVPR 2014.
Proposals DeepBox
Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. DeepBox: Learning objectness with convolutional networks. ICCV 2015.
Slides credit: Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals.
2) Rerank them (and possibly discard them) by using DeepBox.
Proposals DeepBox Architecture
AlexNet architecture (heavier), PASCAL VOC: AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter), PASCAL VOC: AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7
Small drop with the lighter architecture.
Proposals DeepBox Training
1) Initialize layers with AlexNet weights.
2) Train on sliding windows.
   Negative samples: extract windows by raster scanning.
   Positive samples: having GT bounding boxes, they generate samples per instance by perturbing the GT box.
3) Train on hard negatives, by using bottom-up proposals from Edge Boxes.
   If overlap with GT <= 0.3 → negative sample.
   If overlap with GT >= 0.7 → positive sample.
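The overlap rule for the hard-negative stage can be sketched as follows. This is a sketch under assumptions: boxes are `(x1, y1, x2, y2)` tuples, the helper names are hypothetical, and treating proposals between the two thresholds as "ignored" is an assumption that the slide does not state.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(box, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal by its best overlap with any ground-truth box."""
    best = max((iou(box, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between proposals are not used

gt_boxes = [(0, 0, 10, 10)]
labels = [label_proposal(b, gt_boxes)
          for b in [(0, 0, 10, 9), (8, 8, 20, 20), (0, 0, 10, 5)]]
```

The three example proposals overlap the ground truth with IoU 0.9, about 0.02 and 0.5 respectively.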
Proposals DeepBox Results
[Figure: proposals from DeepBox compared with Edge Boxes]
With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.
DeepBox increases the detection capabilities not only for known classes but also for unknown ones (suitable for object discovery).
Detection Objects
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
[Figure annotation: +60]
Slide credit: Amaia Salvador
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015.
Slide credit: Amaia Salvador
RoI pooling: same as SPP [3], but single scale.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit: Joost van de Weijer
[Figure: RoI pooling on the CONV5 feature map. A region of interest of size h x w is divided into an H' x W' grid and each bin is max-pooled.]
Size of pooling bins: h/H' x w/W'
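A minimal single-channel sketch of RoI max pooling may help here: an h x w region of a feature map is split into an H' x W' grid and each bin is max-pooled. The function names, the `(y0, x0, h, w)` RoI layout and the floor/ceil bin boundaries are assumptions made for the illustration, not the Fast R-CNN implementation.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the RoI (y0, x0, h, w) of a 2D feature map into out_h x out_w bins."""
    y0, x0, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [floor(i*h/out_h), ceil((i+1)*h/out_h)) of the RoI.
        ys, ye = y0 + (i * h) // out_h, y0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            xs, xe = x0 + (j * w) // out_w, x0 + ceil_div((j + 1) * w, out_w)
            row.append(max(fmap[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# Toy 4x6 feature map whose value at (r, c) is r * 6 + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
pooled = roi_max_pool(fmap, (0, 0, 4, 6), 2, 2)
```

Whatever the size of the RoI, the output always has out_h x out_w values, which is what lets the FC layers that follow accept regions of any shape.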
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99).
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).
[Figure: Faster R-CNN pipeline. The conv layers (up to Conv Layer 5) feed the RPN, which outputs RPN proposals. The RPN proposals go through the RoI pooling layer and the FC layers to produce class scores (class probabilities).]
For each of the k anchor boxes at a sliding-window position, the RPN predicts:
- Objectness scores (object / no object)
- Bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios).
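To make k = 9 concrete, the sketch below enumerates 3 scales x 3 aspect ratios around one position. The default scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are the ones reported in the Faster R-CNN paper rather than on the slide; the function name and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return one (x1, y1, x2, y2) anchor per (scale, ratio) pair, centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width
            # Width and height are chosen so that the anchor area stays s * s.
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0.0, 0.0)
```

Each anchor then receives its own objectness score and its own box regression offsets.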
The detection stage after the RPN is a Fast R-CNN.
4-step training to share features for RPN and Fast R-CNN:
Step 1: Train the RPN, initialized with an ImageNet pre-trained model (ImageNet weights, fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1 (ImageNet weights, fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again (weights from Step 2, fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3 and the RPN proposals learned in Step 3 (weights from Steps 2 & 3, fixed).
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]
Detection Objects Reinforcement
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. (Slides by Míriam Bellver)
The object is localized based on visual features from AlexNet FC6.
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR).
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Reward function R: measured against the ground-truth bounding box.
Reward function R for the trigger action: magnitude 3, with a minimum IoU of 0.6.
The reward function considers the number of steps as a cost.
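The equations on these slides were lost in extraction, so the sketch below follows the formulation in the cited Caicedo and Lazebnik paper as an assumption: a transformation earns +1 if it improves the IoU with the ground truth and -1 otherwise, and the trigger earns +3 if the final IoU reaches 0.6 and -3 otherwise. The function names are hypothetical.

```python
def step_reward(iou_before, iou_after):
    """Reward of a transformation action: the sign of the change in IoU."""
    return 1 if iou_after > iou_before else -1

def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """Reward of the trigger action: +eta if the box is good enough, -eta otherwise."""
    return eta if final_iou >= tau else -eta

# A short hypothetical episode: two improving moves, one bad move, then the trigger.
ious = [0.20, 0.35, 0.65, 0.62]
rewards = [step_reward(a, b) for a, b in zip(ious, ious[1:])]
rewards.append(trigger_reward(ious[-1]))
```

Because every extra move before the trigger risks a -1, the agent is pushed to localize the object in few steps.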
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and the Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
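The pieces named above (greedy policy, replay memory, Q-learning target) can be sketched without any deep learning library. The Q-network is replaced here by plain lists of Q-values; the discount factor, the memory size, the epsilon-greedy exploration and all names are assumptions made for illustration.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Policy of the agent: index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=q_values.__getitem__)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise follow the greedy policy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

def q_target(reward, next_q_values, done, gamma=0.9):
    """Q-learning regression target for one stored experience."""
    return reward if done else reward + gamma * max(next_q_values)

# Replay memory: a bounded buffer of (state, action, reward, next_state, done).
replay = deque(maxlen=1000)
replay.append(("s0", 1, 1.0, "s1", False))
replay.append(("s1", 2, 3.0, None, True))
batch = random.sample(list(replay), k=2)  # mini-batch drawn from past experience
```

In a full system the targets from `q_target` would be used to regress the output units of the Q-network, one output per action.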
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015.
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A block is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative.
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
A sliding window of 227x227 runs over the test image.
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
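The equivalence between a fully-connected layer and a convolution can be checked on a toy single-channel case: an FC unit that expects a k x k input is a dot product, and sliding that same dot product over a larger image yields a heat-map. The names and the tiny 2x2 weights are illustrative assumptions.

```python
def fc_score(window, weights, bias):
    """Output of one fully-connected unit applied to a k x k window."""
    return bias + sum(w * x
                      for w_row, x_row in zip(weights, window)
                      for w, x in zip(w_row, x_row))

def fc_as_conv(image, weights, bias, stride=1):
    """Apply the same unit as a k x k convolution over an image of any size."""
    k = len(weights)
    rows, cols = len(image), len(image[0])
    heatmap = []
    for y in range(0, rows - k + 1, stride):
        heatmap.append([
            fc_score([r[x:x + k] for r in image[y:y + k]], weights, bias)
            for x in range(0, cols - k + 1, stride)
        ])
    return heatmap

weights = [[1, 0],
           [0, 1]]
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
heatmap = fc_as_conv(image, weights, bias=0)        # 2x2 map of scores
single = fc_as_conv([[1, 2], [4, 5]], weights, 0)   # k x k input: one score
```

On an input of exactly k x k the converted layer returns the single score the FC unit would give; on larger inputs it returns one score per window position.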
Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
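A plain greedy NMS can be sketched as below: detections are visited by decreasing score and a detection is kept only if it does not overlap too much with one already kept. This is a generic IoU-based variant, not the code from the cited post; the 0.5 threshold and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the detections kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps every kept box by at most iou_thr.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with IoU of about 0.68, so only the first and third detections survive.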
Detection Faces DDFD Results
Precision vs recall curves:
- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning):
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
The embedding is learned by means of well-chosen triplets, using curriculum learning:
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
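The triplet idea can be written down compactly: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below uses squared Euclidean distances and a margin of 0.2, the value reported in the FaceNet paper; the function names and the 2-D toy embeddings are assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

# Easy triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss((0.0, 0.0), (0.1, 0.0), (1.0, 0.0))
# Hard triplet: the negative is closer than the positive, so the loss is positive.
hard = triplet_loss((0.0, 0.0), (1.0, 0.0), (0.5, 0.0))
```

Only triplets with a non-zero loss contribute gradients, which is why the choice of triplets matters so much during training.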
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6.
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.
Visual query “A dog” on an image database: the expected outcome is images of any dog.
Visual query “This dog” on an image database: the expected outcome is images of that specific dog.
Instance retrieval (instance: object, building, person, place…)
Local hand-crafted features (e.g. SIFT): v_1 = (v_11, …, v_1n), …, v_k = (v_k1, …, v_kn), in an N-dimensional feature space.
Bag of Visual Words: high-dimensional, highly sparse.
Inverted file: a table from each visual word to the IDs of the images that contain it.
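A toy version of the bag-of-visual-words pipeline with an inverted file could look as follows. Everything here is a sketch under assumptions: descriptors are 2-D tuples instead of SIFT vectors, the codebook has three hand-picked centroids, and ranking by the number of shared words is a simplification of real scoring schemes.

```python
from collections import Counter, defaultdict

def nearest_word(desc, centroids):
    """Index of the closest centroid (visual word) for one local descriptor."""
    return min(range(len(centroids)),
               key=lambda k: sum((d - c) ** 2 for d, c in zip(desc, centroids[k])))

def build_inverted_file(images, centroids):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, descriptors in images.items():
        for desc in descriptors:
            inverted[nearest_word(desc, centroids)].add(image_id)
    return inverted

def search(query_descriptors, centroids, inverted):
    """Rank images by the number of visual words shared with the query."""
    votes = Counter()
    for word in {nearest_word(d, centroids) for d in query_descriptors}:
        votes.update(inverted.get(word, ()))
    return votes.most_common()

centroids = [(0, 0), (10, 10), (0, 10)]          # hypothetical codebook
images = {1: [(0, 1), (9, 9)], 2: [(1, 9)], 3: [(10, 11), (0, 9)]}
inverted = build_inverted_file(images, centroids)
ranking = search([(10, 10), (0, 10)], centroids, inverted)
```

Because only the entries of the query's words are visited, the search cost depends on how sparse the representation is rather than on the size of the database.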
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
FC layers as global feature representation:
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Sum/max pooled conv features as global representation:
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Conv features encoded with VLAD as global representation:
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
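Sum or max pooling of convolutional features reduces a C x H x W activation volume to one value per channel. The sketch below shows both variants on nested lists; the final L2 normalisation is an assumption (a common choice for retrieval descriptors), and the function name and toy activations are hypothetical.

```python
def pool_conv_features(fmap, mode="sum"):
    """fmap: list of C channels, each an H x W list. Returns a C-dim global descriptor."""
    aggregate = sum if mode == "sum" else max
    vec = [aggregate(v for row in channel for v in row) for channel in fmap]
    norm = sum(v * v for v in vec) ** 0.5
    # L2-normalise so that descriptors can be compared with a dot product.
    return [v / norm for v in vec] if norm else vec

# Toy activation volume with 2 channels of size 2x2.
fmap = [[[1, 2], [0, 0]],
        [[0, 4], [0, 0]]]
sum_desc = pool_conv_features(fmap, "sum")   # from the channel sums [3, 4]
max_desc = pool_conv_features(fmap, "max")   # from the channel maxima [2, 4]
```

Both descriptors have a fixed length equal to the number of channels, whatever the input resolution.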
Bags of local convolutional features: an image at 336x256 resolution gives a 42x32 feature map at conv5_1 from VGG16 [1]. With 25K centroids, the image is represented by a 25K-D vector.
Query representation: Global Search (GS) and Local Search (LS).
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Pyramid of three spatial scales.
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l).
Non-linearity: tanh. Pooling: max.
Upsampling and concatenation.
Pixel-wise soft-max classifier.
Problem: no spatial consistency among labels.
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Proposals DeepBox
Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. DeepBox: Learning objectness with convolutional networks. ICCV 2015. [software]
Proposals DeepBox
Slides credit: Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals.
2) Rerank them (and possibly discard some) using DeepBox.
Proposals DeepBox Architecture
Slides credit: Marc Bolaños
AlexNet architecture (heavier), PASCAL VOC: AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter), PASCAL VOC: AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7
Small drop
Proposals DeepBox Training
Slides credit: Marc Bolaños
1) Initialize layers with AlexNet weights.
2) Train on sliding windows.
Negative samples: extract windows by raster scanning.
Positive samples: given the GT bounding boxes, generate perturbed samples per instance.
3) Train on hard negatives, using bottom-up proposals from Edge Boxes.
If overlap with GT <= 0.3 → negative sample.
If overlap with GT >= 0.7 → positive sample.
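
The overlap rule above can be sketched in a few lines. This is a minimal illustration, not the DeepBox code: the box format (x1, y1, x2, y2) and the names `iou` and `label_proposal` are assumptions made here, and only the 0.3 / 0.7 thresholds come from the slide.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label one proposal by its best overlap with any ground-truth box.

    Returns "positive", "negative", or None when the overlap falls between
    the two thresholds (such proposals are simply not used in this sketch).
    """
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return None
```

Proposals whose best overlap lies strictly between the two thresholds are left unlabeled here; how they are treated in the original training procedure is not stated on the slide.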
Proposals DeepBox Results
Slides credit: Marc Bolaños
Figure: side-by-side panels, DeepBox vs Edge Boxes.
Proposals DeepBox Results
Slides credit: Marc Bolaños
With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.
Proposals DeepBox Results
Slides credit: Marc Bolaños
DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for object discovery).
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Detection Objects
Slide credit: Amaia Salvador
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
+60
Detection Objects
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Parts-based Models (DPMs).
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Detection Objects R-CNN
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015.
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Same as SPP [3] but single scale
Detection Objects Fast R-CNN
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Figure: max pooling of an h x w region of CONV5 into H' x W' pooling bins.
Size of pooling bins: h/H' x w/W'
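
A minimal sketch of the pooling step described above, for a single-channel feature map stored as a list of lists. The function name `roi_max_pool`, the end-exclusive RoI format (r0, c0, r1, c1) and the floor/ceil rounding of bin borders are assumptions made for this illustration; real implementations work on multi-channel tensors.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (r0, c0, r1, c1), end-exclusive, into out_h x out_w bins.

    Each bin covers roughly h/out_h rows and w/out_w columns of the region,
    where h = r1 - r0 and w = c1 - c0.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Bin i spans rows [floor(i*h/out_h), ceil((i+1)*h/out_h)) of the region.
        rs = r0 + (i * h) // out_h
        re = r0 + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled
```

Whatever the size of the region, the output always has out_h x out_w values, which is what lets regions of different sizes feed the same fully connected layers.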
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Multi-task loss
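
For reference, the multi-task loss as it is defined in the Fast R-CNN paper (transcribed here from the paper, not from the slide): p is the predicted class distribution, u the ground-truth class (0 for background), t^u the predicted box offsets for class u, v the regression targets, and lambda a balancing weight.

```latex
\begin{align}
L(p, u, t^u, v) &= L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^u, v) \\
L_{\mathrm{cls}}(p, u) &= -\log p_u \\
L_{\mathrm{loc}}(t^u, v) &= \sum_{i \in \{x, y, w, h\}} \operatorname{smooth}_{L_1}(t^u_i - v_i) \\
\operatorname{smooth}_{L_1}(x) &=
  \begin{cases}
    0.5\,x^2 & \text{if } |x| < 1 \\
    |x| - 0.5 & \text{otherwise}
  \end{cases}
\end{align}
```

The indicator [u >= 1] switches the localization term off for background regions, which have no ground-truth box.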
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Figure: Faster R-CNN pipeline (conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
RPN outputs: objectness scores (object / no object) and bounding box regression.
In practice k = 9 (3 different scales and 3 aspect ratios).
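
To make the k = 9 count concrete, here is a small sketch that enumerates the reference boxes at one location. The function name `make_anchors` is hypothetical, and the default scale and ratio values are assumptions taken from my reading of the Faster R-CNN paper; the slide only states 3 scales and 3 aspect ratios.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box has area scale**2 and aspect ratio r = height / width.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

For every one of these boxes the RPN predicts an objectness score pair and four regression offsets.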
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Figure: Faster R-CNN pipeline (conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model.
Conv layers: ImageNet weights (fine-tuned). Output: RPN proposals.
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1.
Conv layers: ImageNet weights (fine-tuned). Output: class probabilities.
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again.
Conv layers: weights from Step 2 (fixed). Output: RPN proposals.
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3.
Conv layers: weights from Steps 2 & 3 (fixed). Input: RPN proposals learned in Step 3. Output: class probabilities.
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A: transformation actions
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A: trigger
Terminates the sequence of the current search
Marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R (figure labels: ground truth, bounding box)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R for the trigger action (figure values: 3, minimum IoU 0.6)
The reward function considers the number of steps as a cost
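
The formulas themselves did not survive in the transcript. The sketch below follows my reading of the Caicedo and Lazebnik paper and should be taken as an assumption: a transformation action is rewarded with the sign of the change in IoU with the ground-truth box, and the trigger is rewarded with +3 when the final IoU reaches the minimum of 0.6 and with -3 otherwise. All function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation improved the IoU, -1 if it got worse, 0 if unchanged."""
    diff = iou(new_box, gt_box) - iou(old_box, gt_box)
    return (diff > 0) - (diff < 0)


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps the ground truth enough, -eta otherwise."""
    return eta if iou(box, gt_box) >= min_iou else -eta
```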
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Learned with reinforcement learning, using Q-learning.
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions.
The algorithm incorporates a replay memory to collect experiences.
The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
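
A minimal sketch of the three ingredients named above: action selection from the Q-network outputs, the one-step Q-learning target, and a replay memory. The names, the discount `gamma=0.9`, the memory size and the epsilon-greedy exploration option are assumptions for illustration, not values taken from the slides.

```python
import random
from collections import deque


def select_action(q_values, epsilon=0.0, rng=random):
    """Pick the action with the highest estimated value.

    q_values holds one network output per action. With epsilon > 0 a random
    action is taken with that probability (common during training).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Replay memory: a bounded buffer of (state, action, reward, next_state, terminal)
# tuples from which training batches are sampled at random.
replay_memory = deque(maxlen=1000)
```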
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz.
25k annotated faces on images downloaded from Flickr.
380k manually annotated facial landmarks.
Detection Faces DDFD Train
Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection over Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative.
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
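
Three rescalings per octave means consecutive scale factors differ by 2^(1/3). A small sketch of the resulting list of factors, where the function name and the scale range are assumptions chosen for the example:

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors from min_scale up to max_scale, per_octave steps per doubling."""
    if min_scale <= 0:
        raise ValueError("min_scale must be positive")
    scales = []
    k = 0
    while True:
        s = min_scale * 2 ** (k / per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for floating point error
            break
        scales.append(s)
        k += 1
    return scales
```

With a range of 0.5 to 2.0 (two octaves) this yields seven factors, including the original resolution.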
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
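
Once the network is fully convolutional, a single forward pass is equivalent to sliding the 227x227 window with a fixed stride. The sketch below counts the resulting window positions; the stride of 32 pixels is an assumption (the overall stride of an AlexNet-style network), and `heatmap_size` is a hypothetical helper.

```python
def heatmap_size(height, width, window=227, stride=32):
    """Number of window positions (rows, cols) covered by one forward pass."""
    if height < window or width < window:
        return (0, 0)
    rows = (height - window) // stride + 1
    cols = (width - window) // stride + 1
    return (rows, cols)
```

A 227x227 input gives a single score, and every additional 32 pixels in one direction adds one more position to the output map.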
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the ConvNet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
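
A minimal sketch of greedy NMS, written for clarity rather than speed. The box format (x1, y1, x2, y2), the function names and the 0.5 overlap threshold are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the detections kept, highest score first.

    A detection is dropped when it overlaps an already kept, higher-scoring
    detection by more than iou_thr.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```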
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training.
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
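
The triplets mentioned above are (anchor, positive, negative) embeddings, where the positive shares the anchor's identity and the negative does not. A minimal sketch of the triplet loss on plain Python lists; the margin of 0.2 is an assumption taken from my reading of the FaceNet paper, and the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by at least the margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of triplets matters for training.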
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image database. Visual query: "A dog". Figure: expected outcome.
Objects Recognition Retrieval
Image database. Visual query: "This dog". Figure: expected outcome.
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
Inverted file (columns: word, image ID): 1 1 12 2 1 30 1023 10 124 23 6 10
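
Because the bag of visual words is highly sparse, an inverted file (visual word → IDs of the images containing it) makes search efficient. A minimal sketch with made-up data; the function names and the simple vote-count ranking are assumptions for illustration, not the scoring used in any particular system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def query(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only the lists of the words present in the query are visited, so the cost does not depend on the size of the vocabulary.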
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
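
A minimal sketch of sum/max pooling: each of the C feature maps of a conv layer is reduced to one number, giving a C-dimensional global descriptor. The list-of-lists layout, the function name and the final L2 normalisation are assumptions made for this illustration.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Reduce C feature maps (each an H x W list of lists) to an L2-normalised C-D vector."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    descriptor = [reduce_fn(v for row in channel for v in row)
                  for channel in feature_maps]
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor
```

The descriptor length depends only on the number of channels, so images of different sizes can be compared directly.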
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
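
With the numbers above, a 42x32 feature map gives 42 * 32 = 1344 local features per image, each assigned to one of the 25K centroids. A minimal sketch of that assignment and of the resulting sparse bag of words, using a toy codebook; the function names and the brute-force nearest-centroid search are assumptions for illustration.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bow_encode(local_features, centroids):
    """Sparse bag of words: {visual word ID: number of local features assigned to it}."""
    histogram = {}
    for feature in local_features:
        word = nearest_centroid(feature, centroids)
        histogram[word] = histogram.get(word, 0) + 1
    return histogram
```

At most 1344 of the 25K dimensions can be non-zero for one image, which is what makes the representation sparse.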
Objects Recognition Retrieval
Query representation: Global Search (GS) and Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three ConvNets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max.
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
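
A minimal sketch of a pixel-wise soft-max: the per-class scores at every pixel are turned into probabilities independently of the neighbouring pixels. The nested-list layout `scores[y][x][class]` and the function names are assumptions made for this illustration.

```python
import math


def pixelwise_softmax(scores):
    """scores[y][x] is a list of per-class scores; returns per-pixel class probabilities."""
    probs = []
    for row in scores:
        out_row = []
        for pixel_scores in row:
            top = max(pixel_scores)  # subtract the maximum for numerical stability
            exps = [math.exp(v - top) for v in pixel_scores]
            total = sum(exps)
            out_row.append([v / total for v in exps])
        probs.append(out_row)
    return probs


def label_map(probs):
    """Most probable class index at every pixel."""
    return [[max(range(len(p)), key=lambda c: p[c]) for p in row] for row in probs]
```

Since every pixel is classified on its own, nothing in this step enforces that neighbouring pixels receive consistent labels.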
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
C2 will be labelled with the class of C5.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
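
A minimal sketch of the selection rule stated above, on a toy tree given as parent pointers. The node names, the cost values and the function name `optimal_cover` are made up for the example; in the method the cost S of each component comes from the classifier.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Toy tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.5, "C5": 0.3, "C6": 0.4, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

In this toy example C2 is covered by C5, so C2 takes the class predicted for C5, while C1 and C3 keep their own prediction.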
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Hierarchical segmenter
- Cuts algorithm
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the two are concatenated as [1 2].
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: a BBOX CNN (feature vector 1) and a REGION CNN (feature vector 2), concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer → SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
118
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox
7
Slides credit: Marc Bolaños
DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) using DeepBox
Proposals DeepBox Architecture
8
Slides credit: Marc Bolaños
AlexNet architecture (heavier): PASCAL VOC AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): PASCAL VOC AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7
Small drop with the lighter architecture
Proposals DeepBox Training
9
Slides credit: Marc Bolaños
1) Initialize layers with AlexNet weights
2) Train on Sliding Windows
Negative Samples: extract windows by raster scanning
Positive Samples: having GT bounding boxes, they generate samples per instance with a perturbation of
3) Train on Hard Negatives
By using bottom-up proposals from Edge Boxes
If GT overlap <= 0.3 → Negative Samples
If GT overlap >= 0.7 → Positive Samples
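The overlap thresholds above can be sketched in a few lines of Python. This is only an illustration, not the DeepBox code: the box format (x1, y1, x2, y2), the function names, and the choice to ignore proposals that fall between the two thresholds are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Return 1 (positive), 0 (negative) or None (ignored, an assumption)."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return 1
    if best <= neg_thr:
        return 0
    return None
```

For example, a proposal that overlaps a ground-truth box with IoU 0.83 is labelled positive, and one that touches no ground-truth box is labelled negative.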
Proposals DeepBox Results
10
[Figure: DeepBox vs. Edge Boxes]
Slides credit: Marc Bolaños
Proposals DeepBox Results
11
With a rather simple approach, ConvNets can obtain much better results than previous techniques for Object Proposals
Slides credit: Marc Bolaños
Proposals DeepBox Results
12
Slides credit: Marc Bolaños
Proposals DeepBox Results
13
DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for Object Discovery)
Slides credit: Marc Bolaños
One lecture organized in four parts
14
Local analysis for: Proposals, Detection, Recognition, Segmentation
Detection Objects
15
Detection Objects
16
DPM (HOG features)[1] R-CNN [2] SPPnet [3]
Hand-crafted features Deep features
+60
Slide credit Amaia Salvador
Detection Objects
17
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. “Deformable Part Models are Convolutional Neural Networks.” CVPR 2015
Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)
Detection Objects R-CNN
18
Girshick, R., Donahue, J., Darrell, T., & Malik, J. “Rich feature hierarchies for accurate object detection and semantic segmentation.” CVPR 2014
Detection Objects R-CNN
19
Slide credit Joost van de Weijer
Detection Objects R-CNN
20
Slide credit Joost van de Weijer
Detection Objects R-CNN
21
Detection Objects Fast R-CNN
22
Girshick, Ross. “Fast R-CNN.” ICCV 2015
Detection Objects Fast R-CNN
23
Slide credit Amaia Salvador
Detection Objects Fast R-CNN
24
Slide credit Amaia Salvador
Same as SPP[3] but single scale
Detection Objects Fast R-CNN
25
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” PAMI 2015
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
[Figure: RoI pooling on CONV5. An h x w region of interest is divided into an H' x W' grid; size of pooling bins: h/H' x w/W'; max pooling inside each bin]
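A minimal sketch of the RoI pooling described above, for a single channel stored as a list of rows. The function name, the (top, left, h, w) RoI format and the floor/ceil rounding of bin borders are assumptions made for illustration; real implementations work on tensors and may round differently.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows; roi: (top, left, h, w) in feature-map cells.
    Each bin is roughly h/out_h x w/out_w cells (floor start, ceil end).
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        r0 = top + (i * h) // out_h
        r1 = top + -((-(i + 1) * h) // out_h)  # ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -((-(j + 1) * w) // out_w)  # ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the RoI, the output always has out_h x out_w values, which is what lets the fully connected layers that follow accept regions of any shape.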
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
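For reference, the multi-task loss can be written as below, following the notation of the Fast R-CNN paper (the formula itself is not in this transcript): p is the predicted class distribution, u the ground-truth class (u = 0 is background), t^u the predicted box offsets for class u, v the ground-truth offsets, and lambda a balancing weight.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_{u}
```

The indicator [u >= 1] switches off the localization term for background regions, which have no ground-truth box.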
Detection Objects Faster R-CNN
29
Ren, S., He, K., Girshick, R., & Sun, J. (2015). “Faster R-CNN: Towards real-time object detection with region proposal networks.” In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores (class probabilities)]
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores (class probabilities)]
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
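A short sketch of how the k = 9 anchors at one sliding position could be generated. The scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper, not values given on this slide, and the function name is hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

The RPN then predicts, for each of these k boxes, two objectness scores and four box-regression offsets.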
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores (class probabilities)]
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores (class probabilities)]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
[Diagram: Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5 → RPN → RPN Proposals]
Step 1: Train RPN, initialized with an ImageNet pre-trained model
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
[Diagram: Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5; RPN Proposals (learned in Step 1) → Class probabilities]
Step 2: Train Fast R-CNN with the learned RPN proposals
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
[Diagram: Conv layers (weights from Step 2, fixed) → Conv Layer 5 → RPN → RPN Proposals]
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
[Diagram: Conv layers (weights from Steps 2 & 3, fixed) → Conv Layer 5; RPN Proposals (learned in Step 3) → Class probabilities]
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Detection Objects Reinforcement Learning
46
Caicedo, Juan C., and Svetlana Lazebnik. “Active object localization with deep reinforcement learning.” ICCV 2015 [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
47
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
48
Slide credit: Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
49
Slide credit: Míriam Bellver
Set of actions A
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
50
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from the fc6 layer of a pre-trained CNN (4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement
51
Slide credit: Míriam Bellver
Reward Function R
[Figure label: ground-truth bounding box]
Detection Objects Reinforcement
52
Slide credit: Míriam Bellver
Reward Function R for trigger action
The reward function considers the number of steps as a cost
Values annotated on the formula: 3; minimum IoU = 0.6
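The two reward functions can be sketched as follows. The formulas follow the Caicedo and Lazebnik paper rather than this transcript, where they were images: a move is rewarded by the sign of the IoU change, and the trigger earns +eta when the final IoU reaches the minimum and -eta otherwise. Reading the "3" on the slide as eta, and returning -1 for a move that leaves the IoU unchanged, are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(box_before, box_after, gt_box):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(box_after, gt_box) > iou(box_before, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the search stops on a good enough box, -eta otherwise."""
    return eta if iou(box, gt_box) >= min_iou else -eta
```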
Detection Objects Reinforcement
53
Slide credit: Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
54
Slide credit: Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
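A minimal sketch of the three ingredients named above: a replay memory, the greedy policy over the Q-network outputs, and the standard Q-learning target. The class and function names, the memory capacity and the discount factor gamma = 0.9 are illustrative assumptions, and the Q-network itself is left out (its outputs are passed in as a plain list).

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch used to update the Q-network."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target: r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)
```

Sampling past experiences at random, instead of training on consecutive steps, is what the replay memory is for: it breaks the correlation between successive states of one search sequence.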
Detection Objects Reinforcement
55
Slide credit: Míriam Bellver
Detection Objects Reinforcement
56
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
57
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement
58
Slide credit: Míriam Bellver
Detection Objects Reinforcement
59
Slide credit: Míriam Bellver
Detection Faces
60
Detection Faces DDFD
61
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. “Multi-view Face Detection Using Deep Convolutional Neural Networks.” ICMR 2015 [software]
Detection Faces DDFD Train
62
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
25k annotated faces on images downloaded from Flickr
380k manually annotated facial landmarks
Detection Faces DDFD Train
63
Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
64
Test images are rescaled up/down 3 times per octave to find faces of different sizes
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
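The multi-scale sliding-window test procedure can be sketched as below. Only the 227x227 window and the 3 scales per octave come from the slides; the octave range, the stride of 32 pixels and the function names are assumptions.

```python
def pyramid_scales(min_octave, max_octave, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) from 2**min_octave to 2**max_octave."""
    first = min_octave * steps_per_octave
    last = max_octave * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]


def window_positions(width, height, window=227, stride=32):
    """Top-left corners of every window x window crop that fits in the image."""
    return [(x, y)
            for y in range(0, height - window + 1, stride)
            for x in range(0, width - window + 1, stride)]
```

With one octave down and one octave up, pyramid_scales(-1, 1) yields 7 rescaled copies of the image, and the detector is run on every window of every copy.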
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015
Detection Faces DDFD Test
67
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
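A compact sketch of greedy NMS in plain Python. It is not the code from the cited source; the box format (x1, y1, x2, y2), the function names and the 0.5 overlap threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop the boxes that overlap it too much, repeat.

    Returns the indices of the kept boxes, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```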
Detection Faces DDFD Results
69
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
71
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
72
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
73
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
74
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
75
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
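In the FaceNet paper the triplets feed a triplet loss, which this transcript does not spell out. A minimal sketch for one triplet of embeddings follows; the function names are hypothetical and the margin of 0.2 is the value reported in the paper, taken here as an assumption.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

A triplet that already satisfies the margin contributes no loss, which is why the choice of triplets matters: only the violating ones produce a gradient.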
Faces Recognition FaceNet
76
Faces Recognition FaceNet
77
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
78
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet
79
Faces Recognition FaceNet Test
80
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
81
Software implementation: OpenFace
Faces Recognition VGG Face
82
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference (BMVC) 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
83
Mohedano E., Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N., and Marqués F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016
Objects Recognition Retrieval
84
Image Database
Visual Query: “A dog”
Expected outcome
Objects Recognition Retrieval
85
Image Database
Visual Query: “This dog”
Expected outcome
Objects Recognition Retrieval
86
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
87
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
Bag of Visual Words: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn); high-dimensional, highly sparse
INVERTED FILE: table mapping each visual word to the IDs of the images that contain it
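Because bag-of-visual-words vectors are highly sparse, an inverted file lets a query touch only the images that share at least one word with it. A minimal sketch, with hypothetical function names and a simple vote count standing in for a real similarity score such as tf-idf:

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: {image_id: iterable of visual-word ids} -> {word: sorted image ids}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def rank_candidates(index, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))
```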
Objects Recognition Retrieval
88
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Objects Recognition Retrieval
89
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
90
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks
Sum/max pooled conv features as global representation
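A minimal sketch of sum or max pooling over the spatial positions of each channel, giving one value per channel. The nested-list layout (channels, rows, columns), the function names and the L2 normalization step are assumptions; the cited papers add further steps such as weighting or whitening.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of rows.

    Returns an L2-normalised C-dimensional global descriptor.
    """
    reduce_fn = sum if mode == "sum" else max
    descriptor = [float(reduce_fn(value for row in channel for value in row))
                  for channel in feature_maps]
    norm = math.sqrt(sum(x * x for x in descriptor))
    return [x / norm for x in descriptor] if norm > 0 else descriptor


def cosine_similarity(u, v):
    """Dot product, which equals the cosine for L2-normalised descriptors."""
    return sum(a * b for a, b in zip(u, v))
```

Images are then ranked by the similarity between the query descriptor and each database descriptor.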
Objects Recognition Retrieval
91
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks
Conv features encoded with VLAD as global representation
Objects Recognition Retrieval
92
Objects Recognition Retrieval
93
Input resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids → 25K-D vector
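The encoding above can be sketched as follows: each of the 42x32 local conv features is assigned to its nearest centroid (its visual word), and the image is described by the histogram of assignments, one bin per centroid. The function names are hypothetical and the exhaustive nearest-centroid search is only for illustration; with 25K centroids an approximate search would normally be used.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bow_histogram(local_features, centroids):
    """Count how many local features fall on each centroid."""
    counts = Counter(nearest_centroid(f, centroids) for f in local_features)
    return [counts.get(k, 0) for k in range(len(centroids))]
```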
Objects Recognition Retrieval
94
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
95
One lecture organized in four parts
96
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
98
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
101
Upsampling and concatenation
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Objects Segmentation Farabet
103
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 & 2: Observation level
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 & 2: Observation level
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
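The selection rule above can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost S. The tree encoding (a parent dictionary), the function name and the costs in the usage example are hypothetical.

```python
def optimal_components(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on its path to the root.

    parent: {node: parent node, or None for the root}; cost: {node: cost S}.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best
```

With made-up costs in which C5 is cheaper than its child C2, the leaf C2 is assigned to C5 and therefore receives the class of C5, as in the example on the slide:

```python
parent = {"C1": "C5", "C2": "C5", "C5": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.5, "C5": 0.3, "C7": 0.9}
print(optimal_components(parent, cost, ["C1", "C2"]))
```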
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. “Simultaneous Detection and Segmentation” (ECCV 2014)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
[Diagram, network A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the two are concatenated as [1 2]]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training: the 2 networks are trained in isolation
Testing: results are combined
[Diagram, network B: the BBOX CNN extracts feature vector 1 and the REGION CNN extracts feature vector 2; the two are concatenated as [1 2]]
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
[Diagram: network C]
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index), used to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
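A minimal sketch of the pixel IU (Jaccard index) for one category, computed between a predicted and a ground-truth label map stored as lists of rows. The function name and the choice to return NaN when the category is absent from both maps are assumptions.

```python
def pixel_iu(pred, gt, category):
    """Intersection over union of the pixels labelled `category` in both maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt
            union += in_pred or in_gt
    return inter / union if union else float("nan")
```

Benchmarks usually report this value averaged over all categories.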
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
One lecture organized in four parts
118
Local analysis for: Proposals, Detection, Recognition, Segmentation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox Architecture
Slides credit: Marc Bolaños
AlexNet architecture (heavier), PASCAL VOC: AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter), PASCAL VOC: AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7
Small drop in AUC with the lighter architecture
Proposals DeepBox Training
Slides credit: Marc Bolaños
1) Initialize layers with AlexNet weights
2) Train on Sliding Windows
- Negative Samples: extract windows by raster scanning
- Positive Samples: having GT bounding boxes, they generate samples per instance with a perturbation of
3) Train on Hard Negatives
- By using bottom-up proposals from Edge Boxes
- If GT overlap ≤ 0.3 → Negative Samples
- If GT overlap ≥ 0.7 → Positive Samples
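The overlap thresholds above can be illustrated with a minimal sketch. It assumes boxes are (x1, y1, x2, y2) tuples; the function names and the "ignored" label for proposals that fall between the two thresholds are hypothetical and not taken from the DeepBox paper.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal by its best overlap with any ground-truth box."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between proposals are not used
```

For example, a proposal identical to a ground-truth box is labelled positive, and one that does not touch any ground-truth box is labelled negative.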
Proposals DeepBox Results
Slides credit: Marc Bolaños
[Figure: example proposals, DeepBox vs. Edge Boxes]
With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.
DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for object discovery).
Detection Objects
Slide credit: Amaia Salvador
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
+60
ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015.
Slide credit: Amaia Salvador
The RoI pooling layer is the same as SPP [3], but with a single scale.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.
Slide credit: Joost van de Weijer
Slide credit: Amaia Salvador
[Figure: an h x w region of interest on the CONV5 feature map is max-pooled into a fixed H' x W' grid]
Size of pooling bins: h/H' x w/W'
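The bin size h/H' x w/W' can be made concrete with a minimal single-channel sketch of RoI max pooling. It assumes the feature map is a plain 2D list and the region is given in feature-map cells; the function name and the floor/ceil rounding of bin borders are illustrative choices, not necessarily those of the Fast R-CNN implementation.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region of a 2D feature map into an out_h x out_w grid.

    roi is (top, left, h, w) in feature-map cells (hypothetical convention).
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i spans rows [r0, r1): floor for the start, ceil for the end,
        # so bins cover the whole region even when h is not divisible.
        r0 = top + (i * h) // out_h
        r1 = top + -((-(i + 1) * h) // out_h)
        r1 = max(r1, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -((-(j + 1) * w) // out_w)
            c1 = max(c1, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the input region, the output always has out_h x out_w values, which is what lets the following fully connected layers accept proposals of any shape.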
Slide credit: Amaia Salvador
Base networks: AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
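For reference, the multi-task loss can be written as follows. The notation is the one used in the Fast R-CNN paper rather than on the slide: p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the regression targets, and lambda a balancing weight.

```latex
L(p, u, t^u, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^u, v)

L_{\mathrm{cls}}(p, u) = -\log p_u

L_{\mathrm{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i),
\qquad
\mathrm{smooth}_{L_1}(z) =
\begin{cases}
0.5 \, z^2 & \text{if } |z| < 1 \\
|z| - 0.5 & \text{otherwise}
\end{cases}
```

The indicator [u >= 1] switches off the box regression term for background regions, which have no ground-truth box.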
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.
[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). "Segmentation as selective search for object recognition." In IEEE International Conference on Computer Vision (ICCV) 2011 (pp. 1879-1886).
[CPMC] Carreira, J., & Sminchisescu, C. (2010). "Constrained parametric min-cuts for automatic object segmentation." In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2010 (pp. 3241-3248).
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). "Multiscale combinatorial grouping." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Faster R-CNN replaces the usage of external object proposals with a Region Proposal Network (RPN).
Slide credit: Amaia Salvador
[Figure: Conv layers -> Conv Layer 5 -> RPN -> RPN proposals -> RoI pooling layer -> FC layers -> class scores / class probabilities]
For each of the k anchors at a sliding-window position, the RPN outputs:
- Objectness scores (object / no object)
- Bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
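A minimal sketch of how the k = 9 anchors at one position could be generated. The default scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the ones reported in the Faster R-CNN paper, not values shown on the slide, and the function name is hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

The RPN then predicts one objectness score pair and one set of box offsets per anchor, so each sliding-window position yields 2k scores and 4k regression outputs.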
Slide credit: Amaia Salvador
[Figure: the same pipeline, with the Fast R-CNN part highlighted: RoI pooling layer -> FC layers -> class scores / class probabilities]
4-step training to share features between the RPN and Fast R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Slide credit: Amaia Salvador
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Slide credit: Míriam Bellver
The object is localized based on visual features from AlexNet FC6.
Set of actions A
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Set of states S
Each state is a pair (o, h):
- o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)
- h = history of taken actions (binary vector, 90 dimensions)
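A minimal sketch of how such a state could be assembled. It assumes, following the paper, that the 90-dimensional history is the one-hot encoding of the last 10 actions over 9 possible actions; the function names and the slot ordering are hypothetical.

```python
NUM_ACTIONS = 9    # assumption from the paper: 8 transformations + 1 trigger
HISTORY_LEN = 10   # assumption from the paper: last 10 actions are remembered


def encode_history(past_actions):
    """One-hot encode the most recent actions into a 90-dim binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    recent = past_actions[-HISTORY_LEN:]
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def make_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action history h."""
    return list(o) + encode_history(past_actions)
```

With a 4096-dimensional fc6 vector, the resulting state has 4096 + 90 = 4186 dimensions.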
Reward Function R
[Figure: the reward depends on the overlap with the ground-truth bounding box]
Reward Function R for the trigger action
- Reward magnitude: 3
- Minimum IoU: 0.6
The reward function considers the number of steps as a cost.
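The two rewards can be sketched as below. The values 3 and 0.6 come from the slide; the function names, and the choice of a negative reward when the overlap does not improve, are assumptions based on the description in the paper.

```python
def move_reward(iou_before, iou_after):
    """Reward for a transformation action: +1 if the IoU with the
    ground-truth box improved, -1 otherwise (assumed convention)."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou_now, eta=3.0, tau=0.6):
    """Reward for the trigger action: +eta if the final box overlaps the
    ground truth by at least tau, -eta otherwise."""
    return eta if iou_now >= tau else -eta
```

Because every non-improving step is penalised, longer searches accumulate less reward, which is one way to read the remark that the number of steps acts as a cost.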
Policy function
If the current state is S, which should be the next action A?
The policy is learned with reinforcement learning, using Q-learning.
- The action-value function is estimated using a neural network that has as many output units as actions.
- The algorithm incorporates a replay memory to collect experiences.
- The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
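The policy and the learning target can be sketched in a few lines. This is a generic Q-learning illustration, not the authors' code: the discount factor, the memory size and all names are assumptions, and the Q-values would in practice come from the network outputs.

```python
from collections import deque


def greedy_action(q_values):
    """Pick the action whose estimated action-value is highest."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, terminal, gamma=0.9):
    """Q-learning target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Replay memory: a bounded buffer of (state, action, reward, next_state, terminal)
# experiences; training batches would be sampled from it at random.
replay_memory = deque(maxlen=1000)
replay_memory.append(("s0", 2, 1.0, "s1", False))
```

The network is trained to move its output for the taken action towards q_target, while greedy_action is what the agent uses at test time.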
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Training samples are randomly sampled sub-windows (blocks):
- Positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%
- Negative example otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
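Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3). A minimal sketch, where the octave range and the function name are assumptions made for illustration:

```python
def pyramid_scales(min_octave=-2, max_octave=1, steps_per_octave=3):
    """Scale factors from 2**min_octave to 2**max_octave, in steps of 2**(1/steps)."""
    first = min_octave * steps_per_octave
    last = max_octave * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]
```

Each scale factor would be applied to the test image before running the detector, so that a fixed-size window can match faces of different sizes.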
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the ConvNet on images of any size
- Obtain a heat-map of the face detector
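The idea can be shown on a toy case: a fully connected unit trained on k x k inputs is reinterpreted as a k x k convolution filter, so a larger image yields a grid of scores instead of a single one. This is a single-channel, single-unit sketch with hypothetical names, not the DDFD implementation.

```python
def fc_as_conv(image, weights, bias):
    """Apply a k x k fully connected unit at every position of a larger image.

    image and weights are 2D lists; the result is a heat-map of scores.
    """
    k = len(weights)
    height, width = len(image), len(image[0])
    heat_map = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            score = bias + sum(weights[i][j] * image[y + i][x + j]
                               for i in range(k) for j in range(k))
            row.append(score)
        heat_map.append(row)
    return heat_map
```

On an input of exactly k x k the output is a single value, identical to the original fully connected unit; on larger inputs each entry of the heat-map is the detector response for one window.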
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
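A minimal sketch of greedy, score-based NMS. It is a generic variant rather than the implementation from the cited source, and the 0.5 overlap threshold is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thr=0.5):
    """Keep the best-scoring detections, dropping any box that overlaps an
    already kept box by more than iou_thr.

    detections is a list of (score, box) pairs.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_thr for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

Applied to the sliding-window heat-map, this leaves one detection per face instead of a cluster of overlapping windows.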
Detection Faces DDFD Results
[Figure: precision vs. recall curves]
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
The embedding is learned by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
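The role of the triplets can be illustrated with a minimal sketch of a triplet loss on already computed embeddings. The squared Euclidean distance and the margin of 0.2 follow the FaceNet paper; the function names are hypothetical, and the selection of hard triplets is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the anchor is closer to the positive (same identity) than
    to the negative (different identity) by at least the margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

Triplets that already satisfy the margin contribute no gradient, which is why choosing informative triplets matters during training.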
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
[Figure: a visual query is matched against an image database to produce the expected outcome]
- Visual query "A dog"
- Visual query "This dog": Instance Retrieval (instance: object, building, person, place…)
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
[Table: each visual word is mapped to the IDs of the images that contain it]
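An inverted file takes advantage of that sparsity: only the images that share at least one visual word with the query are touched. A minimal sketch, with hypothetical names and made-up image and word IDs:

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it.

    image_words is {image_id: iterable of visual-word IDs}.
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()
```

A real system would typically weight the votes (for example with tf-idf) instead of counting shared words, but the lookup structure stays the same.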
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). "Neural codes for image retrieval." In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). "CNN features off-the-shelf: an astounding baseline for recognition." In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). "Aggregating local deep features for image retrieval." ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). "Particular object retrieval with integral max-pooling of CNN activations." ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). "Cross-dimensional Weighting for Aggregated Deep Convolutional Features." arXiv preprint arXiv:1512.04065.
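Sum and max pooling of conv features both reduce each channel of a conv layer to a single number, giving one global descriptor per image. A minimal sketch on plain nested lists, with a hypothetical function name; normalisation and the weighting schemes from the cited papers are left out.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps (each an H x W 2D list) into a C-dim vector."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in channel for value in row)
            for channel in feature_maps]
```

The descriptor length equals the number of channels, regardless of the spatial size of the input image.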
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). "Exploiting local features from deep networks for image retrieval." In DeepVision CVPRW 2015.
Bags of local convolutional features:
- Image resolution: 336x256
- Local features: conv5_1 from VGG16 [1], on a 42x32 feature map
- Codebook: 25K centroids, giving a 25K-D vector
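The encoding step can be sketched as follows: every local conv feature (one per position of the feature map) is assigned to its nearest centroid, and the image is described by the histogram of assignments. The names are hypothetical and the brute-force search is only for illustration; with 25K centroids an approximate nearest-neighbour search would be more realistic.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bow_vector(local_features, centroids):
    """Histogram of visual-word assignments, one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram
```

With a 42x32 feature map there are at most 1344 local features per image, so a 25K-dimensional histogram is very sparse and fits an inverted file well.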
Query Representation
- Global Search (GS)
- Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
- Pyramid of three spatial scales
- The same parameters in the three ConvNets: theta_i = theta_0 (filter weights (H_l) and biases (b_l))
- Non-linearity: tanh. Pooling: max
- Upsampling and concatenation
- Pixel-wise soft-max classifier
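A pixel-wise soft-max classifier turns the class scores at each pixel into a probability distribution and keeps the most likely class. A minimal sketch on nested lists, with hypothetical names; the scores would in practice come from the upsampled, concatenated multiscale features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of class scores; return the label map."""
    labels = []
    for row in score_map:
        out = []
        for scores in row:
            probs = softmax(scores)
            out.append(probs.index(max(probs)))
        labels.append(out)
    return labels
```

Each pixel is labelled independently of its neighbours, so nothing in this step enforces that adjacent pixels agree.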
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
Example: C2 will be labelled with the class of C5.
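The selection of the optimal component can be sketched as a walk from a leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree encoding (a child-to-parent dictionary), the node names and the cost values are hypothetical.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from leaf to root.

    parent maps each node to its parent (the root has no entry);
    cost maps each node to its cost S.
    """
    best = leaf
    node = leaf
    while node in parent:
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best
```

With parent = {"pixel": "C2", "C2": "C5"} and costs 0.9, 0.6 and 0.2 for "pixel", "C2" and "C5", the function returns "C5", so everything under C5 takes its class.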
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Three feature extraction variants:
- A: a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2]. The network is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
- B: a BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2. Training: the 2 networks are trained in isolation. Testing: results are combined as [1 2].
- C: Training: the network is trained as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).
The output of the penultimate fully connected layer is fed to an SVM.
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
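The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both maps, divided by the number labelled with it in at least one. A minimal sketch on label maps stored as nested lists; the function name and the convention of returning 0.0 when the category is absent from both maps are assumptions.

```python
def pixel_iu(pred, gt, category):
    """Jaccard index of one category between a predicted and a ground-truth label map."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == category
            in_gt = g == category
            inter += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return inter / union if union else 0.0  # assumed convention for an absent category
```

Averaging this value over all categories gives a single score for a segmentation benchmark.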
Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Proposals DeepBox Training
9
Slides credit Marc Bolantildeos
1) Initialize layers with AlexNet weights 3) Train on Hard Negatives
2) Train on Sliding WindowsNegative SamplesExtract windows by raster scanning
Positive SamplesHaving GT bounding boxes they
generate samples per instance
with a perturbation of
By using bottom-up proposals from Edge boxes
If GT overlap threshold lt= 03 rarr Negative Samples
If GT overlap threshold gt= 07 rarr Positive Samples
Proposals DeepBox Results
Slides credit: Marc Bolaños
[Figure: DeepBox vs Edge Boxes]
With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.
DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for object discovery).
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Detection Objects
[Figure: hand-crafted features vs deep features: DPM (HOG features) [1], R-CNN [2], SPPnet [3]; +60]
Slide credit: Amaia Salvador
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.
ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015.
Slide credit: Amaia Salvador
RoI pooling layer: same as SPP [3], but with a single scale.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
[Figure: RoI pooling on CONV5. A region of size h x w on the feature map is divided into H' x W' pooling bins of size h/H' x w/W', and max pooling is applied inside each bin.]
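The pooling scheme in the figure can be sketched with NumPy. This is an illustrative version under assumptions: the region is given in feature-map cells as (y0, x0, y1, x1) with exclusive ends, bin borders are rounded outwards so no bin is empty, and the name `roi_max_pool` is hypothetical.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a (C, H, W) feature map into (C, out_h, out_w)."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ys = y0 + (i * h) // out_h                      # floor of the bin start
        ye = y0 + ((i + 1) * h + out_h - 1) // out_h    # ceil of the bin end
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + ((j + 1) * w + out_w - 1) // out_w
            pooled[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return pooled
```

Whatever the size of the region, the output has a fixed size, which is what lets the fully connected layers that follow accept it.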
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
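The slide only names the loss. As described in the Fast R-CNN paper, it adds a log loss on the class probabilities to a smooth L1 loss on the box offsets of the true class, and the box term is skipped for background (class 0). The sketch below follows that description; the function names and the weight `lam` are chosen for the example.

```python
import numpy as np


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """class_probs: (K+1,) softmax output; box_pred: (K+1, 4) per-class offsets."""
    cls_loss = -np.log(class_probs[true_class])
    if true_class == 0:          # background: no box regression term
        return float(cls_loss)
    diff = np.asarray(box_pred[true_class]) - np.asarray(box_target)
    return float(cls_loss + lam * smooth_l1(diff).sum())
```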
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals with a Region Proposal Network (RPN).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: conv layers up to Conv Layer 5; RPN producing RPN proposals; RoI pooling layer, FC layers, class scores and class probabilities]
RPN outputs for each of the k anchor boxes at a location:
- Objectness scores (object / no object)
- Bounding box regression
In practice, k = 9 (3 different scales and 3 aspect ratios).
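A small sketch of how the k = 9 reference boxes could be generated. The concrete values (base size 16, scales 8/16/32, ratios 0.5/1/2) are assumptions taken from the Faster R-CNN paper's defaults rather than from the slide, and `make_anchors` is a hypothetical helper.

```python
import numpy as np


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (len(ratios) * len(scales), 4) boxes (x1, y1, x2, y2) centred at 0."""
    anchors = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2   # box area is fixed per scale
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)
```

At every position of the conv feature map the same 9 boxes are shifted to that position, and the RPN predicts an objectness score and a box refinement for each of them.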
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: conv layers up to Conv Layer 5; RPN producing RPN proposals; RoI pooling layer, FC layers, class scores and class probabilities. The stage after the RPN proposals is Fast R-CNN.]
4-step training to share features between the RPN and Fast R-CNN:
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6.
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from the pre-trained CNN (fc6), 4096 dimensions
h = history of taken actions, binary vector of dimension 90
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward function R [Figure: ground-truth box and current bounding box]
The reward function considers the number of steps as a cost.
Reward function R for the trigger action: magnitude 3, minimum IoU 0.6
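The reward equations were images on the slides and are missing from the transcript. The sketch below follows the rewards as described in the Caicedo and Lazebnik paper, which is an assumption about what the slides showed: a move is rewarded by whether it improves the IoU with the ground truth, and the trigger earns +3 if the final IoU reaches 0.6 and -3 otherwise. The function names are hypothetical.

```python
def move_reward(iou_before, iou_after):
    """+1 if the transformation improved the overlap with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou_final, eta=3.0, tau=0.6):
    """+eta if the search stops on a box with IoU >= tau, else -eta."""
    return eta if iou_final >= tau else -eta
```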
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
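A minimal sketch of the two pieces named above: picking the action with the highest estimated value, and a replay memory that stores experiences for later training. The epsilon exploration parameter, the memory size and all names are assumptions for illustration; the Q-network itself is left out and represented only by its output vector.

```python
import random
from collections import deque

import numpy as np


def select_action(q_values, epsilon=0.0):
    """Greedy policy over the Q-network outputs, with optional random exploration."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


replay_memory = deque(maxlen=1000)   # oldest experiences are dropped first


def remember(state, action, reward, next_state):
    replay_memory.append((state, action, reward, next_state))


def sample_batch(batch_size):
    """Random mini-batch of stored experiences for a Q-learning update."""
    return random.sample(list(replay_memory), min(batch_size, len(replay_memory)))
```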
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.
Total samples: 200K positive and 20M negative.
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
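One way to read "3 times per octave": consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. The sketch below lists the scale factors between two limits; the limits and the name `pyramid_scales` are assumptions for the example.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) lying in [min_scale, max_scale]."""
    k_min = math.ceil(steps_per_octave * math.log2(min_scale))
    k_max = math.floor(steps_per_octave * math.log2(max_scale))
    return [2.0 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]
```

A fixed-size detector applied to every rescaled copy then responds to faces of different sizes in the original image.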
Detection Faces DDFD Test
Sliding window of 227x227 over the test image.
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
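The conversion can be checked numerically on a toy example: a fully connected layer that expects a fixed window x window input is reshaped into convolution kernels of that size, and sliding them over a larger feature map gives one output per position, that is, a heat-map. This is a plain NumPy sketch with hypothetical names, not the DDFD implementation.

```python
import numpy as np


def fc_as_conv(feature_map, fc_weights, fc_bias, window):
    """Apply an FC layer as a convolution.

    feature_map: (C, H, W); fc_weights: (num_out, C * window * window).
    Returns a (num_out, H - window + 1, W - window + 1) heat-map.
    """
    c, h, w = feature_map.shape
    kernels = fc_weights.reshape(-1, c, window, window)
    out = np.empty((kernels.shape[0], h - window + 1, w - window + 1))
    for y in range(out.shape[1]):
        for x in range(out.shape[2]):
            patch = feature_map[:, y:y + window, x:x + window]
            # Same dot product the FC layer would compute on this window.
            out[:, y, x] = np.tensordot(kernels, patch, axes=3) + fc_bias
    return out
```

Each heat-map cell equals the FC output for the corresponding window, but the convolution shares the computation between overlapping windows when the earlier layers are convolutional too.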
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
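A compact sketch of greedy NMS: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The box format (x1, y1, x2, y2) and the 0.5 threshold are assumptions for the example; this is a generic version, not the code from the cited post.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Overlap of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep
```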
Detection Faces DDFD Results
Precision vs. recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
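The triplet idea can be written down in a few lines: an anchor face should be closer to a positive (same person) than to a negative (different person) by at least a margin. The sketch follows the triplet loss as formulated in the FaceNet paper; the margin value 0.2 is the one reported there and is an assumption with respect to the slides, as is the function name.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over a batch of (N, D) embedding triplets."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)   # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)   # squared distance to other identity
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of triplets matters for training.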
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision, ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
[Figure: image database, visual query "A dog", expected outcome]
[Figure: image database, visual query "This dog", expected outcome]
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE [Table: word → image IDs]
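Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the IDs of the images that contain it. A toy sketch follows; the image IDs and word IDs in the usage example are made up, and ranking by number of shared words is a simplification of real scoring schemes.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: {image_id: iterable of visual word IDs} -> {word: set of image IDs}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidates(index, query_words):
    """Images sharing at least one word with the query, most shared words first."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))
```

Only the lists of the words present in the query are visited, so images with no word in common are never touched.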
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
FC layers as global feature representation:
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Sum/max pooled conv features as global representation:
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
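Sum or max pooling collapses the spatial grid of a conv layer into one value per channel, which gives a compact global descriptor. The sketch below is generic and does not reproduce any of the cited methods; the final L2 normalisation and the name `global_descriptor` are assumptions.

```python
import numpy as np


def global_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) activation volume into an L2-normalised (C,) vector."""
    flat = conv_features.reshape(conv_features.shape[0], -1)
    pooled = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```

Two images can then be compared with a dot product between their descriptors.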
Objects Recognition Retrieval
Convolutional Neural Networks
Conv features encoded with VLAD as global representation:
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
[Figure: image at (336x256) resolution; conv5_1 from VGG16 [1]; (42x32) feature map; 25K centroids; 25K-D vector]
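The pipeline in the figure can be sketched as follows: every spatial position of the conv feature map is a local feature, each one is assigned to its nearest centroid (visual word), and the image is described by the histogram of assignments. This is an illustrative brute-force version for small sizes only; the names are hypothetical, and with 25K centroids a dedicated nearest-neighbour search would be needed.

```python
import numpy as np


def assignment_map(conv_features, centroids):
    """conv_features: (C, H, W); centroids: (K, C). Returns (H, W) word indices."""
    c, h, w = conv_features.shape
    local = conv_features.reshape(c, -1).T                       # (H*W, C) local features
    d2 = ((local[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(h, w)


def bow_histogram(assignments, num_words):
    """Count how often each visual word appears in the assignment map."""
    return np.bincount(assignments.ravel(), minlength=num_words)
```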
Objects Recognition Retrieval
Query representation:
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh. Pooling: max.
Upsampling and concatenation
Pixel-wise soft-max classifier
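The last two steps can be sketched with NumPy: feature maps from the coarser scales are upsampled to the finest resolution, concatenated along the channel axis, and a linear classifier followed by a soft-max is applied independently at each pixel. Nearest-neighbour upsampling, the single linear layer and all names are assumptions for the example.

```python
import numpy as np


def upsample_nearest(features, factor):
    """Repeat each cell of a (C, H, W) map `factor` times along both spatial axes."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)


def pixelwise_softmax(scores):
    """Soft-max over the class axis of a (num_classes, H, W) score volume."""
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def label_pixels(feature_maps, factors, weights):
    """Concatenate upsampled maps and classify every pixel.

    weights: (num_classes, total_channels) linear classifier.
    """
    stacked = np.concatenate(
        [upsample_nearest(f, k) for f, k in zip(feature_maps, factors)], axis=0)
    scores = np.tensordot(weights, stacked, axes=1)    # (num_classes, H, W)
    return pixelwise_softmax(scores)
```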
Objects Segmentation Farabet
Problem: no spatial consistency among labels.
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level.
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
[Figure example: C2 will be labelled with the class of C5.]
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
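The selection rule can be sketched on a small segmentation tree. Each node has a parent and a cost; for every leaf, the path to the root is walked and the node with the lowest cost is kept. The tree encoding (a `parent` dictionary with None for the root) and the node names and costs in the usage example are assumptions, not the values from the slide.

```python
def optimal_cover(parent, cost, leaves):
    """Map each leaf to the minimum-cost node on its path to the root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)    # None once the root has been visited
        best[leaf] = best_node
    return best
```

For example, with `parent = {"a": "C2", "b": "C2", "C2": "C5", "C5": None}` and costs `{"a": 0.9, "b": 0.8, "C2": 0.5, "C5": 0.2}`, both leaves select the root component `"C5"`.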
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes.
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2].
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2].
Training: the 2 networks are trained in isolation.
Testing: results are combined.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C:
Training: as a whole (using segmentation overlap).
Testing: results are combined (using the output of the penultimate layer).
[Figure: penultimate fully connected layer, SVM]
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation.
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
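For reference, pixel IU for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below averages over the classes that appear; the averaging choice and the name `pixel_iu` are assumptions, not necessarily the exact protocol behind the reported results.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent from both maps
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```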
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox Results
10
DeepBox Edge Boxes DeepBox Edge Boxes
Slides credit Marc Bolantildeos
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox Results
11
With a rather simple approach ConvNets can obtain much better results than previous techniques for Object Proposals
Slides credit Marc Bolantildeos
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox Results
12
Slides credit Marc Bolantildeos
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox Results
13
Increasing not only Detection capabilities of known classes but also of unknown ones (suitable for Object Discovery)
Slides credit Marc Bolantildeos
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
14
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects
15
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects
16
DPM (HOG features)[1] R-CNN [2] SPPnet [3]
Hand-crafted features Deep features
+60
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects
17
Girshick Ross Forrest Iandola Trevor Darrell and Jitendra Malik Deformable Part Models are Convolutional Neural Networks CVPR 2015
Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015
Slide credit Amaia Salvador
RoI pooling layer: same as SPP [3], but with a single scale
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
[Figure: RoI pooling on the CONV5 feature map. An h x w region is divided into H' x W' pooling bins. Size of pooling bins: h/H' x w/W'. Max pooling is applied within each bin.]
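The bin arithmetic of this slide can be sketched in a few lines. This is a minimal single-channel illustration, not the Fast R-CNN implementation: the function name `roi_max_pool`, the `(top, left, h, w)` region format and the floor/ceil rounding of bin borders are all assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region of one feature-map channel into out_h x out_w bins.

    feature_map: list of rows (a single channel, for simplicity).
    roi: (top, left, h, w) in feature-map coordinates (hypothetical format).
    Each bin spans roughly h/out_h rows and w/out_w columns.
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Floor for the start and ceil for the end, so no bin is ever empty.
        r0 = top + (i * h) // out_h
        r1 = top + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A toy 4 x 6 feature map whose value at (r, c) is 6 * r + c.
toy_map = [[6 * r + c for c in range(6)] for r in range(4)]
whole_map_pooled = roi_max_pool(toy_map, (0, 0, 4, 6), 2, 2)
```

Whatever the size of the region, the output always has out_h x out_w values, which is what lets regions of any shape feed the fixed-size FC layers.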
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
Multi-task loss
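The slide only names the multi-task loss. As a hedged sketch, the form given in the Fast R-CNN paper combines a log loss over classes with a smooth L1 loss over the box offsets, and applies the box term only to non-background regions. The function names, the convention that class 0 is background and the default weight `lam=1.0` are assumptions of this example.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers than L2)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus, for non-background RoIs, a box regression loss.

    class_probs: softmax output over classes, index 0 assumed to be background.
    box_pred, box_target: 4 box offsets each, for the true class.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background regions have no ground-truth box, so no localisation term.
        return cls_loss
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(box_pred, box_target))
    return cls_loss + lam * loc_loss
```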
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Slide credit Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)
Faster R-CNN replaces the external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN architecture. The conv layers (up to Conv Layer 5) feed the RPN, which outputs the RPN proposals; an RoI pooling layer and FC layers then produce the class scores (class probabilities).]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
For each of k reference boxes (anchors), the RPN outputs objectness scores (object / no object) and a bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
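To make k = 9 concrete, the sketch below enumerates 3 scales x 3 aspect ratios around one location. The default scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are the ones I recall from the Faster R-CNN paper and should be read as assumptions, as should the function name and the corner-based box format.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box has area scale**2 and height/width equal to the ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors_at_origin = make_anchors(0.0, 0.0)
```

The RPN then predicts, for every anchor at every feature-map location, an objectness score and four box offsets relative to that anchor.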
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: the Faster R-CNN architecture (conv layers, RPN, RPN proposals, RoI pooling layer, FC layers, class scores), with the Fast R-CNN part labelled.]
4-step training to share features between the RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC).]
Detection Objects Reinforcement L
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Miriam Bellver]
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of states S: (o, h)
o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R (computed against the ground-truth bounding box)
Reward function R for the trigger action: reward magnitude 3, minimum IoU 0.6
The reward function considers the number of steps as a cost
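The equations on these two slides did not survive as text, so the following is only a sketch of the reward scheme as I understand it from the cited paper: a transformation action is rewarded by the sign of the change in IoU with the ground-truth box, and the trigger is rewarded with +3 if the final IoU reaches 0.6 and -3 otherwise. The function names and the choice to treat "no change" as a negative reward are assumptions.

```python
def transformation_reward(iou_before, iou_after):
    """+1 if the move increased the overlap with the ground-truth box, else -1.

    Treating an unchanged IoU as -1 is an assumption of this sketch.
    """
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou, eta=3.0, tau=0.6):
    """Reward for ending the search: +eta if the final box is good enough, else -eta.

    eta = 3 and tau = 0.6 are the values visible on the slide.
    """
    return eta if iou >= tau else -eta
```

Because every extra step can earn at most +1 and future rewards are discounted by Q-learning, long searches are implicitly penalised, which is one reading of the slide's remark about steps as a cost.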
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
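A minimal sketch of the three ingredients named on this slide: a greedy policy over the Q-network outputs, the standard Q-learning target, and a replay memory. The Q-network itself is left out; `q_values` stands for its output vector. The names, the discount `gamma=0.9` and the memory size are assumptions of this example, not values taken from the slides.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Policy: pick the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Regression target for Q(s, a): r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Sampling at random breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```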
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015) [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks
Detection Faces DDFD Train
Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
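"3 times per octave" means that three successive rescalings double (or halve) the image size, so consecutive scale factors differ by 2^(1/3). A small sketch of the resulting scale list follows; the function name and the scale range are hypothetical, since the slides do not state which range was used.

```python
import math


def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors 2**(i / per_octave) that fall inside [min_scale, max_scale]."""
    # The small epsilon guards against floating-point error at the range ends.
    first = math.ceil(per_octave * math.log2(min_scale) - 1e-9)
    last = math.floor(per_octave * math.log2(max_scale) + 1e-9)
    return [2.0 ** (i / per_octave) for i in range(first, last + 1)]


# One octave down and one octave up around the original size.
example_scales = pyramid_scales(0.5, 2.0)
```

The detector is run once per rescaled image, so a face of any size ends up close to the fixed window size at some level of the pyramid.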
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source James Hays “Object Category Detection: Sliding Windows” (Brown University 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
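Once the fully-connected layers are convolutional, one forward pass is equivalent to sliding the 227x227 window over the image with a fixed stride, and each window yields one heat-map value. The sketch below only does the size arithmetic. The stride of 32 pixels is an assumption (it is the effective stride usually quoted for an AlexNet-style network), and the function name is hypothetical.

```python
def heatmap_size(height, width, window=227, stride=32):
    """Number of window positions (rows, cols), i.e. the heat-map resolution."""
    if height < window or width < window:
        return (0, 0)  # the image is too small for even one window
    return ((height - window) // stride + 1, (width - window) // stride + 1)
```

For example, under these assumptions an image of 259x291 pixels gives a 2x3 heat-map, while a 227x227 image gives a single score.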
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch 2014)
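A minimal greedy NMS sketch: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. This is the textbook procedure rather than the code of the cited tutorial or of DDFD; the `(x1, y1, x2, y2)` box format and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```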
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
The embedding is learned by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41-48. ACM, 2009
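The triplets mentioned above are (anchor, positive, negative) face embeddings: the loss asks the anchor to be closer to a face of the same person than to a face of a different person by some margin. A minimal sketch of that loss for one triplet follows. The margin of 0.2 is the value I recall from the FaceNet paper and is an assumption here, as are the function names.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin, 0.0)
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters for training.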
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016
Objects Recognition Retrieval
[Figure: a visual query for “A dog” against an image database, with the expected outcome.]
Objects Recognition Retrieval
[Figure: a visual query for “This dog” against an image database, with the expected outcome.]
Objects Recognition Retrieval
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (columns: word, Image ID): 1 1 12 2 1 30 1023 10 124 23 6 10
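Because a bag-of-visual-words vector is highly sparse, it can be stored as an inverted file: for each visual word, the list of images that contain it. The sketch below builds such an index and ranks images by the number of visual words shared with a query. The image IDs, word IDs and the vote-counting score are made up for illustration; real systems typically weight the votes (e.g. with tf-idf).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict mapping image ID -> iterable of visual word IDs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def rank_candidates(index, query_words):
    """Images sharing at least one word with the query, most shared words first."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Hypothetical toy database: four images described by a few visual words each.
toy_index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]})
```

Only the images listed under the query's words are ever touched, which is what makes the search scale to large databases.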
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105)
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
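Sum or max pooling turns a stack of C convolutional feature maps into a single C-dimensional image descriptor by collapsing each map over its spatial positions. A minimal sketch follows, with hypothetical names and an optional L2 normalisation, which is a common step before comparing descriptors but is an assumption here.

```python
import math


def pool_conv_features(feature_maps, mode="sum", normalize=True):
    """feature_maps: list of C channels, each an H x W list of lists.

    Returns one value per channel (sum or max over all spatial positions).
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    vector = [float(reduce_fn(v for row in channel for v in row))
              for channel in feature_maps]
    if normalize:
        norm = math.sqrt(sum(v * v for v in vector))
        if norm > 0:
            vector = [v / norm for v in vector]
    return vector
```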
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Objects Recognition Retrieval
[Figure: image at (336x256) resolution, conv5_1 from VGG16 [1] giving a (42x32) feature map, 25K centroids, 25K-D vector.]
Objects Recognition Retrieval
Query representation: Global Search (GS) and Local Search (LS)
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
The same parameters are used in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network
Solution 2: Superpixels + CRF. Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example: C2 will be labelled with the class of C5
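The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost S. The tree, the component names and the cost values below are hypothetical; they are chosen only so that C2 ends up covered by C5, as in the slide's example. Breaking ties in favour of the smaller component is also an assumption.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the minimal-cost component on its path to the root.

    parent: dict mapping each node to its parent (the root is absent or maps to None).
    cost: dict mapping each node to its cost S.
    """
    cover = {}
    for leaf in leaves:
        best = node = leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent.get(node)
        cover[leaf] = best
    return cover


# Hypothetical tree: C7 is the root; C5 = {C1, C2}; C6 = {C3, C4}.
toy_parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
toy_cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.5, "C5": 0.3, "C6": 0.4, "C7": 0.9}
toy_cover = optimal_cover(toy_parent, toy_cost, ["C1", "C2", "C3", "C4"])
```

Each leaf is then labelled with the class predicted for its covering component, so different parts of the image can be explained at different levels of the tree.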
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image), and the two are concatenated as [1 2]. The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN, concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined
Vector C: training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer)
The output of the penultimate fully connected layer is fed to an SVM
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
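Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on small label maps follows; the function name and the choice to return None when the category appears in neither map are assumptions.

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index between two label maps (lists of rows) for one category."""
    intersection = union = 0
    for pred_row, truth_row in zip(predicted, ground_truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            intersection += int(in_pred and in_truth)
            union += int(in_pred or in_truth)
    # Undefined when the category is absent from both maps.
    return intersection / union if union else None
```

Averaging this value over categories gives a single score for a semantic segmentation.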
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Proposals DeepBox Results
13
Increasing not only Detection capabilities of known classes but also of unknown ones (suitable for Object Discovery)
Slides credit Marc Bolantildeos
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
14
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects
15
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects
16
DPM (HOG features)[1] R-CNN [2] SPPnet [3]
Hand-crafted features Deep features
+60
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects
17
Girshick Ross Forrest Iandola Trevor Darrell and Jitendra Malik Deformable Part Models are Convolutional Neural Networks CVPR 2015
Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects R-CNN
18
Girshick R Donahue J Darrell T amp Malik J Rich feature hierarchies for accurate object detection and semantic segmentation CVPR 2014
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects R-CNN
19
Slide credit Joost van de Weijer
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects R-CNN
20
Slide credit Joost van de Weijer
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects R-CNN
21
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
22
Girshick Ross Fast R-CNN ICCV 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
23
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
24
Slide credit Amaia Salvador
Same as SPP[3] but single scale
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
25
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015
Slide credit Joost van de Weijer
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
H
h
w
h
w
Size of pooling binsh Hrsquo x w Wrsquo
wWrsquo
hHrsquomax pooling
CONV5
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
29
Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Replace the use of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN pipeline. Conv layers -> Conv Layer 5 -> RPN -> RPN proposals -> RoI pooling layer -> FC layers -> class scores / class probabilities]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
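The k = 9 reference boxes (anchors) at one sliding position can be sketched as follows. The concrete scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumed from the Faster R-CNN paper, and `make_anchors` is an illustrative name, not part of any library.

```python
import itertools
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchors centred at (cx, cy).

    Each anchor is (x0, y0, x1, y1). For scale s and ratio r = height / width
    the anchor keeps an area of s * s.
    """
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w = s / math.sqrt(r)
        h = s * math.sqrt(r)
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

With k anchors per position, the RPN outputs 2k objectness scores and 4k box-regression values at every location of the feature map.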
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Tables: Detection accuracy (Pascal VOC); Timing in ms (Pascal VOC)]
Detection Objects Reinforcement L
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement L
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of states S
State = (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
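A sketch of how such a state vector could be assembled. The split of the 90 history dimensions into 10 past actions times 9 possible actions is an assumption taken from the cited paper, and `build_state` is a hypothetical helper.

```python
import numpy as np

NUM_ACTIONS = 9    # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumption: 10 past actions, so 9 * 10 = 90 dimensions

def build_state(o, past_actions):
    """Concatenate the 4096-D CNN feature o with a one-hot action history h."""
    h = np.zeros(HISTORY_LEN * NUM_ACTIONS)
    # Only the most recent HISTORY_LEN actions are kept, one 9-D slot each.
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        h[slot * NUM_ACTIONS + action] = 1.0
    return np.concatenate([o, h])
```

The resulting state has 4096 + 90 = 4186 dimensions.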
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R
[Figure: current box compared against the ground-truth bounding box]
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Trigger reward: 3
Minimum IoU: 0.6
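The two rewards can be sketched as follows, assuming the formulation of the cited paper: a movement is rewarded by the sign of the change in IoU with the ground truth, and the trigger earns plus or minus 3 depending on whether the final IoU reaches 0.6. Function names are illustrative, and treating a zero change as a negative reward is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0

def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth by at least tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta
```

Because every non-terminal step only earns plus or minus 1, long searches accumulate little reward, which is one way to read the slide's remark that the number of steps acts as a cost.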
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
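A minimal sketch of the three ingredients named here: the greedy policy, the Q-learning target, and a replay memory. The discount factor, the memory size and all names are assumptions for illustration; the Q-network itself is left out and represented only by its vector of output values.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.9                           # assumption: discount factor
replay_memory = deque(maxlen=1000)    # assumption: fixed-size experience buffer

def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))

def q_target(reward, next_q_values, terminal):
    """Q-learning target: r for terminal steps, r + gamma * max_a' Q(s', a') otherwise."""
    if terminal:
        return reward
    return reward + GAMMA * float(np.max(next_q_values))

def remember(state, action, reward, next_state, terminal):
    """Store one experience tuple for later training."""
    replay_memory.append((state, action, reward, next_state, terminal))

def sample_batch(batch_size):
    """Draw a random mini-batch of stored experiences."""
    return random.sample(list(replay_memory), min(batch_size, len(replay_memory)))
```

Sampling past experiences at random breaks the strong correlation between consecutive steps of one search sequence.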
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Detection Faces DDFD Train
Sub-windows (blocks) are sampled randomly. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
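The two test-time steps above, rescaling 3 times per octave and sliding a 227x227 window, can be sketched as follows. The number of octaves and the stride of 32 pixels are assumptions chosen for illustration; both function names are hypothetical.

```python
def pyramid_scales(octaves_down, octaves_up, per_octave=3):
    """Scale factors 2^(i / per_octave), i.e. per_octave rescalings for each doubling."""
    first = -octaves_down * per_octave
    last = octaves_up * per_octave
    return [2.0 ** (i / per_octave) for i in range(first, last + 1)]

def sliding_windows(height, width, window=227, stride=32):
    """Yield (x0, y0, x1, y1) for every window that fits inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)
```

A face of fixed pixel size in the window corresponds to a different face size in the original image at every scale, so the pyramid is what lets one fixed-size classifier detect faces of many sizes.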
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
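The equivalence behind this conversion can be checked numerically. The sketch below uses toy sizes and random weights (all assumed for illustration): a fully-connected layer defined on a C x K x K window is reshaped into N convolution filters of size C x K x K, and sliding them over a larger image yields a map of N scores per location.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, N = 3, 4, 5                          # toy sizes: channels, window side, FC units
W_fc = rng.standard_normal((N, C * K * K)) # fully-connected weights
b = rng.standard_normal(N)                 # fully-connected biases

def fc(window):
    """Original fully-connected layer on one (C, K, K) window."""
    return W_fc @ window.reshape(-1) + b

def fc_as_conv(image):
    """Same weights applied as a K x K convolution over a (C, H, W) image."""
    W_conv = W_fc.reshape(N, C, K, K)
    H, W = image.shape[1:]
    out = np.empty((N, H - K + 1, W - K + 1))
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            patch = image[:, y:y + K, x:x + K]
            out[:, y, x] = np.tensordot(W_conv, patch, axes=3) + b
    return out
```

Each position of the output equals what the original layer would produce on the corresponding window, so the output map of the face class can be read as a heat-map. A real framework would compute the convolution in one call instead of the explicit loops used here.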
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
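A minimal greedy NMS sketch in NumPy. It is not the code from the cited source: the IoU threshold of 0.5 is an assumed default and boxes are assumed to have positive area.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (n, 4) as (x0, y0, x1, y1). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]       # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Overlap between the best remaining box and all the others.
        iw = np.clip(np.minimum(boxes[i, 2], boxes[rest, 2])
                     - np.maximum(boxes[i, 0], boxes[rest, 0]), 0, None)
        ih = np.clip(np.minimum(boxes[i, 3], boxes[rest, 3])
                     - np.maximum(boxes[i, 1], boxes[rest, 1]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap too much
    return keep
```

The highest-scoring detection is always kept, and any other detection that overlaps it beyond the threshold is discarded before the next round.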
Detection Faces DDFD Results
[Figure: Precision vs Recall curves]
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
[Figure: Faces -> FaceNet -> Euclidean space where distances correspond to face similarity]
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
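The triplet objective used to learn the embedding can be sketched as follows. The formula follows the FaceNet paper (squared Euclidean distances and a hinge with margin alpha); the default margin of 0.2 and the function name are assumptions, and the triplet selection itself is not shown.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum over triplets of max(||a - p||^2 - ||a - n||^2 + margin, 0).

    anchor, positive, negative: arrays of shape (batch, dim) holding the
    embeddings of a face, another face of the same identity, and a face
    of a different identity.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).sum())
```

A triplet contributes nothing once the negative is farther from the anchor than the positive by at least the margin, which is why choosing informative triplets matters for training.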
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
[Figure: Visual query "A dog" -> image database -> expected outcome]
Objects Recognition Retrieval
[Figure: Visual query "This dog" -> image database -> expected outcome]
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
[Figure: INVERTED FILE table listing, for each visual word, the IDs of the images that contain it]
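A toy sketch of an inverted file for bag-of-visual-words retrieval. The image IDs and word IDs in the usage note are hypothetical, and ranking by the number of shared words is a simplifying assumption (real systems usually weight words, for example with tf-idf).

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it.

    image_words: dict {image_id: iterable of visual word IDs}.
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def query(index, query_words):
    """Rank database images by how many query words they share."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()
```

For example, with `index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2]})`, the call `query(index, [1, 2])` ranks image 1 first because it shares both words. Since the representation is highly sparse, only the few lists of the query's words are visited instead of the whole database.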
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
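Sum or max pooling of a convolutional feature map into one global descriptor can be sketched as below. The final L2 normalization and the function name are assumptions for illustration; the cited papers add further steps that are not shown.

```python
import numpy as np

def global_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) conv feature map over all spatial positions.

    Returns an L2-normalized C-dimensional vector (one value per channel).
    """
    flat = conv_features.reshape(conv_features.shape[0], -1)
    pooled = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```

The descriptor length depends only on the number of channels, so images of different sizes yield vectors that can be compared directly.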
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
[Figure: image at 336x256 resolution -> conv5_1 from VGG16 [1] (42x32 feature map) -> 25K centroids -> 25K-D vector]
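The encoding step in this pipeline can be sketched as a nearest-centroid assignment followed by a histogram. The sketch uses brute-force distances, which is only practical for toy sizes; with 25K centroids an approximate nearest-neighbour search would be the more realistic choice. The function name is hypothetical.

```python
import numpy as np

def bow_encode(local_features, centroids):
    """Assign each local conv feature to its nearest centroid (visual word).

    local_features: (n, d), one row per spatial location of the feature map.
    centroids: (k, d), the visual vocabulary.
    Returns the word index per location and the k-dimensional histogram.
    """
    diffs = local_features[:, None, :] - centroids[None, :, :]
    words = (diffs ** 2).sum(axis=2).argmin(axis=1)
    return words, np.bincount(words, minlength=len(centroids))
```

Keeping the per-location word indices, and not only the histogram, preserves where each visual word occurs in the image.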
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
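The last two steps, upsampling and concatenation followed by a pixel-wise soft-max, can be sketched as follows. Nearest-neighbour upsampling, the toy map sizes and a single linear classifier are simplifying assumptions; all names are illustrative.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Upsample a (C, H, W) map by an integer factor with nearest neighbour."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def multiscale_features(maps, factors):
    """Bring every scale to the same resolution and stack along channels."""
    return np.concatenate(
        [upsample_nearest(m, f) for m, f in zip(maps, factors)], axis=0)

def pixelwise_softmax(scores):
    """Soft-max over the class axis of a (num_classes, H, W) score map."""
    z = scores - scores.max(axis=0, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```

Given stacked features `feats` of shape (D, H, W) and a weight matrix `W` of shape (num_classes, D), `pixelwise_softmax(np.tensordot(W, feats, axes=1))` gives one class distribution per pixel.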
Objects Segmentation Farabet
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
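The selection rule for one pixel can be sketched as a walk from the leaf to the root of the segmentation tree. The tree layout, node names and costs in the usage note are hypothetical; how the cost S is computed for each component is not shown.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from leaf to root.

    parent: dict mapping each node to its parent (the root maps to None).
    cost: dict mapping each node to its cost S.
    """
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)
    return best
```

For example, with `parent = {"leaf": "A", "A": "B", "B": "root", "root": None}` and costs `{"leaf": 0.9, "A": 0.6, "B": 0.2, "root": 0.7}`, the leaf is assigned to component "B", so it inherits the class predicted for "B".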
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; both are concatenated as [1 2]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit Eduard Fontdevila
[Figure: penultimate fully connected layer -> SVM]
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
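The pixel IU (Jaccard index) can be sketched as follows, averaged over the classes that appear in either the prediction or the ground truth. Skipping absent classes is an assumption; benchmark toolkits may treat them differently.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Mean over classes of |pred == c AND gt == c| / |pred == c OR gt == c|.

    pred, gt: integer label maps of the same shape.
    """
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue          # class absent from both maps
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```

For a 2x2 example with `pred = [[0, 0], [1, 1]]` and `gt = [[0, 1], [1, 1]]`, class 0 scores 1/2 and class 1 scores 2/3, giving a mean of 7/12.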
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015
Slide credit Joost van de Weijer
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
H
h
w
h
w
Size of pooling binsh Hrsquo x w Wrsquo
wWrsquo
hHrsquomax pooling
CONV5
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
29
Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
[Diagram: conv layers → Conv Layer 5 → RPN → RPN proposals]
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
Weights from Step 2 (fixed)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
[Diagram: conv layers → Conv Layer 5, combined with the RPN proposals (learned in Step 3) → class probabilities]
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
Weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Detection Objects Reinforcement Learning
46
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
47
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
48
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
49
Slide credit Míriam Bellver
Set of actions A
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
50
Slide credit Míriam Bellver
Set of states S
(o, h)
o = feature vector from the pre-trained CNN (fc6, 4096-dim)
h = history of taken actions (binary vector, 90-dim)
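A small sketch of how the state (o, h) could be assembled. Splitting the 90 history dimensions into 10 past actions times 9 possible actions (one-hot each) is an assumption consistent with the vector size; the function and constant names are hypothetical.

```python
import numpy as np

NUM_ACTIONS = 9    # assumed: 8 box transformations + 1 trigger
HISTORY_LEN = 10   # assumed: 10 past actions, so 10 * 9 = 90 dimensions

def build_state(features, past_actions):
    """Concatenate CNN features o with the one-hot action history h."""
    history = np.zeros((HISTORY_LEN, NUM_ACTIONS), dtype=np.float32)
    recent = past_actions[-HISTORY_LEN:]      # keep only the latest actions
    for slot, action in enumerate(recent):
        history[slot, action] = 1.0
    return np.concatenate([features, history.ravel()])

# Dummy fc6 features and three past actions, for illustration only.
state = build_state(np.zeros(4096, dtype=np.float32), [0, 3, 8])
print(state.shape)
```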
Detection Objects Reinforcement
51
Slide credit Míriam Bellver
Reward Function R
[Figure label: ground-truth bounding box]
Detection Objects Reinforcement
52
Slide credit Míriam Bellver
Reward Function R for trigger action
The reward function considers the number of steps as a cost
Trigger reward magnitude: 3
Minimum IoU: 0.6
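The two rewards can be sketched as follows. This assumes the transformation reward is +1 when the IoU with the ground truth improves and -1 otherwise, and that the trigger reward is +3 or -3 depending on the minimum IoU of 0.6 shown on the slide; all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def transform_reward(old_box, new_box, gt_box):
    """+1 if the transformation increased the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0

def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps enough with the ground truth, else -eta."""
    return eta if iou(box, gt_box) >= min_iou else -eta

gt = (0, 0, 10, 10)
print(transform_reward((0, 0, 20, 20), (0, 0, 12, 12), gt))
print(trigger_reward((0, 0, 10, 10), gt))
```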
Detection Objects Reinforcement
53
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
54
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
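A minimal sketch of the greedy policy and of the standard Q-learning target that such a network would be trained towards. The discount factor gamma is an assumed value, and the Q-values here are plain arrays standing in for the network outputs.

```python
import numpy as np

def greedy_action(q_values):
    """Pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    if terminal:              # e.g. after the trigger action
        return reward
    return reward + gamma * float(np.max(next_q_values))

print(greedy_action(np.array([0.1, 2.0, -1.0])))
print(q_target(1.0, np.array([0.5, 2.0]), gamma=0.5))
```

In a replay-memory setup, tuples (state, action, reward, next state) would be stored and sampled in mini-batches to compute these targets.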
Detection Objects Reinforcement
55
Slide credit Míriam Bellver
Detection Objects Reinforcement
56
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
57
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement
58
Slide credit Míriam Bellver
Detection Objects Reinforcement
59
Slide credit Míriam Bellver
Detection Faces
60
Detection Faces DDFD
61
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
62
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.
Detection Faces DDFD Train
63
Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
64
Test images are rescaled up/down 3 times per octave to find faces of different sizes
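As a sketch, "3 times per octave" means consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. The scale range below (0.5 to 2.0) is an assumption for illustration; the function name is hypothetical.

```python
import math

def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors from min_scale to max_scale, per_octave steps per doubling."""
    octaves = math.log2(max_scale / min_scale)
    steps = int(math.floor(octaves * per_octave + 1e-9))
    return [min_scale * 2.0 ** (i / per_octave) for i in range(steps + 1)]

scales = pyramid_scales(0.5, 2.0)
print([round(s, 3) for s in scales])
```

The fixed-size detector is then run on each rescaled copy, so a face of any size matches the detector window at some scale.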
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
Detection Faces DDFD Test
67
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
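A toy NumPy sketch of why this works: a fully-connected layer that expects a k x k x C input is the same operation as a k x k convolution, so applying it at every position of a larger feature map yields a map of outputs (the heat-map). The shapes and names below are illustrative assumptions, not the DDFD configuration.

```python
import numpy as np

def fc_as_conv(feature_map, fc_weights, fc_bias, k):
    """Slide an FC layer over a (H, W, C) feature map as a k x k convolution.

    fc_weights has shape (n_out, k * k * C). Returns (H - k + 1, W - k + 1, n_out).
    """
    H, W, _ = feature_map.shape
    out = np.empty((H - k + 1, W - k + 1, fc_weights.shape[0]))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = feature_map[i:i + k, j:j + k, :].ravel()
            out[i, j] = fc_weights @ window + fc_bias
    return out

rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 3 * 3 * 4))
bias = rng.normal(size=2)
small = rng.normal(size=(3, 3, 4))   # the size the FC layer was designed for
big = rng.normal(size=(5, 6, 4))     # a larger input

single = fc_as_conv(small, weights, bias, k=3)   # one output location
heat = fc_as_conv(big, weights, bias, k=3)       # a grid of outputs
print(single.shape, heat.shape)
```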
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
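A minimal greedy NMS sketch in NumPy, not the referenced implementation: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. The IoU threshold of 0.5 is an assumed value, and boxes are assumed to have positive area.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. Boxes are rows of (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # best score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Overlap between the best box and every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]    # drop heavily overlapping boxes
    return keep

kept = nms([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], [0.9, 0.8, 0.7])
print(kept)
```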
Detection Faces DDFD Results
69
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
71
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
72
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
73
[Diagram: faces → FaceNet → Euclidean space where distances correspond to face similarity]
Faces Recognition FaceNet
74
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
75
by means of well chosen triplets using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41-48. ACM, 2009.
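The slide does not give the loss formula. A common triplet formulation, sketched here under that assumption, is a hinge on squared Euclidean distances: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The margin of 0.2 is an assumed value.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared Euclidean distances between embeddings."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
same = np.array([1.0, 0.0])
other = np.array([0.0, 1.0])
easy = triplet_loss(a, same, other)   # already satisfied, no loss
hard = triplet_loss(a, other, same)   # violated, positive loss
print(easy, hard)
```

"Well chosen" triplets are those that still produce a non-zero loss; triplets like `easy` above contribute no gradient.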
Faces Recognition FaceNet
76
Faces Recognition FaceNet
77
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
78
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet
79
Faces Recognition FaceNet Test
80
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
81
Software implementation: OpenFace
Faces Recognition VGG Face
82
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]
Objects Recognition Retrieval
83
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.
Objects Recognition Retrieval
84
Image Database
Visual Query
“A dog”
Expected outcome
Objects Recognition Retrieval
85
Image Database
Visual Query
“This dog”
Expected outcome
Objects Recognition Retrieval
86
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
87
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features (e.g. SIFT)
Bag of Visual Words
N-dimensional feature space
High-dimensional, highly sparse
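Because bag-of-visual-words vectors are high-dimensional and sparse, they are usually stored in an inverted file that maps each visual word to the images containing it. A minimal sketch with toy data; the ranking by number of shared words is a simplification (real systems typically use tf-idf weighting), and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def query(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))

# Toy database: image ID -> visual words assigned to its local features.
index = build_inverted_file({1: [1, 2, 3], 2: [3, 4], 3: [5]})
ranking = query(index, [3, 4])
print(ranking)
```

Only the images listed under the query's words are ever touched, which is what makes the search scalable.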
Objects Recognition Retrieval
88
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
Objects Recognition Retrieval
89
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
90
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks
sum/max pooled conv features as global representation
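A minimal sketch of sum and max pooling of a convolutional feature map into one global descriptor. The L2 normalisation at the end is an assumption (a common choice for retrieval), and the function name is hypothetical.

```python
import numpy as np

def pool_conv_features(conv, mode="sum"):
    """Aggregate a (H, W, C) conv feature map into an L2-normalised C-dim vector."""
    flat = conv.reshape(-1, conv.shape[-1])   # one row per spatial location
    desc = flat.sum(axis=0) if mode == "sum" else flat.max(axis=0)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

conv = np.arange(24, dtype=float).reshape(2, 3, 4)   # toy feature map
sum_desc = pool_conv_features(conv, "sum")
max_desc = pool_conv_features(conv, "max")
print(sum_desc.shape, max_desc.shape)
```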
Objects Recognition Retrieval
91
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Objects Recognition Retrieval
92
Objects Recognition Retrieval
93
Resolution (336x256)
conv5_1 from VGG16 [1] (42x32)
25K centroids → 25K-D vector
Objects Recognition Retrieval
94
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
95
One lecture organized in four parts
96
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
98
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Objects Segmentation Farabet
100
The same parameters are used in the three convnets:
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
101
Upsampling and concatenation
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
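A minimal sketch of a pixel-wise soft-max: the class scores at every pixel are normalised independently into a probability distribution, and the label is the most probable class. The array shapes are assumptions for illustration.

```python
import numpy as np

def pixelwise_softmax(scores):
    """Soft-max over the last axis of a (H, W, num_classes) score map."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 5, 3))   # toy score map
probs = pixelwise_softmax(scores)
labels = probs.argmax(axis=-1)                              # one label per pixel
print(probs.shape, labels.shape)
```

Since each pixel is classified on its own, nothing forces neighbouring pixels to agree.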
Objects Segmentation Farabet
103
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
106
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
107
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
108
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
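The selection rule can be sketched on a toy segmentation tree: for each leaf, walk up to the root and keep the node with the lowest cost. The tree, the node names and the cost values below are hypothetical and are not the ones in the slide's figure.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)   # None once the root has been visited
        best[leaf] = choice
    return best

# Toy tree: leaves a and b merge into n1; n1 and leaf c merge into the root.
parent = {"a": "n1", "b": "n1", "n1": "root", "c": "root", "root": None}
cost = {"a": 0.9, "b": 0.2, "n1": 0.5, "c": 0.7, "root": 0.6}
cover = optimal_cover(parent, cost, ["a", "b", "c"])
print(cover)
```

Each pixel then takes the class predicted for its selected component.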
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
Vector A
[Diagram: the bounding box goes through the BBOX CNN to give feature vector 1; the region, with the background masked out with the mean image, goes through the same BBOX CNN to give feature vector 2; both are concatenated as [1 2]]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Vector B
Training: 2 networks trained in isolation
Testing: results are combined
[Diagram: BBOX CNN gives feature vector 1; REGION CNN gives feature vector 2; both are concatenated as [1 2]]
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
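A minimal sketch of the pixel IU (Jaccard index) averaged over classes, computed from a predicted and a ground-truth label map. Skipping classes absent from both maps is an assumption; benchmark protocols may differ in such details.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Mean over classes of |pred ∩ gt| / |pred ∪ gt|, computed on pixels."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])     # toy predicted labels
gt = np.array([[0, 1], [1, 1]])       # toy ground-truth labels
score = pixel_iu(pred, gt, num_classes=2)
print(round(score, 4))
```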
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
One lecture organized in four parts
118
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
119
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Proposals DeepBox Results
13
Increasing not only Detection capabilities of known classes but also of unknown ones (suitable for Object Discovery)
Slides credit Marc Bolaños
One lecture organized in four parts
14
Local analysis for: Proposals, Detection, Recognition, Segmentation
Detection Objects
15
Detection Objects
16
DPM (HOG features)[1] R-CNN [2] SPPnet [3]
Hand-crafted features Deep features
+60
Slide credit Amaia Salvador
Detection Objects
17
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs)
Detection Objects R-CNN
18
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Detection Objects R-CNN
19
Slide credit Joost van de Weijer
Detection Objects R-CNN
20
Slide credit Joost van de Weijer
Detection Objects R-CNN
21
Detection Objects Fast R-CNN
22
Girshick, Ross. Fast R-CNN. ICCV 2015.
Detection Objects Fast R-CNN
23
Slide credit Amaia Salvador
Detection Objects Fast R-CNN
24
Slide credit Amaia Salvador
Same as SPP[3] but single scale
Detection Objects Fast R-CNN
25
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
[Diagram: a region of interest of size h x w on the CONV5 feature map is max-pooled into a fixed grid of H' x W' bins]
Size of pooling bins: h/H' x w/W'
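A simplified sketch of RoI max pooling: the h x w region is split into a fixed grid of bins of roughly h/H' x w/W' cells, and each bin is max-pooled. The bin rounding used here is an assumption; implementations differ in how they handle fractional bin boundaries. The region is assumed to lie inside the feature map.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of a (H, W, C) feature map into an (out_h, out_w, C) grid.

    roi is (y0, x0, h, w) in feature-map cells.
    """
    y0, x0, h, w = roi
    ys = np.linspace(y0, y0 + h, out_h + 1)   # bin edges along the height
    xs = np.linspace(x0, x0 + w, out_w + 1)   # bin edges along the width
    pooled = np.empty((out_h, out_w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_h):
        ya = int(np.floor(ys[i]))
        yb = max(int(np.ceil(ys[i + 1])), ya + 1)   # at least one cell per bin
        for j in range(out_w):
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            pooled[i, j] = feature_map[ya:yb, xa:xb].max(axis=(0, 1))
    return pooled

fmap = np.arange(16, dtype=float).reshape(4, 4, 1)   # toy feature map
pooled = roi_max_pool(fmap, roi=(0, 0, 4, 4), out_h=2, out_w=2)
print(pooled[..., 0])
```

Whatever the size of the region, the output has the fixed size the FC layers expect.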
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
Detection Objects Faster R-CNN
29
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Object proposal computation is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
[Diagram: conv layers → Conv Layer 5 → RPN → RPN proposals → RoI pooling layer → FC layers → class scores → class probabilities]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Reward function R (figure labels: ground truth, bounding box)
Reward function R for the trigger action: reward magnitude 3, with a minimum IoU of 0.6
The reward function considers the number of steps as a cost.
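A minimal sketch of the two rewards, assuming the form described by Caicedo and Lazebnik: a binary reward on whether a transformation improved the IoU with the ground truth, and plus or minus eta for the trigger depending on whether the final IoU reaches tau. The values eta = 3 and tau = 0.6 are read from the slide; the function names are hypothetical, and the IoU values are assumed to be computed elsewhere.

```python
def transformation_reward(iou_before, iou_after):
    """Assumed form: +1 if the action improved IoU with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """Assumed form: +eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if final_iou >= tau else -eta
```

Usage: `transformation_reward(0.30, 0.45)` rewards a move that tightened the box, while `trigger_reward(0.5)` penalizes stopping too early.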
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
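The pieces named above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the Q-network is replaced by a plain list of action values, the discount factor gamma is a free parameter, and `ReplayMemory` is a hypothetical helper name.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Policy: index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma, terminal=False):
    """One-step Q-learning target: r + gamma * max over next-state action values."""
    return reward if terminal else reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Sampling past experiences at random, rather than training on consecutive steps, is the usual motivation for a replay memory.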
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative.
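The labeling rule can be sketched as below. The box format (x1, y1, x2, y2) and the function names are assumptions made for the example; only the 50% IoU threshold comes from the slide.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """Return 1 (positive) if the window overlaps any annotated face with IoU > threshold."""
    return 1 if any(iou(window, face) > threshold for face in faces) else 0
```

With this rule most random sub-windows end up negative, which is consistent with the 200K to 20M imbalance reported above.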
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
A sliding window of 227x227 is applied over the test image.
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
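A small sketch of the two test-time ingredients. "3 times per octave" is read here as consecutive scales differing by a factor of 2^(1/3); the scale range and the window stride of 32 pixels are assumptions chosen for illustration.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scales spaced by a factor of 2**(1/per_octave) covering [min_scale, max_scale]."""
    scales = []
    k = 0
    while True:
        s = min_scale * 2 ** (k / per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for floating-point rounding
            break
        scales.append(s)
        k += 1
    return scales


def window_positions(width, height, win=227, stride=32):
    """Top-left corners of all win x win windows that fit inside the image."""
    return [(x, y)
            for y in range(0, height - win + 1, stride)
            for x in range(0, width - win + 1, stride)]
```

Between scale 0.5 and scale 2.0 there are two octaves, so `pyramid_scales(0.5, 2.0)` yields seven levels.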
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
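A toy illustration of the conversion, assuming a single fully-connected unit that was trained on k x k inputs: reusing its weights as a k x k convolution over a larger map gives one score per window position, which is the heat-map. Names and sizes are hypothetical.

```python
def fc(window, weights, bias):
    """Fully-connected unit applied to a flattened k x k window."""
    flat = [v for row in window for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(feature_map, weights, bias, k):
    """Slide the same weights as a k x k convolution (stride 1, no padding)."""
    h, w = len(feature_map), len(feature_map[0])
    heat = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            window = [r[x:x + k] for r in feature_map[y:y + k]]
            row.append(fc(window, weights, bias))
        heat.append(row)
    return heat
```

On an input of exactly k x k the output is a single value equal to the original fully-connected response; on larger inputs the output grows with the image.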
Non-Maximum Suppression (NMS) is applied to avoid overlapped detections.
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
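A minimal greedy NMS sketch, not the implementation from the cited tutorial. Boxes are assumed to be (x1, y1, x2, y2) tuples and the IoU threshold of 0.3 is an illustrative value.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Keep the best-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

The function returns indices into `boxes`, ordered by decreasing score.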
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
The embedding is learned by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
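The slides do not spell out the loss applied to each triplet. A common formulation, and the one assumed in this sketch, is a hinge on squared Euclidean distances: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The margin value here is illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once d(anchor, positive) + margin <= d(anchor, negative)."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of triplets matters for training.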
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Visual query "A dog" against an image database: the expected outcome is images of dogs.
Visual query "This dog" against an image database: the expected outcome is images of that particular dog.
Instance Retrieval (instance: object, building, person, place…)
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
Inverted file: table mapping each visual word to the IDs of the images that contain it
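A sketch of an inverted file for bag-of-visual-words retrieval, assuming each image is already described by a list of visual word IDs. The function names and the toy data are hypothetical, and ranking by number of shared words is a simplification of the scoring used in real systems.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Images sharing at least one word with the query, most shared words first."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))
```

Because the representation is highly sparse, a query only touches the lists of the words it contains instead of scanning every image.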
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
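Sum or max pooling of convolutional features can be sketched as below: each channel of the feature map is reduced to one number, giving a descriptor with as many dimensions as channels. The nested-list layout and the L2 normalization step are assumptions made for the example.

```python
def global_pool(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of lists -> C-dim vector."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)
```

Unlike FC-layer descriptors, this pooling works for any input resolution because it discards the spatial layout.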
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Image resolution: 336x256
Local features: conv5_1 from VGG16 [1], feature map of 42x32
Bag of words: 25K centroids, 25K-D vector
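With these numbers, each image gives 42 x 32 = 1344 local features, and each one is assigned to its nearest centroid among the 25K, so the 25K-D vector has at most 1344 non-zero entries. A toy sketch of the assignment step, with a tiny hypothetical codebook instead of 25K centroids:

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((f - m) ** 2 for f, m in zip(feature, centroids[c])))


def bow_histogram(local_features, centroids):
    """Count how many local features are assigned to each visual word."""
    hist = [0] * len(centroids)
    for feature in local_features:
        hist[nearest_centroid(feature, centroids)] += 1
    return hist
```

The codebook itself is assumed to come from clustering (for example k-means) over features of the indexed images; that step is not shown.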
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters are used in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max.
Upsampling and concatenation
Pixel-wise soft-max classifier
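A pixel-wise soft-max classifier turns the per-class scores of every pixel into probabilities and picks the most likely class. A minimal sketch, assuming the scores are stored as `score_map[y][x] = [score per class]` (a layout chosen for the example):

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Return the most probable class index for every pixel."""
    labels = []
    for row in score_map:
        out = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            out.append(probs.index(max(probs)))
        labels.append(out)
    return labels
```

Each pixel is labeled independently of its neighbours in this scheme.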
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
C2 will be labelled with the class of C5.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
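The selection rule can be sketched on a small tree. The tree, the node names and the cost values below are hypothetical; they are not the example drawn on the slide.

```python
def optimal_component(leaf, parent, cost):
    """Walk from a leaf to the root and return the node with minimal cost on the path."""
    best = node = leaf
    while node in parent:  # the root has no entry in the parent map
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical segmentation tree: three leaves, two intermediate regions, one root.
parent = {"p1": "C1", "p2": "C1", "p3": "C2", "C1": "C3", "C2": "C3"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "C1": 0.2, "C2": 0.6, "C3": 0.4}
cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ("p1", "p2", "p3")}
```

Each pixel then takes the class predicted for its selected component, so different pixels can be explained at different levels of the tree.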
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; both are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: the BBOX CNN gives feature vector 1 and a separate REGION CNN gives feature vector 2; both are concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.
Vector C: training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).
The output of the penultimate fully connected layer is fed to an SVM.
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
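The pixel IU (Jaccard index) of one category is the number of pixels labeled with that category in both the prediction and the ground truth, divided by the number labeled with it in either. A minimal sketch, assuming label maps stored as equally sized lists of lists:

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else float("nan")
```

A benchmark score is usually the mean of this value over categories; the averaging is left out here.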
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Detection Objects
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
+60
Slide credit: Amaia Salvador
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.
Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Parts-based Models (DPMs).
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015.
Slide credit: Amaia Salvador
RoI pooling: same as SPP [3], but single scale
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.
Slide credit: Joost van de Weijer
RoI pooling over CONV5: an h x w region of interest is divided into H' x W' bins
Size of pooling bins: h/H' x w/W'
Max pooling within each bin
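A sketch of RoI max pooling on a single-channel feature map, written from the bin-size description above. The region format (y0, x0, h, w) and the rounding of bin borders (floor for the start, ceiling for the end) are assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (y0, x0, h, w) of a 2-D map into out_h x out_w bins."""
    y0, x0, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [i*h/out_h, (i+1)*h/out_h), rounded outwards to integers.
        y_start = y0 + (i * h) // out_h
        y_end = y0 + -(-((i + 1) * h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            x_start = x0 + (j * w) // out_w
            x_end = x0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(y_start, y_end)
                           for x in range(x_start, x_end)))
        pooled.append(row)
    return pooled
```

Whatever the size of the region, the output is always out_h x out_w, which is what lets the fully-connected layers that follow take regions of any shape.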
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
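The slide only names the loss. The sketch below assumes the form given in the Fast R-CNN paper: a log loss on the class plus a smooth L1 loss on the box regression, where the box term is only counted for non-background regions (class 0 is background here). The weight `lam` and the argument names are illustrative.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss, plus lam * localization loss for non-background RoIs."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:  # background: no box regression term
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss
```

Training both tasks with one loss is what allows the classifier and the box regressor to share the same network.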
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals with a Region Proposal Network (RPN).
Architecture (figure): conv layers up to Conv Layer 5, RPN producing RPN proposals, RoI pooling layer, FC layers, class scores and class probabilities.
RPN outputs: objectness scores (object / no object) and bounding box regression.
In practice, k = 9 (3 different scales and 3 aspect ratios).
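A sketch of how the k = 9 reference boxes could be generated at one position. The scale values (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumptions taken from the usual Faster R-CNN setup; the slide only states that there are 3 of each.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width
            w = s / r ** 0.5
            h = s * r ** 0.5  # keeps the area equal to s * s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

For every position of the feature map, the RPN predicts an objectness score and a box refinement for each of these k boxes.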
The detection part of the network (RoI pooling layer, FC layers, class scores) is Fast R-CNN.
4-step training to share features for RPN and Fast R-CNN
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
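"3 times per octave" means that three rescaling steps double (or halve) the image size, so consecutive scales differ by a factor of 2^(1/3). A small sketch of the resulting scale list; the function name and the scale range are hypothetical, only the three steps per octave come from the slide.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced so that `steps_per_octave` steps double the scale."""
    step = 2.0 ** (1.0 / steps_per_octave)
    scales, i = [], 0
    # Small tolerance so that max_scale itself is not lost to rounding.
    while min_scale * step ** i <= max_scale * (1 + 1e-9):
        scales.append(min_scale * step ** i)
        i += 1
    return scales
```

Calling `pyramid_scales(0.5, 2.0)` yields seven scales covering two octaves, from half size to double size.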
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
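A minimal sketch of the window enumeration, assuming windows must lie fully inside the image. The 227x227 size is from the slide; the stride of 32 pixels is a hypothetical value chosen for illustration.

```python
def sliding_windows(width, height, win=227, stride=32):
    """Yield (x1, y1, x2, y2) for every window position fully inside the image."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, x + win, y + win)
```

An image smaller than the window yields no positions, which is one reason the test image is also rescaled.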
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
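The idea can be illustrated with a 1-D analogue (an assumption made for brevity, not the actual network): a fully-connected unit is a dot product over a fixed-size input, and the same weights applied as a convolution give one such output per position of a longer input.

```python
def fc(weights, bias, x):
    """Fully-connected unit: dot product over a fixed-size input."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def conv1d_valid(weights, bias, signal):
    """The same weights slid over a longer input (no padding, stride 1)."""
    k = len(weights)
    return [fc(weights, bias, signal[i:i + k])
            for i in range(len(signal) - k + 1)]
```

On an input of exactly the training size the convolution returns the single fully-connected output; on a larger input it returns a map of outputs, one per window.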
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
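A greedy NMS can be sketched as below. This is a generic illustration rather than the code from the cited post; the 0.3 overlap threshold and the function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    Returns the indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

With two heavily overlapping detections and one distant detection, only the stronger of the overlapping pair and the distant one survive.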
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
...by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
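A triplet consists of an anchor face, a positive (same identity) and a negative (different identity). A minimal sketch of a triplet loss on embedding vectors follows; the margin value of 0.2 and the function names are assumptions for illustration, and triplet selection is not covered.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive
    by at least `margin` (in squared distance); positive otherwise."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

A triplet that already satisfies the margin contributes no loss, which is why the choice of informative triplets matters during training.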
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
[Figure: visual query "A dog", image database, expected outcome]
Objects Recognition Retrieval
[Figure: visual query "This dog", image database, expected outcome]
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
Inverted file: table that maps each visual word to the IDs of the images that contain it
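Because bag-of-visual-words vectors are highly sparse, an inverted file lets a query touch only the images that share at least one word with it. A minimal sketch, with hypothetical function names and toy image IDs:

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted IDs of the images that contain it.
    `image_words` is {image_id: [visual word IDs found in that image]}."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def candidates(inverted, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

A real system would weight the votes (for instance with tf-idf) instead of counting shared words; the plain count is kept here only to show the data structure.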
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max-pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
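Sum or max pooling collapses each channel of a convolutional layer into one number, giving a descriptor with as many dimensions as channels. A minimal sketch on nested lists; the function names are hypothetical and the cited methods add weighting or region schemes that are not shown here.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps (each an H x W list of rows) into a C-dim vector."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale the descriptor to unit length so that dot products act as cosine similarity."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)
```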
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids, 25K-D vector
Objects Recognition Retrieval
Query Representation:
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
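The selection rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree layout, node names and cost values in the usage note are hypothetical; they are chosen only so that the outcome matches the slide's example (C2 ends up covered by C5).

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost.
    `parent` maps a node to its parent (the root has no entry);
    `cost` maps every node to its cost S."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)
        best[leaf] = best_node
    return best
```

With a root C7 whose children are C5 (over leaves C1, C2) and C6 (over leaves C3, C4), and costs C1=0.2, C2=0.5, C3=0.1, C4=0.4, C5=0.3, C6=0.6, C7=0.9, the leaf C2 selects C5 while C1, C3 and C4 select themselves.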
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.
Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
[Diagram: BBOX CNN, feature vector 1; BBOX CNN on the region with background masked out with the mean image, feature vector 2; concatenation [1 2] gives vector A]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Training: 2 networks trained in isolation
Testing: results are combined
[Diagram: BBOX CNN, feature vector 1; REGION CNN, feature vector 2; concatenation [1 2] gives vector B]
Objects Segmentation SDS
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer), giving vector C
Objects Segmentation SDS
[Diagram: penultimate fully connected layer, SVM]
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
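Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on label maps stored as lists of rows; the function name is hypothetical.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")
```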
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects
DPM (HOG features) [1], R-CNN [2], SPPnet [3]
Hand-crafted features, deep features
+60
Slide credit: Amaia Salvador
Detection Objects
Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs)
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015.
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Same as SPP [3] but single scale
Detection Objects Fast R-CNN
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
RoI pooling on CONV5: a region of size h x w is divided into an H' x W' grid of pooling bins, each of size h/H' x w/W', and max pooling is applied inside each bin
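A minimal sketch of this pooling on a single 2-D feature map. The function name, the end-exclusive (x1, y1, x2, y2) region convention and the rounding of bin borders are assumptions; real implementations differ in how they quantize the bins.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of a 2-D map (list of rows) into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Bin i spans rows [r0, r1): floor of its start, ceiling of its end.
        r0 = y1 + (i * h) // out_h
        r1 = max(r0 + 1, y1 + ((i + 1) * h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = max(c0 + 1, x1 + ((j + 1) * w + out_w - 1) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the region, the output grid has a fixed size, which is what lets the fully-connected layers that follow accept regions of any shape.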
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Multi-task loss
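The slide's equation did not survive extraction. For reference, the multi-task loss as defined in the Fast R-CNN paper combines a classification term and a box-regression term; here p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the target offsets, and lambda a balancing weight.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad
L_{\mathrm{cls}}(p, u) = -\log p_{u},
\qquad
L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t^{u}_{i} - v_{i}\right)
```

The indicator [u >= 1] switches off the regression term for background regions, which have no ground-truth box.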
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Object proposal computation is the bottleneck in current state-of-the-art object detection systems
Proposal methods: Selective Search, CPMC, MCG
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
[Diagram: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores (class probabilities)]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
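The k = 9 reference boxes at one position can be generated as below. The specific scales (128, 256, 512 pixels) and aspect ratios (0.5, 1, 2) are assumptions used for illustration; the slide only states that there are 3 of each.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while setting height / width = r.
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

For each of these boxes the RPN predicts an objectness score and a set of box-regression offsets.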
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: the same pipeline, with the Fast R-CNN part labelled]
Detection Objects Faster R-CNN
4-step training to share features between the RPN and Fast R-CNN:
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Detection Objects Faster R-CNN
[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Detection Objects Reinforcement
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement
Reward function R
[Figure: ground-truth bounding box]
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Trigger reward: 3; minimum IoU: 0.6
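A sketch of how such rewards could be computed, under stated assumptions: a transformation action earns +1 if it increases the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 when the final IoU reaches 0.6 and -3 otherwise. The function names and the exact form of both rules are assumptions; only the values 3 and 0.6 appear on the slide.

```python
def move_reward(iou_before, iou_after):
    """+1 if a transformation action improved the overlap with the
    ground-truth box, -1 otherwise (assumed binary form)."""
    return 1 if iou_after > iou_before else -1


def trigger_reward(iou_final, eta=3.0, min_iou=0.6):
    """+eta if the search stops on a box that overlaps the ground truth
    enough, -eta otherwise."""
    return eta if iou_final >= min_iou else -eta
```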
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
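The greedy policy and the standard Q-learning training target can be sketched as follows. The list `q_values` stands for the network's outputs (one per action) for the current state; the discount factor 0.9 and the function names are assumptions, and the replay memory is not shown.

```python
def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Target for the taken action: the reward plus the discounted best
    value of the next state, or the reward alone if the episode ended."""
    return reward if terminal else reward + gamma * max(next_q_values)
```

During training, the network output for the taken action is regressed towards `q_target`; at test time only `greedy_action` is needed.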
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image Database, Visual Query: "A dog", Expected outcome
Image Database, Visual Query: "This dog", Expected outcome
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:
v1 = (v11, …, v1n)
…
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: table mapping each visual word to the IDs of the images that contain it
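A bag of visual words with an inverted file can be sketched as follows: each local descriptor is assigned to its nearest centroid (its visual word), and the inverted file stores, for every word, the images in which it appears. This is a minimal sketch; the 2-D descriptors, the three centroids and the image IDs are hypothetical, and real systems weight the votes (e.g. with tf-idf) rather than simply counting shared words.

```python
from collections import Counter, defaultdict


def nearest_word(descriptor, centroids):
    """Index of the closest centroid (visual word) to a local descriptor."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(descriptor, c))
    return min(range(len(centroids)), key=lambda k: dist(centroids[k]))


def build_inverted_file(images, centroids):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, descriptors in images.items():
        for d in descriptors:
            inverted[nearest_word(d, centroids)].add(image_id)
    return inverted


def search(query_descriptors, inverted, centroids):
    """Rank images by the number of visual words shared with the query."""
    votes = Counter()
    for word in {nearest_word(d, centroids) for d in query_descriptors}:
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return [image_id for image_id, _ in votes.most_common()]


# Toy 2-D "descriptors" and a 3-word vocabulary (all values hypothetical).
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
images = {
    "img_a": [[0.1, 0.0], [0.9, 0.1]],
    "img_b": [[0.0, 0.9]],
}
inverted = build_inverted_file(images, centroids)
ranking = search([[1.0, 0.1], [0.0, 0.1]], inverted, centroids)
```

Because the representation is sparse, a query only touches the posting lists of the words it contains instead of comparing against every image.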
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
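Sum or max pooling of convolutional features can be sketched as collapsing each channel's activation map into one number, which gives a global descriptor with one dimension per channel. This is a minimal sketch on a hypothetical 2-channel, 2 x 2 feature map; the L2 normalization step is a common choice assumed here, not something stated on the slide.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel's H x W activation map into a single value."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Two channels of a hypothetical 2 x 2 conv feature map.
feature_maps = [
    [[1.0, 2.0], [0.0, 1.0]],   # channel 0
    [[0.0, 3.0], [0.0, 0.0]],   # channel 1
]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
global_descriptor = l2_normalize(sum_descriptor)
```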
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
[Figure: Resolution (336x256) -> conv5_1 from VGG16 [1] (42x32) -> 25K centroids -> 25K-D vector]
Query Representation: Global Search (GS), Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure: example image with the labels "person", "bag", "me", "my bag"]
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
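A pixel-wise soft-max classifier can be sketched as applying the soft-max independently to the class scores of every pixel and keeping the most probable class. This is a minimal sketch; the 2 x 2 score map with 3 classes is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Apply the soft-max at every pixel and keep the index of the top class."""
    labels = []
    for row in score_map:
        out_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            out_row.append(probs.index(max(probs)))
        labels.append(out_row)
    return labels


# Hypothetical 2 x 2 image with 3 class scores per pixel.
score_map = [
    [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
    [[0.1, 0.2, 3.0], [1.0, 1.0, 4.0]],
]
labels = label_pixels(score_map)
```

Each pixel is classified on its own, so neighbouring pixels are free to receive different labels.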
Objects Segmentation Farabet
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
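The optimal-cover rule above can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost. This is a minimal sketch: the tree, the costs S and the class names are hypothetical, chosen only so that C2 ends up labelled with the class of C5 as in the slide.

```python
def optimal_component(leaf, parent, cost):
    """Component with minimal cost on the path from a leaf to the root."""
    best = node = leaf
    while node in parent:            # the root has no entry in `parent`
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: leaves C1..C4, internal nodes C5 and C6, root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.1, "C4": 0.6, "C5": 0.3, "C6": 0.4, "C7": 0.7}
component_class = {"C1": "road", "C2": "grass", "C3": "sky", "C4": "tree",
                   "C5": "road", "C6": "tree", "C7": "building"}

labels = {
    leaf: component_class[optimal_component(leaf, parent, cost)]
    for leaf in ("C1", "C2", "C3", "C4")
}
```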
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
vector A
[Figure: BBOX CNN -> feature vector 1; BBOX CNN on the region with the background masked out with the mean image -> feature vector 2; concatenation [1 2]]
Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
vector B
[Figure: BBOX CNN -> feature vector 1; REGION CNN -> feature vector 2; concatenation [1 2]]
Training: 2 networks trained in isolation
Testing: results are combined
vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
[Figure: penultimate fully connected layer -> SVM]
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
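The pixel IU (Jaccard index) for one category can be sketched as the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. This is a minimal sketch on hypothetical 2 x 3 label maps; benchmark protocols usually accumulate these counts over a whole dataset and average across categories.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in two label maps."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == category
            in_gt = g == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return intersection / union if union else 0.0


# Hypothetical 2 x 3 label maps.
predicted = [["dog", "dog", "bg"],
             ["bg", "dog", "bg"]]
ground_truth = [["dog", "dog", "dog"],
                ["bg", "bg", "bg"]]

iu_dog = pixel_iu(predicted, ground_truth, "dog")
iu_bg = pixel_iu(predicted, ground_truth, "bg")
```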
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure: example image with the labels "person", "bag", "me", "my bag"]
Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Detection Objects
Slide credit: Amaia Salvador
Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]
+60
Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.
Detection Objects R-CNN
Slide credit: Joost van de Weijer
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador, Joost van de Weijer
Girshick, Ross. "Fast R-CNN." ICCV 2015.
Same as SPP [3] but single scale
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.
RoI pooling on CONV5: an h x w region of interest is divided into an H' x W' grid, so the size of the pooling bins is h/H' x w/W', and max pooling is applied inside each bin
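The pooling-bin arithmetic above can be sketched for a single channel as follows: the region of interest is split into a fixed grid of bins and the maximum is taken inside each one, so regions of any size give an output of the same size. This is a minimal sketch; the floor/ceil rounding of the bin boundaries is one common choice assumed here, and the 4 x 4 feature map is hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool roi = (top, left, h, w) of a 2-D feature map into out_h x out_w bins."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [floor(i * h / out_h), ceil((i + 1) * h / out_h)).
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical single-channel 4 x 4 feature map.
feature_map = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
full = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)
partial = roi_max_pool(feature_map, (1, 1, 3, 3), 2, 2)
```

When h is not a multiple of H', adjacent bins overlap by one row with this rounding, which keeps every bin non-empty.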
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores / Class probabilities]
Objectness scores (object / no object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
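The k = 9 reference boxes per location can be sketched by combining 3 scales with 3 aspect ratios around one centre. This is a minimal sketch; the scale values (128, 256, 512 pixels) and ratios (0.5, 1, 2) are assumptions for illustration, and each box is built so that its area stays equal to the squared scale.

```python
import math


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on one location."""
    anchors = []
    for scale in scales:
        for ratio in ratios:                  # ratio = height / width
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors


anchors = make_anchors(300, 200)
k = len(anchors)
```

For each of the k boxes the RPN predicts an objectness score and a bounding box refinement.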
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores / Class probabilities; the detection branch is labelled Fast R-CNN]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with learned RPN proposals (learned in Step 1). Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize RPN and train again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]
Object is localized based on visual features from AlexNet FC6
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward Function R [Figure: ground-truth bounding box]
Reward Function R for trigger action: 3, with a minimum IoU of 0.6
The reward function considers the number of steps as a cost
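The two rewards can be sketched as follows, under stated assumptions: a transformation action is rewarded when it increases the IoU with the ground-truth box and penalized otherwise, and the trigger action earns +3 when the final box reaches the minimum IoU of 0.6 and -3 when it does not. The +1/-1 values and the sign convention for a failed trigger are assumptions based on a reading of the cited paper, not figures taken from the slide.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def move_reward(old_box, new_box, ground_truth):
    """+1 if a transformation action improved the IoU, -1 otherwise (assumed values)."""
    return 1.0 if iou(new_box, ground_truth) > iou(old_box, ground_truth) else -1.0


def trigger_reward(box, ground_truth, eta=3.0, min_iou=0.6):
    """+eta if the box is good enough when the agent stops, -eta otherwise."""
    return eta if iou(box, ground_truth) >= min_iou else -eta


ground_truth = (0, 0, 10, 10)
step = move_reward((5, 5, 15, 15), (2, 2, 12, 12), ground_truth)   # box moved closer
too_early = trigger_reward((2, 2, 12, 12), ground_truth)           # IoU below 0.6
good_stop = trigger_reward((1, 1, 10, 10), ground_truth)           # IoU = 0.81
```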
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement Learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
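The policy and the Q-learning update target can be sketched in a few lines. This is a minimal sketch, not the authors' training code: the Q-values stand in for the outputs of a Q-network, and the number of actions (9), the discount factor (0.9) and the replay-memory size are assumptions for illustration.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Policy: index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)


# Replay memory of (state, action, reward, next_state, terminal) experiences.
replay_memory = deque(maxlen=1000)
replay_memory.append(("s0", 2, 1.0, "s1", False))
replay_memory.append(("s1", 8, 3.0, None, True))
batch = random.sample(list(replay_memory), k=2)

# Hypothetical Q-network outputs: one value per action.
q_values = [0.1, -0.2, 0.7, 0.0, 0.3, 0.2, -0.5, 0.4, 0.6]
action = greedy_action(q_values)
target = q_target(1.0, q_values)
final_target = q_target(3.0, q_values, terminal=True)
```

Sampling training batches from the replay memory, rather than using consecutive steps, reduces the correlation between the experiences used in each update.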
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz: 25k annotated faces on images downloaded from Flickr, 380k manually annotated facial landmarks
Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative examples otherwise
Total samples: 200K positive and 20M negative
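The labelling rule for training windows can be sketched with an IoU function and a 50% threshold. This is a minimal sketch; the box coordinates are hypothetical and boxes are assumed to be given as (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def is_positive(window, face_boxes, threshold=0.5):
    """True if the window overlaps any annotated face with IoU above the threshold."""
    return any(iou(window, face) > threshold for face in face_boxes)


faces = [(50, 50, 150, 150)]                            # one annotated face
positive = is_positive((60, 60, 160, 160), faces)       # IoU is about 0.68
negative = is_positive((120, 120, 220, 220), faces)     # IoU is about 0.05
```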
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
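Rescaling "3 times per octave" means that three consecutive steps double (or halve) the image size, so each step multiplies the scale by 2^(1/3). A minimal sketch of the resulting scale factors follows; the range from 0.5 to 2.0 is an assumption for illustration.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced so that `steps_per_octave` steps double the image size."""
    scales = []
    i = 0
    while True:
        scale = min_scale * 2.0 ** (i / steps_per_octave)
        if scale > max_scale * (1 + 1e-9):   # small tolerance for rounding error
            break
        scales.append(scale)
        i += 1
    return scales


scales = pyramid_scales(0.5, 2.0)   # two octaves: 0.5, 0.63, 0.79, 1.0, 1.26, 1.59, 2.0
```

The fixed-size detector is then run on every rescaled copy, so that faces of different sizes fall within its window at some level of the pyramid.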
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
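Non-maximum suppression can be sketched as a greedy loop: detections are visited from the highest score down, and a detection is kept only if it does not overlap too much with one already kept. This is a minimal sketch and not the code from the cited tutorial; the IoU threshold of 0.3 and the detections are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def non_maximum_suppression(detections, iou_threshold=0.3):
    """Greedy NMS over (score, box) pairs, highest score first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections: two overlapping boxes on one face and one separate box.
detections = [
    (0.8, (15, 15, 115, 115)),
    (0.9, (10, 10, 110, 110)),
    (0.7, (200, 200, 300, 300)),
]
kept = non_maximum_suppression(detections)
```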
Detection Faces DDFD Results
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
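A rough sketch of sum and max pooling of conv features into one global descriptor. The feature-map shape, the random activations and the final L2 normalisation are assumptions made for illustration; the cited papers add further steps (weighting, whitening) that are not shown here.

```python
import numpy as np

def pool_conv_features(feature_map, mode="sum"):
    """Aggregate a (C, H, W) conv feature map into an L2-normalised C-dim vector."""
    flat = feature_map.reshape(feature_map.shape[0], -1)  # (C, H*W)
    pooled = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

rng = np.random.default_rng(0)
fmap = rng.random((512, 32, 42))           # hypothetical conv activations
sum_desc = pool_conv_features(fmap, "sum")
max_desc = pool_conv_features(fmap, "max")
similarity = float(sum_desc @ max_desc)    # cosine similarity of unit vectors
```

With unit-length descriptors, ranking a database reduces to a dot product between the query vector and each image vector.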
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Resolution (336x256)
conv5_1 from VGG16 [1]
(42x32)
25K centroids, 25K-D vector
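The encoding step above can be sketched as a nearest-centroid assignment followed by a histogram. The sizes below are small stand-ins (100 centroids and 64 channels instead of the 25K centroids on the slide), and the random features and centroids are hypothetical; in practice the centroids would come from clustering, e.g. k-means.

```python
import numpy as np

def assign_to_centroids(features, centroids):
    """Return, for every local feature, the index of its nearest centroid."""
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2.
    d2 = ((features ** 2).sum(axis=1)[:, None]
          - 2.0 * features @ centroids.T
          + (centroids ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)

def bow_histogram(assignments, vocabulary_size):
    """Count how many local features fall on each visual word."""
    return np.bincount(assignments, minlength=vocabulary_size)

rng = np.random.default_rng(0)
h, w, channels, k = 32, 42, 64, 100              # stand-in sizes (assumption)
local_features = rng.random((h * w, channels))   # one feature per map location
centroids = rng.random((k, channels))
assignment_map = assign_to_centroids(local_features, centroids).reshape(h, w)
bow = bow_histogram(assignment_map.ravel(), k)
```

Keeping the 2-D assignment map, and not only the histogram, is what lets a local query be encoded from just the locations inside a region.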
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: θ_i = θ_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
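A minimal sketch of a pixel-wise soft-max, assuming the network has already produced one score map per class. The array shapes and random scores are illustrative only.

```python
import numpy as np

def pixelwise_softmax(scores):
    """scores: (num_classes, H, W) -> class probabilities at every pixel."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 4, 6))   # 5 classes on a 4x6 image (toy sizes)
probs = pixelwise_softmax(scores)
labels = probs.argmax(axis=0)          # per-pixel category label
```

Each pixel is labelled independently of its neighbours, which is why spatial consistency has to be enforced by a later step.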
Objects Segmentation Farabet
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network
Solution 2: Superpixels + CRF. Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
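The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the component with the lowest cost. The tree layout and the cost values below are hypothetical; they are chosen so that leaf C2 ends up covered by C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the component on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]       # move one level up the segmentation tree
        best[leaf] = best_node
    return best

# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.3,
        "C5": 0.4, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(parent, cost, leaves=["C1", "C2", "C3", "C4"])
```

Here C2 has a higher cost than its parent C5, so C2 takes the class predicted for C5, while C1 keeps its own prediction.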
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: feature vectors 1 and 2 are both extracted with the BBOX CNN and concatenated as [1 2]; for vector 2 the background is masked out with the mean image
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from the REGION CNN, concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
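A small sketch of the pixel IU (Jaccard index) between a predicted and a ground-truth label map. The toy label maps are invented for illustration, and averaging over classes with `nanmean` is one common convention, assumed here.

```python
import numpy as np

def pixel_iu(pred, target, num_classes):
    """Per-class intersection over union between two integer label maps."""
    ius = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ius.append(inter / union if union else float("nan"))
    return ius

pred = np.array([[0, 0, 1], [1, 1, 2]])
target = np.array([[0, 1, 1], [1, 1, 2]])
ius = pixel_iu(pred, target, num_classes=3)
mean_iu = float(np.nanmean(ius))
```

In this toy case one pixel of class 1 is mislabelled as class 0, which lowers the IU of both classes.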
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects
Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.
Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs)
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. Fast R-CNN. ICCV 2015.
Slide credit: Amaia Salvador
Same as SPP [3] but single scale
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit: Joost van de Weijer
RoI pooling on CONV5: an h x w region is divided into an H' x W' grid of bins, each of size h/H' x w/W', and max pooling is applied within each bin
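A minimal sketch of the pooling described above: an h x w region of the conv feature map is split into an H' x W' grid and each bin is max-pooled. The bin boundaries use floor for the start and ceil for the end, which is one possible rounding choice and an assumption here; the region is given in feature-map cells and must be at least one cell wide and tall.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """feature_map: (C, H, W); roi: (y0, x0, y1, x1), end-exclusive, in cells."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ys = y0 + (i * h) // out_h                 # floor(i * h / H')
        ye = y0 + -(-((i + 1) * h) // out_h)       # ceil((i + 1) * h / H')
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + -(-((j + 1) * w) // out_w)
            pooled[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return pooled

feature_map = np.arange(36).reshape(1, 6, 6)       # toy single-channel map
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 6), out_h=2, out_w=2)
```

Whatever the size of the region, the output has a fixed H' x W' shape, so it can feed the fully connected layers.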
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
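For reference, the multi-task loss can be written out as follows. The notation follows the Fast R-CNN paper cited above and is not shown on the slide itself: p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the regression targets and λ a balancing weight.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v)

L_{\mathrm{cls}}(p, u) = -\log p_{u}

L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t^{u}_{i} - v_{i}\right)

\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\, x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```

The indicator [u ≥ 1] switches the box regression term off for background regions, which have no ground-truth box.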
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Slide credit: Amaia Salvador
Object proposal computation is the bottleneck in current state-of-the-art object detection systems (Selective Search, CPMC, MCG)
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In IEEE International Conference on Computer Vision (ICCV) 2011 (pp. 1879-1886).
CPMC: Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2010 (pp. 3241-3248).
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Objectness scores (object / no object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
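A sketch of how the k = 9 reference boxes could be generated at one location. The concrete scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper, not values stated on the slide, and the function name is hypothetical.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x0, y0, x1, y1) centred at the origin."""
    anchors = []
    for s in scales:
        for r in ratios:                 # r = height / width
            w = s / np.sqrt(r)           # keeps the area equal to s * s
            h = s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
```

At each position of the conv feature map, the RPN would predict one objectness score pair and one box regression per anchor.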
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities; the detection branch is labelled Fast R-CNN]
4-step training to share features for RPN and Fast R-CNN:
Step 1: Train RPN initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with learned RPN proposals (learned in Step 1). ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize RPN and train again. Weights from Step 2 (fixed)
Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with RPN proposals learned in Step 3. Weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Object is localized based on visual features from AlexNet FC6
Slide credit: Míriam Bellver
Set of actions A:
Transformation actions
Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
Reward function R [figure: ground-truth bounding box]
Reward function R for the trigger action [figure annotations: 3, minimum IoU 0.6]
The reward function considers the number of steps as a cost
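A hedged sketch of the two rewards. It assumes, following the cited paper and not the slide text, that a transformation action is rewarded by the sign of the change in IoU with the ground-truth box, and that the trigger earns +3 when the IoU reaches the 0.6 minimum and -3 otherwise. Function names and example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth enough, else -eta."""
    return eta if iou(box, gt) >= tau else -eta

gt = (0, 0, 10, 10)
step = move_reward(old_box=(5, 5, 15, 15), new_box=(2, 2, 12, 12), gt=gt)
early_stop = trigger_reward((2, 2, 12, 12), gt)   # IoU is about 0.47
good_stop = trigger_reward((1, 0, 10, 10), gt)    # IoU is 0.9
```

Treating an unchanged IoU as a negative step is a simplification made in this sketch.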
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset (source: Annotated Facial Landmarks in the Wild, by TU Graz)
25k annotated faces on images downloaded from Flickr
380k manually annotated facial landmarks
Randomly samples sub-windows (blocks)
Positive examples if Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
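One reading of "3 times per octave" is that consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. The sketch below lists the scale factors under that assumption; the minimum and maximum scales are hypothetical parameters.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, spaced by 2 ** (1 / steps_per_octave)."""
    factor = 2.0 ** (1.0 / steps_per_octave)
    scales, s = [], min_scale
    while s <= max_scale * (1 + 1e-9):   # small tolerance for rounding error
        scales.append(s)
        s *= factor
    return scales

scales = pyramid_scales(0.5, 2.0)        # two octaves around the original size
```

The detector is then run on each rescaled copy, so a fixed-size window can match faces of different sizes.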
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
This makes it possible to:
Efficiently run the convnet on images of any size
Obtain a heat-map of the face detector
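A toy demonstration of why a fully connected layer can be run as a convolution: its weight matrix is reshaped into one k x k kernel per output unit and applied at every window of a larger feature map, which yields a heat-map. Shapes and random weights are made up for illustration, and the explicit loops favour clarity over speed.

```python
import numpy as np

def fc_as_conv(feature_map, weights, bias, k):
    """Apply an FC layer trained on (C, k, k) inputs at every k x k window of (C, H, W)."""
    c, h, w = feature_map.shape
    kernels = weights.reshape(-1, c, k, k)          # one kernel per FC output unit
    out = np.empty((kernels.shape[0], h - k + 1, w - k + 1))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            window = feature_map[:, y:y + k, x:x + k]
            out[:, y, x] = np.tensordot(kernels, window, axes=3) + bias
    return out

rng = np.random.default_rng(0)
c, k = 4, 3
weights = rng.normal(size=(2, c * k * k))           # FC layer with 2 output units
bias = rng.normal(size=2)
fmap = rng.normal(size=(c, 5, 6))                   # larger than the k x k training input
heat = fc_as_conv(fmap, weights, bias, k)
```

At each location the output equals what the original FC layer would give on that window, but neighbouring windows share the earlier conv computation in a real network.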
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
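A generic greedy NMS sketch, given as an illustration of the idea and not as the exact variant used by DDFD or by the cited tutorial. The 0.5 IoU threshold and the example boxes are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as (x0, y0, x1, y1); returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]                 # best score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        iw = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                        - np.maximum(boxes[i, 0], boxes[rest, 0]))
        ih = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                        - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]           # drop boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so only the higher-scoring one survives.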
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Faces → FaceNet → Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
by means of well chosen triplets using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
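The triplets mentioned above are (anchor, positive, negative) embeddings: the anchor should be closer to the positive (same identity) than to the negative by some margin. The sketch below follows the triplet loss form of the cited FaceNet paper; the margin value of 0.2 and the toy 2-D embeddings are assumptions, and triplet selection is not shown.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Per-triplet hinge loss on squared Euclidean distances between embeddings."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy unit-length embeddings: the first triplet is already satisfied, the second is not.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[1.0, 0.0], [1.0, 0.0]])
n = np.array([[0.0, 1.0], [0.0, 1.0]])
losses = triplet_loss(a, p, n)
```

Triplets with zero loss contribute no gradient, which is why choosing informative triplets matters for training.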
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. British Machine Vision Conference (BMVC) 2015. [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing.
Problem with Solutions 1 & 2: the observation level.
Binary Partition Tree (BPT) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
C2 will be labelled with the class of C5.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
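The optimal-cover rule can be sketched as a walk from each leaf to the root of the tree that keeps the component with the lowest cost. The tree, the node names and the cost values below are hypothetical, chosen only so that C2 ends up covered by C5 as on the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the lowest-cost node on its path to the root."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent[node]
        best[leaf] = choice
    return best

# Hypothetical segmentation tree: C7 is the root, C1..C4 are the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
# Hypothetical costs S (lower means a purer class distribution).
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.5,
        "C5": 0.3, "C6": 0.4, "C7": 0.8}

cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so different pixels can be labelled at different levels of the tree.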
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes.
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: a single BBOX CNN produces feature vector 1 and feature vector 2 (the latter with the background masked out with the mean image), concatenated as [1 2]. The network is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: BBOX CNN (feature vector 1) and REGION CNN (feature vector 2), concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.
Vector C: training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Features: penultimate fully connected layer. Classifier: SVM.
Results on pixel IU (Jaccard index) to evaluate semantic segmentation.
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
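The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both maps divided by the number labelled with it in either map. A minimal sketch, assuming label maps stored as nested lists of integer category ids (the function name is illustrative):

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labelled `category`."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # The index is undefined when the category is absent from both maps.
    return inter / union if union else float("nan")

pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
iu_cat1 = pixel_iu(pred, truth, 1)  # 1 shared pixel out of 2 in the union
```

Averaging this value over all categories gives the mean IU commonly reported for semantic segmentation.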
One lecture organized in four parts
[Outline figure. Local analysis for: Proposals, Detection, Recognition, Segmentation]
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects R-CNN
Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015.
Slide credit: Amaia Salvador
Same as SPP [3], but single scale.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
[Figure: RoI pooling on CONV5. An h x w region of interest is divided into an H' x W' grid; size of pooling bins: h/H' x w/W'; max pooling in each bin]
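The pooling-bin arithmetic can be sketched as below. This is a plain-Python illustration on a single-channel feature map, assuming the region is given in feature-map cells as (top, left, h, w); the floor/ceil rounding of bin edges is one common choice and not necessarily the one used in the original code.

```python
import math

def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool an h x w region of `fmap` into an out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers roughly h / out_h rows of the region.
        y0 = top + math.floor(i * h / out_h)
        y1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            x0 = left + math.floor(j * w / out_w)
            x1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled

# 4 x 4 feature map with values 0..15 in row-major order.
fmap = [[4 * y + x for x in range(4)] for y in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
```

Whatever the size of the region, the output grid is fixed, which is what lets the fully connected layers that follow accept proposals of any shape.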
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
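As defined in the Fast R-CNN paper, the multi-task loss adds a classification term (log loss on the true class) and a localization term (smooth L1 on the box offsets) that is only active for non-background classes. A minimal sketch, assuming class 0 is the background and using hypothetical argument names:

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(probs, true_class, pred_box, target_box, lam=1.0):
    """L = L_cls + lam * [true_class >= 1] * L_loc."""
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:  # background RoIs have no box target
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss

loss = multi_task_loss([0.5, 0.5], 1, (0.0, 0.0, 0.0, 0.0), (0.5, 0.0, 0.0, 2.0))
```

Training both terms together is what allows the classifier and the bounding-box regressor to share one network instead of being fitted in separate stages.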
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.
[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
[CPMC] Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals with a Region Proposal Network (RPN).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: Faster R-CNN architecture. Labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
RPN outputs: objectness scores (object / no object) and bounding box regression.
In practice k = 9 (3 different scales and 3 aspect ratios).
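The k = 9 reference boxes (anchors) at one location can be enumerated as follows. The scale and ratio values are the defaults reported in the Faster R-CNN paper; the function name and the (x0, y0, x1, y1) box convention are assumptions of this sketch.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes centred on (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r is height / width
            w = s / math.sqrt(r)  # keeps the area close to s * s
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)
```

For every location the RPN then predicts 2 objectness scores and 4 box offsets per anchor, that is 2k and 4k outputs.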
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: Faster R-CNN architecture, with the label Fast R-CNN]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize RPN, which is trained again. Weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Weights from Steps 2 & 3 (fixed).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6.
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
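The state (o, h) can be assembled as below. The split of the 90 history dimensions into the 10 most recent actions, each one-hot encoded over 9 possible actions, follows my reading of the Caicedo and Lazebnik paper and should be taken as an assumption; the function names are illustrative.

```python
NUM_ACTIONS = 9    # assumed: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumed: number of past actions remembered

def encode_history(past_actions):
    """Binary vector of HISTORY_LEN one-hot blocks, zero-padded at the end."""
    recent = list(past_actions)[-HISTORY_LEN:]
    vec = []
    for action in recent:
        one_hot = [0] * NUM_ACTIONS
        one_hot[action] = 1
        vec.extend(one_hot)
    vec.extend([0] * (NUM_ACTIONS * (HISTORY_LEN - len(recent))))
    return vec

def make_state(o, past_actions):
    """Concatenate the CNN descriptor o with the action history h."""
    return list(o) + encode_history(past_actions)

state = make_state([0.0] * 4096, [3, 8])
```

The history helps the agent avoid repeating the same sequence of moves when the image content alone is ambiguous.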
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward function R: [equation not recovered; it is computed with respect to the ground-truth bounding box]
Reward function R for the trigger action: [equation not recovered; recovered values: 3, minimum IoU 0.6]
The reward function considers the number of steps as a cost.
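A possible reading of the two reward functions, based on the Caicedo and Lazebnik paper rather than on the slide text, is sketched here: a transformation is rewarded by the sign of the change in IoU with the ground-truth box, and the trigger is rewarded with +3 or -3 depending on whether the final IoU reaches 0.6. The function names, and the choice of -1 when the IoU does not change, are assumptions.

```python
def move_reward(iou_before, iou_after):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0

def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """+eta if the box reaches the minimum IoU tau when the search stops, else -eta."""
    return eta if final_iou >= tau else -eta
```

Since every transformation that fails to improve the overlap is penalised, long searches accumulate negative reward, which is one way to see the number of steps acting as a cost.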
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
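The greedy policy, the Q-learning target and the replay memory can be sketched in a few lines. The network itself is left out: `q_values` stands for its output (one value per action), and the class and function names, as well as the discount value, are assumptions of this sketch.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=q_values.__getitem__)

def td_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r + gamma * max_a' Q(s', a'), or r at the end of a search."""
    return reward if terminal else reward + gamma * max(next_q_values)

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, *experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Sampling past experiences at random breaks the strong correlation between consecutive steps of one search sequence when the network is updated.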
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative sample otherwise.
Total samples: 200K positive and 20M negative.
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
Sliding window of 227x227 over the test image.
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
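Rescaling 3 times per octave means that consecutive scale factors differ by 2^(1/3), so three steps double (or halve) the image size. A small sketch of how such a list of factors could be generated; the range limits are hypothetical parameters, not values from the paper.

```python
import math

def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors 2 ** (k / per_octave) lying between min_scale and max_scale."""
    eps = 1e-9  # tolerance against floating-point rounding at the limits
    k = math.ceil(per_octave * math.log2(min_scale) - eps)
    scales = []
    while 2 ** (k / per_octave) <= max_scale * (1 + eps):
        scales.append(2 ** (k / per_octave))
        k += 1
    return scales

scales = pyramid_scales(0.5, 2.0)
```

A fixed 227x227 window applied at each of these scales covers faces both smaller and larger than the window in the original image.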
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
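A greedy form of non-maximum suppression keeps the highest-scoring box and discards any remaining box that overlaps an already kept one too much. This is a generic sketch, not the variant used by DDFD; boxes are assumed to be (x0, y0, x1, y1) tuples and the 0.5 threshold is an arbitrary default.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

In the example the second box overlaps the first with an IoU of about 0.68, so only the first and third boxes survive.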
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
[Outline figure. Local analysis for: Proposals, Detection, Recognition, Segmentation]
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)...
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
...by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
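The triplet loss used to learn the embedding asks the anchor to be closer to a positive (same identity) than to a negative (different identity) by at least a margin. A minimal sketch on plain lists of floats; the margin of 0.2 is the value reported in the FaceNet paper, and the function names are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 0.0))   # negative already far away
hard = triplet_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))   # negative as close as the positive
```

Triplets such as `easy` contribute no gradient, which is why the selection of informative triplets matters for training.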
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record). YouTube Faces DB: 95.12%.
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
[Figure: image database, visual query "A dog", expected outcome]
[Figure: image database, visual query "This dog", expected outcome]
Instance retrieval (instance: object, building, person, place, ...)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space
Bag of Visual Words: v1 = (v11, ..., v1n), ..., vk = (vk1, ..., vkn). High-dimensional, highly sparse.
INVERTED FILE
word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
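An inverted file stores, for each visual word, the list of images that contain it, so a query only touches the images sharing at least one word with it. The sketch below uses made-up image ids and word ids and ranks by the number of shared words, which is a simplification of the weighted scores used in practice.

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """image_words maps image id -> visual word ids found in that image."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted

def search(inverted, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = Counter()
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()

# Hypothetical database of four images.
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]})
ranking = search(index, [1, 2])
```

Because the bag-of-words vectors are highly sparse, most images are never visited for a given query, which is what makes the structure scale.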
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
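Sum or max pooling turns a convolutional feature map, seen as a grid of local D-dimensional descriptors, into a single D-dimensional global descriptor. A minimal sketch on nested lists; the cited methods add steps such as weighting and normalization that are omitted here.

```python
def pool_conv_features(fmap, mode="sum"):
    """fmap: H x W grid of D-dimensional descriptors -> one D-dimensional vector."""
    locations = [vec for row in fmap for vec in row]
    reducer = sum if mode == "sum" else max
    # zip(*locations) iterates over channels, each holding H * W values.
    return [reducer(channel) for channel in zip(*locations)]

fmap = [[[1, 0], [2, 5]],
        [[3, 1], [0, 2]]]
sum_pooled = pool_conv_features(fmap, "sum")
max_pooled = pool_conv_features(fmap, "max")
```

Either way the result has a fixed length regardless of the image size, so images can be compared with a simple distance between vectors.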
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Image resolution (336x256) -> conv5_1 from VGG16 [1] (42x32) -> 25K centroids -> 25K-D vector
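The pipeline on this slide can be read as: every location of the conv5_1 map is a local descriptor, each descriptor is assigned to its nearest centroid (visual word), and the image is described by the histogram of assignments. The sketch below illustrates that reading with two toy centroids instead of 25K; the function names are hypothetical.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])))

def bag_of_words(fmap, centroids):
    """Histogram of visual words over all locations of a feature map."""
    histogram = [0] * len(centroids)
    for row in fmap:
        for vec in row:
            histogram[nearest_centroid(vec, centroids)] += 1
    return histogram

centroids = [(0, 0), (10, 10)]
fmap = [[(1, 0), (9, 9)],
        [(0, 1), (0, 0)]]
histogram = bag_of_words(fmap, centroids)
```

With 25K centroids and only 42x32 locations per image, most histogram bins stay at zero, which is the sparsity that inverted files exploit.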
Objects Recognition Retrieval
Query representation:
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts
[Outline figure. Local analysis for: Proposals, Detection, Recognition, Segmentation]
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
Detection Objects R-CNN
19
Slide credit Joost van de Weijer
Detection Objects R-CNN
20
Slide credit Joost van de Weijer
Detection Objects R-CNN
21
Detection Objects Fast R-CNN
22
Girshick, Ross. Fast R-CNN. ICCV 2015.
Detection Objects Fast R-CNN
23
Slide credit Amaia Salvador
Detection Objects Fast R-CNN
24
Slide credit Amaia Salvador
Same as SPP[3] but single scale
Detection Objects Fast R-CNN
25
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
Figure labels: CONV5 feature map (height H), region of size h x w
Size of pooling bins: h/H' x w/W'
Max pooling within each bin
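The bin arithmetic on this slide can be sketched in a few lines. This is a minimal illustration, assuming a single-channel feature map stored as a list of lists and a region given as (top, left, h, w) in feature-map coordinates; the name `roi_max_pool` and the floor/ceil bin boundaries are choices made here, not taken from the slides.

```python
import math

def roi_max_pool(fmap, top, left, h, w, out_h, out_w):
    """Max-pool an h x w region of fmap into an out_h x out_w grid (H' x W')."""
    pooled = []
    for i in range(out_h):
        # Rows covered by bin i: each bin spans roughly h / H' rows.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            # Columns covered by bin j: each bin spans roughly w / W' columns.
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 4 x 6 feature map; pool the region at rows 0..3, columns 1..4 into 2 x 2.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, top=0, left=1, h=4, w=4, out_h=2, out_w=2))  # [[8, 10], [20, 22]]
```

Whatever the size of the region, the output always has H' x W' values, which is what lets regions of different sizes feed the same fully connected layers.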
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
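The slide only names the loss. A minimal sketch for a single region, assuming the usual Fast R-CNN formulation (log loss on the true class plus a smooth L1 box regression term that is skipped for background); the function names, the background index 0 and the weight `lam=1.0` are assumptions for illustration.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss for one region.

    Class 0 is assumed to be background, which has no box target,
    so the localisation term is only added when true_class >= 1.
    """
    loss_cls = -math.log(class_probs[true_class])
    loss_loc = 0.0
    if true_class >= 1:
        loss_loc = sum(smooth_l1(t - v) for t, v in zip(box_pred, box_target))
    return loss_cls + lam * loss_loc

# Foreground region: both terms contribute.
print(multi_task_loss([0.1, 0.8, 0.1], 1, [0.2, 0.0, 0.1, 2.0], [0.0, 0.0, 0.1, 0.0]))
# Background region: only the classification term contributes.
print(multi_task_loss([0.9, 0.05, 0.05], 0, [0.0] * 4, [0.0] * 4))
```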
Detection Objects Faster R-CNN
29
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Object proposal computation is the bottleneck in current state-of-the-art object detection systems
[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
[CPMC] Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Replace the use of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
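A short sketch of how k = 9 reference boxes can be enumerated at one sliding-window position. The slide gives only the count; the concrete scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumed here for illustration, and `make_anchors` is a hypothetical helper name.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area equal to s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)
print(len(anchors))  # 9
```

For every such box the RPN predicts an objectness score and a box regression, so each position yields k scores pairs and k sets of box offsets.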
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
4-step training to share features for RPN and Fast R-CNN
Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model
Diagram: Conv layers (ImageNet weights, fine-tuned), Conv Layer 5, RPN, RPN Proposals
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Step 2: Train Fast R-CNN with learned RPN proposals
Diagram: Conv layers (ImageNet weights, fine-tuned), Conv Layer 5, RPN Proposals (learned in 1), Class probabilities
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Step 3: The model trained in 2 is used to initialize RPN and train again
Diagram: Conv layers (weights from Step 2, fixed), Conv Layer 5, RPN, RPN Proposals
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in 3
Diagram: Conv layers (weights from Steps 2 & 3, fixed), Conv Layer 5, RPN Proposals (learned in 3), Class probabilities
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Detection Objects Reinforcement
46
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement
47
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
48
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
49
Slide credit Míriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region with inhibition-of-return (IoR)
Detection Objects Reinforcement
50
Slide credit Míriam Bellver
Set of states S
(o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement
51
Slide credit Míriam Bellver
Reward function R
Figure label: ground-truth bounding box
Detection Objects Reinforcement
52
Slide credit Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Formula annotations: 3; minimum IoU 0.6
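A minimal sketch of the two reward functions, assuming boxes are (x1, y1, x2, y2) tuples. Reading the slide's "3" as the magnitude of the trigger reward and 0.6 as the IoU threshold is an interpretation made here, and the names `move_reward` and `trigger_reward` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt_box):
    """+1 if a transformation action improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0

def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps the ground truth enough, else -eta."""
    return eta if iou(box, gt_box) >= min_iou else -eta

gt = (0, 0, 10, 10)
print(move_reward((5, 5, 15, 15), (2, 2, 12, 12), gt))  # 1.0
print(trigger_reward((1, 1, 10, 10), gt))               # 3.0
print(trigger_reward((2, 2, 12, 12), gt))               # -3.0
```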
Detection Objects Reinforcement
53
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
54
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
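The policy and the learning target can be sketched without a neural network by treating the Q-values as a plain list with one entry per action. The discount `gamma=0.9`, the replay capacity, the exploration rate `epsilon` and the helper names are all assumptions for illustration.

```python
import random
from collections import deque

replay_memory = deque(maxlen=1000)  # hypothetical capacity

def select_action(q_values, epsilon=0.0):
    """Pick the action with the highest estimated Q-value.

    With probability epsilon a random action is explored instead.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r + gamma * max over next actions (just r at a terminal step)."""
    return reward if terminal else reward + gamma * max(next_q_values)

# One stored experience: (state, action, reward, next_state).
replay_memory.append(("s0", 2, 1.0, "s1"))
print(select_action([0.1, -0.3, 0.7]))  # 2
print(q_target(1.0, [0.2, 0.5, 0.1]))
```

During training, experiences sampled from the replay memory would supply the (reward, next state) pairs from which these targets are computed.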
Detection Objects Reinforcement
55
Slide credit Míriam Bellver
Detection Objects Reinforcement
56
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
57
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement
58
Slide credit Míriam Bellver
Detection Objects Reinforcement
59
Slide credit Míriam Bellver
Detection Faces
60
Detection Faces DDFD
61
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
62
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
63
Randomly sampled sub-windows (blocks)
Positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative example otherwise
Total samples: 200K positive and 20M negative
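The labelling rule for sampled sub-windows can be sketched as follows, assuming boxes are (x1, y1, x2, y2) tuples; `label_window` is a hypothetical helper name and the box coordinates are made up for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, face_boxes, threshold=0.5):
    """Positive (1) if the window overlaps any annotated face with IoU > threshold, else 0."""
    return 1 if any(iou(window, face) > threshold for face in face_boxes) else 0

faces = [(10, 10, 60, 60)]
print(label_window((15, 15, 65, 65), faces))  # 1
print(label_window((40, 40, 90, 90), faces))  # 0
```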
Detection Faces DDFD Test
64
Test images are rescaled up/down 3 times per octave to find faces of different sizes
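"3 times per octave" means consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. A small sketch under that reading; the scale range 0.5 to 2.0 and the name `pyramid_scales` are assumptions for illustration.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced so that every `steps_per_octave` steps double the size."""
    scales = []
    k = 0
    # The small tolerance keeps max_scale itself despite floating-point rounding.
    while min_scale * 2 ** (k / steps_per_octave) <= max_scale * (1 + 1e-9):
        scales.append(min_scale * 2 ** (k / steps_per_octave))
        k += 1
    return scales

scales = pyramid_scales(0.5, 2.0)
print(len(scales))                    # 7
print([round(s, 3) for s in scales])
```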
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
Detection Faces DDFD Test
67
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat map of the face detector
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
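A minimal sketch of greedy non-maximum suppression, assuming boxes are (x1, y1, x2, y2) tuples with one score each. The overlap threshold of 0.3 is an assumed value, and this is a generic greedy variant rather than the code from the cited source.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Visit boxes by decreasing score; keep one only if it overlaps no kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```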
Detection Faces DDFD Results
69
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
71
Local analysis for: Proposals, Detection, Recognition, Segmentation
Figure labels: person, bag, me, my bag
Faces Recognition FaceNet
72
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
73
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
74
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
75
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
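The role of a triplet (anchor, positive, negative) can be illustrated with a hinge on squared Euclidean distances in the embedding space. This is a sketch assuming embeddings are plain lists of floats; the margin of 0.2 and the toy vectors are assumed values, and the selection of "well-chosen" triplets is not shown.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

anchor, positive = [1.0, 0.0], [0.9, 0.1]
print(triplet_loss(anchor, positive, [0.0, 1.0]))  # easy negative: 0.0
print(triplet_loss(anchor, positive, [0.8, 0.2]))  # hard negative: positive loss
```

Only triplets with a non-zero loss contribute a gradient, which is why the choice of triplets matters for training.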
Faces Recognition FaceNet
76
Faces Recognition FaceNet
77
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
78
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet
79
Faces Recognition FaceNet Test
80
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
81
Software implementation: OpenFace
Faces Recognition VGG Face
82
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
83
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.
Objects Recognition Retrieval
84
Image Database
Visual Query: “A dog”
Expected outcome
Objects Recognition Retrieval
85
Image Database
Visual Query: “This dog”
Expected outcome
Objects Recognition Retrieval
86
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
87
Local hand-crafted features (e.g. SIFT)
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
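An inverted file maps each visual word to the images that contain it, so a query only touches images sharing at least one word with it. A minimal sketch with a made-up toy database; the voting score and the helper names are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

def candidates(inverted, query_words):
    """Count, per image, how many distinct query words it shares (a simple voting score)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 4], 12: [1, 3]}
inv = build_inverted_file(image_words)
print(inv[1])                   # [1, 12]
print(candidates(inv, [1, 3]))  # [(12, 2), (1, 1), (10, 1)]
```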
Objects Recognition Retrieval
88
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
89
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
90
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
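Sum or max pooling collapses each channel of a convolutional activation tensor into one number, giving one descriptor per image. A sketch assuming the tensor is a nested list of shape C x H x W; the L2 normalisation at the end is an added assumption, and `pool_conv_features` is a hypothetical name.

```python
def pool_conv_features(fmap, mode="sum"):
    """Collapse a C x H x W activation tensor into a C-dimensional descriptor."""
    reduce_fn = sum if mode == "sum" else max
    desc = [reduce_fn(v for row in channel for v in row) for channel in fmap]
    # L2-normalise so descriptors can be compared with a dot product.
    norm = sum(d * d for d in desc) ** 0.5
    return [d / norm for d in desc] if norm > 0 else desc

# Toy tensor with 2 channels of size 2 x 2.
fmap = [[[1, 2], [3, 4]], [[0, 0], [0, 6]]]
print(pool_conv_features(fmap, "sum"))
print(pool_conv_features(fmap, "max"))
```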
Objects Recognition Retrieval
91
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
92
Objects Recognition Retrieval
93
Resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids, 25K-D vector
Objects Recognition Retrieval
94
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
95
One lecture organized in four parts
96
Local analysis for: Proposals, Detection, Recognition, Segmentation
Figure labels: person, bag, me, my bag
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
98
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Objects Segmentation Farabet
100
The same parameters are used in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
101
Upsampling and concatenation
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Objects Segmentation Farabet
103
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Objects Segmentation Farabet
106
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Objects Segmentation Farabet
107
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
108
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
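The selection rule for the optimal component can be sketched on a small tree. The node names, the parent links and the cost values below are hypothetical, and the cost S of each component is assumed to be given already.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = choice
    return best

# Hypothetical segmentation tree: p1 and p2 under A, p3 under B, A and B under root.
parent = {"p1": "A", "p2": "A", "p3": "B", "A": "root", "B": "root", "root": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "A": 0.3, "B": 0.6, "root": 0.7}
print(optimal_cover(parent, cost, ["p1", "p2", "p3"]))
# {'p1': 'A', 'p2': 'A', 'p3': 'p3'}
```

Each pixel then takes the class predicted for its selected component, so pixels under the same low-cost component receive consistent labels.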
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
vector A
Diagram: BBOX CNN, feature vector 1; BBOX CNN (background masked out with the mean image), feature vector 2; [1 2]
Fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
vector B
Training: 2 networks trained in isolation
Testing: results are combined
Diagram: BBOX CNN, feature vector 1; REGION CNN, feature vector 2; [1 2]
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
One lecture organized in four parts
118
Local analysis for: Proposals, Detection, Recognition, Segmentation
Figure labels: person, bag, me, my bag
Thank you
119
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects R-CNN
20
Slide credit Joost van de Weijer
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects R-CNN
21
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
22
Girshick Ross Fast R-CNN ICCV 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
23
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
24
Slide credit Amaia Salvador
Same as SPP[3] but single scale
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
25
He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015
Slide credit Joost van de Weijer
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
H
h
w
h
w
Size of pooling binsh Hrsquo x w Wrsquo
wWrsquo
hHrsquomax pooling
CONV5
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
29
Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
[Figure: the same architecture diagram, with the detection branch labelled Fast R-CNN]
4-step training to share features between the RPN and Fast R-CNN
Step 1: Train the RPN, initialized with an ImageNet pre-trained model (conv layers: ImageNet weights, fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1 (conv layers: ImageNet weights, fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again (conv layers: weights from Step 2, fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3 (conv layers: weights from Steps 2 & 3, fixed)
[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6
Slide credit Míriam Bellver
Set of actions A
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S
Each state is a pair (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Reward function R for transformation actions, defined with respect to the ground-truth bounding box [formula not recoverable]
Reward function R for the trigger action [formula not recoverable; figure annotations: 3, minimum IoU 0.6]
The reward function considers the number of steps as a cost
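The two reward functions were images on the slides, so the sketch below is a reconstruction under stated assumptions rather than the authors' definition. It assumes that a transformation action earns +1 when it increases the IoU with the ground-truth bounding box and -1 otherwise, and that the annotations "3" and "minimum IoU 0.6" are the reward magnitude and the IoU threshold of the trigger action. All function and parameter names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def transformation_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the move improved the overlap with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """Assumed form: +eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta


gt = (0, 0, 10, 10)
print(transformation_reward((0, 0, 5, 10), (0, 0, 8, 10), gt))
print(trigger_reward((0, 0, 8, 10), gt))
```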
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
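A minimal sketch of the policy and of the standard Q-learning target may help here. It is an illustration only: the Q-network is replaced by a plain list of action values, and the exploration rate `epsilon` and discount factor `gamma` are assumed parameters that the slides do not mention.

```python
import random


def greedy_action(q_values):
    """Policy from the slide: pick the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """Assumed exploration scheme for training: random action with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a'), or r at a terminal step."""
    return reward if terminal else reward + gamma * max(next_q_values)


q = [0.1, 0.7, 0.3]  # hypothetical outputs of the Q-network, one per action
print(greedy_action(q))
```

Transitions (state, action, reward, next state) sampled from the replay memory would supply the `reward` and `next_q_values` arguments of `q_target`.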
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks
Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
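Three rescalings per octave correspond to a constant factor of 2^(1/3) between consecutive pyramid levels. The sketch below lists such scale factors; `pyramid_scales` and the range 0.5 to 2.0 are hypothetical choices for the illustration, not values from the slides.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors from min_scale to max_scale, with `per_octave` levels per doubling."""
    step = 2.0 ** (1.0 / per_octave)
    scales = []
    s = min_scale
    # Small tolerance so that accumulated floating-point error does not drop max_scale
    while s <= max_scale * (1 + 1e-9):
        scales.append(s)
        s *= step
    return scales


for s in pyramid_scales(0.5, 2.0):
    print(round(s, 3))
```

Each scaled copy of the image is scanned with the same fixed-size detector, so larger faces are found on the smaller copies.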
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
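The conversion can be illustrated with a toy case: one fully-connected unit that reads a K x K single-channel input is the same computation as a K x K convolution filter evaluated at one position. Sliding that filter over a larger image yields one response per window, which is the heat-map. The sketch is pure Python with a hypothetical function name and toy data, and it is not the DDFD implementation.

```python
def fc_as_conv(image, weights):
    """Slide a K x K weight grid over a 2-D image (valid positions, stride 1).

    Each output cell is the dot product that the original fully-connected
    unit would compute on that K x K window.
    """
    k = len(weights)
    rows, cols = len(image), len(image[0])
    heat_map = []
    for i in range(rows - k + 1):
        row = []
        for j in range(cols - k + 1):
            row.append(sum(weights[u][v] * image[i + u][j + v]
                           for u in range(k) for v in range(k)))
        heat_map.append(row)
    return heat_map


weights = [[1, 0], [0, 1]]                 # toy weights of one FC unit (K = 2)
print(fc_as_conv([[1, 2], [4, 5]], weights))                  # K x K input: single response
print(fc_as_conv([[1, 2, 3], [4, 5, 6], [7, 8, 9]], weights))  # larger input: heat-map
```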
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
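A minimal greedy NMS sketch follows. It is a generic version written for this transcript, not the code from the cited source, and the IoU threshold of 0.3 is an assumed value.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the kept boxes, highest score first.

    A box is kept only if it does not overlap an already kept box
    by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```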
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009) 207-244
The embedding is learned by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
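The slides mention triplets without giving the loss. A common formulation, assumed here, takes an anchor face, a positive (same identity) and a negative (different identity), and penalizes the triplet unless the negative is farther from the anchor than the positive by at least a margin. The margin value and the toy 2-D embeddings below are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Toy 2-D embeddings (hypothetical values)
print(triplet_loss([0, 0], [0, 0], [1, 0]))  # negative already far enough: no loss
print(triplet_loss([0, 0], [1, 0], [1, 0]))  # negative as close as the positive: loss
```

Triplets with zero loss do not contribute to training, which is why the choice of triplets matters.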
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference (BMVC), 2015 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016
[Figure: image database, visual query "A dog", expected outcome]
[Figure: image database, visual query "This dog", expected outcome]
Instance Retrieval (Instance: Object, Building, Person, Place…)
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (word, image IDs) [table flattened in extraction: 1 1 12 2 1 30 1023 10 124 23 6 10]
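Because a bag-of-visual-words vector is highly sparse, an inverted file only stores, for each visual word, the images that contain it. The sketch below uses hypothetical image IDs and word IDs (it does not reconstruct the table from the slide) and ranks images by the number of query words they share.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by how many distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for a stable order
    return sorted(votes, key=lambda i: (-votes[i], i))


database = {1: [5, 7], 2: [7, 9], 3: [9], 4: [5, 9]}  # hypothetical image -> words
index = build_inverted_file(database)
print(search(index, [9, 5]))
```

Only the lists of the words present in the query are visited, so the cost does not grow with the vocabulary size.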
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
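Sum or max pooling collapses each channel of a convolutional layer over all spatial positions, giving one value per channel and therefore a compact global descriptor. The sketch below is a generic illustration with toy feature maps and hypothetical function names; it does not reproduce any of the cited methods in detail.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of lists.

    Returns a C-dimensional descriptor, one pooled value per channel.
    """
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in channel for value in row)
            for channel in feature_maps]


def l2_normalize(vector):
    """Scale the descriptor to unit length so that dot products act as cosine similarity."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm > 0 else list(vector)


feature_maps = [[[1, 2], [3, 4]], [[0, 5], [1, 0]]]  # toy input: C = 2, H = W = 2
print(pool_conv_features(feature_maps, "sum"))
print(pool_conv_features(feature_maps, "max"))
```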
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
[Figure: bag of words over local convolutional features. Image at (336x256) resolution; conv5_1 from VGG16 [1] gives a (42x32) map of local features; each local feature is assigned to one of 25K centroids; the image is represented by a 25K-D vector]
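The assignment step can be sketched as a nearest-centroid lookup followed by a word count. The example is a toy under stated assumptions: 2-D features and 2 centroids stand in for the conv5_1 activations and the 25K centroids, and the function names are hypothetical.

```python
def assign_words(local_features, centroids):
    """Give each local feature the index of its nearest centroid (its visual word)."""
    words = []
    for feature in local_features:
        distances = [sum((a - b) ** 2 for a, b in zip(feature, c)) for c in centroids]
        words.append(distances.index(min(distances)))
    return words


def bow_histogram(words):
    """Sparse bag of words: only the visual words that occur are stored."""
    histogram = {}
    for word in words:
        histogram[word] = histogram.get(word, 0) + 1
    return histogram


centroids = [(0, 0), (10, 10)]           # toy codebook
features = [(1, 0), (9, 9), (0, 1)]      # toy local features, one per spatial position
words = assign_words(features, centroids)
print(words, bow_histogram(words))
```

Keeping the word of every spatial position (the assignment map) also makes it possible to build a histogram for a sub-window only.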
Query Representation
- Global Search (GS)
- Local Search (LS)
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
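A pixel-wise soft-max turns the class scores of each pixel into probabilities, independently of its neighbours, and the pixel takes the most probable class. The sketch below assumes the scores have already been computed from the concatenated multiscale features; the names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of per-class scores; return the arg-max class per pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


score_map = [[[0.0, 1.0], [2.0, 0.0]]]  # toy input: 1 x 2 image, 2 classes
print(label_pixels(score_map))
```

Since every pixel is classified on its own, nothing in this step enforces agreement between neighbouring labels.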
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
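The selection rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree layout and the cost values below are hypothetical and were chosen so that leaf C2 ends up covered by C5, as in the slide annotation; how the cost S is computed is outside this sketch.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on the path from the leaf to the root."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # the root has no parent, which ends the walk
        best[leaf] = choice
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 is a child of C6, C5 and C6 merge into C7
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.4, "C2": 0.5, "C3": 0.1, "C5": 0.2, "C6": 0.3, "C7": 0.9}
print(optimal_cover(parent, cost, ["C1", "C2", "C3"]))
```

Each leaf then takes the class predicted for its selected component.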
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Vector A: the BBOX CNN produces feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image. The two are concatenated as [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a REGION CNN. The two are concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
[Figure: the penultimate fully connected layer feeds an SVM]
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al)
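Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch with toy label maps and a hypothetical function name:

```python
def pixel_iu(prediction, ground_truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    intersection = union = 0
    for pred_row, true_row in zip(prediction, ground_truth):
        for p, t in zip(pred_row, true_row):
            in_pred, in_true = p == category, t == category
            intersection += int(in_pred and in_true)
            union += int(in_pred or in_true)
    return intersection / union if union else 0.0


prediction = [[1, 1], [0, 0]]    # toy label maps
ground_truth = [[1, 0], [0, 0]]
print(pixel_iu(prediction, ground_truth, 1))
```

Averaging the per-category values over all categories gives a single score for the dataset.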
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions.
The algorithm incorporates a replay memory to collect experiences.
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
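A minimal sketch of the pieces named above: a replay memory, the greedy policy over Q-values, and the one-step Q-learning target. The discount factor, the epsilon-greedy exploration and all names are illustrative assumptions; the Q-network itself is left out.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.9  # hypothetical discount factor


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated value."""
    return int(np.argmax(q_values))


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def q_target(reward, next_q_values, done):
    """One-step Q-learning target used to train the network output."""
    return reward if done else reward + GAMMA * float(np.max(next_q_values))
```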
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks)
A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise
Total samples: 200K positive and 20M negative
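The labeling rule can be sketched as below. The 0.5 threshold is from the slide; the square windows, the minimum window size and the function names are assumptions made only for illustration.

```python
import random


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """1 (positive) if the window overlaps any annotated face by more than the threshold."""
    return int(any(iou(window, face) > threshold for face in faces))


def sample_windows(img_w, img_h, faces, n, rng, min_size=32):
    """Randomly crop n square sub-windows and label each one (hypothetical sampler)."""
    samples = []
    for _ in range(n):
        size = rng.randint(min_size, min(img_w, img_h))
        x = rng.randint(0, img_w - size)
        y = rng.randint(0, img_h - size)
        window = (x, y, x + size, y + size)
        samples.append((window, label_window(window, faces)))
    return samples
```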
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
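Three rescalings per octave means consecutive scales differ by a factor of 2^(1/3). A short sketch of the resulting scale list; the scale bounds and the function name are assumptions.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced by 2**(1/steps_per_octave) between the two bounds."""
    lo = math.ceil(math.log2(min_scale) * steps_per_octave - 1e-9)
    hi = math.floor(math.log2(max_scale) * steps_per_octave + 1e-9)
    return [2.0 ** (k / steps_per_octave) for k in range(lo, hi + 1)]
```

For example, covering one octave down and one octave up (0.5 to 2.0) gives seven scales, including the original resolution.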
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
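A small NumPy sketch of why the conversion works: a fully-connected layer trained on C x k x k inputs is the same linear map as a k x k convolution, so sliding it over a larger feature map yields one score per position, i.e. a heat-map. Shapes and names are assumptions, and the explicit loops favor clarity over speed.

```python
import numpy as np


def fc_as_conv(feature_map, fc_weights, k):
    """Apply FC weights as a k x k convolution (valid padding, stride 1).

    feature_map: (C, H, W); fc_weights: (out, C * k * k).
    Returns a heat-map of shape (out, H - k + 1, W - k + 1).
    """
    c, h, w = feature_map.shape
    out = fc_weights.shape[0]
    kernels = fc_weights.reshape(out, c, k, k)
    heat = np.empty((out, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = feature_map[:, i:i + k, j:j + k]
            # Contract over (C, k, k): identical to the FC layer on this patch.
            heat[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return heat
```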
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
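A minimal greedy NMS sketch in NumPy, not the referenced implementation: keep the highest-scoring box, discard the boxes that overlap it too much, and repeat. The 0.5 IoU threshold and the (x1, y1, x2, y2) box format are assumptions.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    if len(boxes) == 0:
        return []
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # best score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]  # drop boxes overlapping box i
    return keep
```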
Detection Faces DDFD Results
Detection Faces DDFD Results
Precision vs. Recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
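A sketch of a triplet objective of the kind used to learn such an embedding: the anchor must be closer to the positive (same identity) than to the negative (other identity) by a margin. The margin value 0.2 and the semi-hard selection rule are assumptions based on the cited FaceNet paper, not on the slides, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0), computed per triplet."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)


def semi_hard_mask(d_pos, d_neg, margin=0.2):
    """Negatives farther than the positive but still inside the margin."""
    return (d_neg > d_pos) & (d_neg < d_pos + margin)
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of triplets matters for training progress.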
Faces Recognition FaceNet
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
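Because bag-of-visual-words vectors are highly sparse, they are typically stored in an inverted file: each visual word points to the images that contain it. A minimal sketch with hypothetical image IDs and word IDs, scoring by the number of shared words:

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word ids."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank database images by the number of visual words shared with the query."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()
```

Only the index entries of the words present in the query are visited, so the cost does not grow with the vocabulary size.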
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
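Sum or max pooling collapses a conv feature map of shape (C, H, W) into a single C-dimensional descriptor. A minimal sketch; the final L2 normalization is a common choice assumed here, not something stated on the slide.

```python
import numpy as np


def pool_conv_features(fmap, mode="sum"):
    """Pool a (C, H, W) feature map over space and L2-normalize the result."""
    flat = fmap.reshape(fmap.shape[0], -1)
    if mode == "sum":
        v = flat.sum(axis=1)
    elif mode == "max":
        v = flat.max(axis=1)
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```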
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Objects Recognition Retrieval
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids: 25K-D vector
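A sketch of the encoding step: every spatial position of the conv feature map is treated as a local descriptor and assigned to its nearest centroid, giving an assignment map (42x32 in the slide's setting) and a bag-of-words histogram with one bin per centroid. The function names and the brute-force distance computation are assumptions.

```python
import numpy as np


def assign_words(fmap, centroids):
    """fmap: (C, H, W); centroids: (K, C). Returns an (H, W) map of word ids."""
    c, h, w = fmap.shape
    local = fmap.reshape(c, -1).T  # (H * W, C) local descriptors
    d2 = (
        (local ** 2).sum(axis=1)[:, None]
        - 2.0 * local @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )  # squared Euclidean distances, shape (H * W, K)
    return d2.argmin(axis=1).reshape(h, w)


def bow_histogram(assignments, k):
    """Count how often each of the k visual words occurs in the assignment map."""
    return np.bincount(assignments.ravel(), minlength=k)
```

Keeping the assignment map, and not only the histogram, is what allows a histogram to be computed over just a sub-region of the image.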
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
θ_i = θ_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
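The last two steps can be sketched as follows: feature maps from the coarser scales are upsampled to the finest resolution, concatenated along the channel axis, and a linear soft-max classifier is applied independently at every pixel. Nearest-neighbor upsampling, the shapes and the names are assumptions chosen for brevity.

```python
import numpy as np


def upsample_nearest(fmap, factor):
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)


def pixelwise_softmax(features, weights, bias):
    """features: (D, H, W); weights: (K, D); bias: (K,). Returns (K, H, W) probabilities."""
    logits = np.tensordot(weights, features, axes=1) + bias[:, None, None]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)
```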
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT (binary partition tree) [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example: C2 will be labelled with the class of C5
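The selection rule can be sketched on a toy segmentation tree. The tree, the cost values and the function name below are hypothetical; only the rule itself (minimal cost S along the path from leaf to root) comes from the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on its path to the root.

    parent: dict node -> parent node (the root maps to None)
    cost:   dict node -> cost S of labelling that component
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best


# Hypothetical tree: C1, C2 merge into C5; C3, C4 merge into C6; C5, C6 into C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.1, "C2": 0.6, "C3": 0.3, "C4": 0.4,
        "C5": 0.2, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

With these made-up costs, C2 is covered by its parent C5, while C1, C3 and C4 keep their own component.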
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
[Figure: BBOX CNN, feature vector 1; BBOX CNN, feature vector 2 (background masked out with the mean image); concatenation [1 2] = vector A]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
[Figure: BBOX CNN, feature vector 1; REGION CNN, feature vector 2; concatenation [1 2] = vector B]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
vector C
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
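Pixel IU for a class is the Jaccard index between the predicted and ground-truth pixel sets of that class. A minimal per-class sketch; returning NaN for classes absent from both maps is a convention assumed here.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            ious.append(float("nan"))  # class absent in both maps
        else:
            ious.append(np.logical_and(p, g).sum() / union)
    return np.array(ious)
```

The mean over classes can then be taken with `np.nanmean` so that absent classes are skipped.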
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Detection Objects Fast R-CNN
Girshick, Ross. "Fast R-CNN." ICCV 2015
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Same as SPP [3] but single scale
Detection Objects Fast R-CNN
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
RoI pooling on CONV5: a region of size h x w is divided into H' x W' bins
Size of pooling bins: h/H' x w/W'
Max pooling within each bin
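A sketch of RoI max pooling under the bin-size rule above: an h x w region of the CONV5 map is split into an H' x W' grid and each bin is max-pooled, so every region yields a fixed-size output. The integer RoI coordinates (in feature-map units, end-exclusive) and the floor/ceil bin boundaries are assumptions.

```python
import math

import numpy as np


def roi_max_pool(fmap, roi, out_h, out_w):
    """fmap: (C, H, W); roi: (x1, y1, x2, y2). Returns a (C, out_h, out_w) array."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    out = np.empty((fmap.shape[0], out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        ys = y1 + math.floor(i * h / out_h)
        ye = y1 + math.ceil((i + 1) * h / out_h)
        for j in range(out_w):
            xs = x1 + math.floor(j * w / out_w)
            xe = x1 + math.ceil((j + 1) * w / out_w)
            out[:, i, j] = fmap[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```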
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Multi-task loss
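The slide only names the loss; as described in the cited Fast R-CNN paper, it adds a classification log loss to a smooth L1 box-regression loss that is active only for non-background RoIs. A minimal single-RoI sketch, with class 0 as background and the weight `lam` as assumed conventions:

```python
import numpy as np


def smooth_l1(x):
    """0.5 * x^2 if |x| < 1, else |x| - 0.5 (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)


def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Log loss on the true class plus localization loss for foreground RoIs."""
    cls_loss = -np.log(class_probs[true_class])
    loc_loss = smooth_l1(box_pred - box_target).sum() if true_class >= 1 else 0.0
    return cls_loss + lam * loc_loss
```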
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Selective Search, CPMC, MCG
Object proposal computation is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Selective Search, CPMC, MCG
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
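A sketch of how the k = 9 anchor boxes at one position could be generated. The specific scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumptions taken from the cited paper's defaults; the slide only states 3 scales and 3 ratios.

```python
import numpy as np


def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) anchors as (x1, y1, x2, y2), centered at the origin.

    Each anchor has area scale**2 and aspect ratio (height / width) equal to ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)
```

Shifting these boxes to every position of the conv feature map gives the full set of anchors that the RPN scores and regresses.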
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals]
Step 1: Train the RPN, initialized with an ImageNet pre-trained model
ImageNet weights (fine-tuned)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN proposals (learned in Step 1), class probabilities]
Step 2: Train Fast R-CNN with the learned RPN proposals
ImageNet weights (fine-tuned)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals]
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
Weights from Step 2 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: conv layers, Conv Layer 5, RPN proposals (learned in Step 3), class probabilities]
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
Weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
The embedding is learned by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
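A minimal sketch of a triplet loss on embedding vectors: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The margin value, the 2-D toy embeddings and the function names are assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


# Toy 2-D embeddings (hypothetical values).
anchor = [1.0, 0.0]
positive = [0.8, 0.6]
negative = [0.0, 1.0]
easy = triplet_loss(anchor, positive, negative)   # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)   # roles swapped: violated
```

Triplets whose loss is already zero contribute no gradient, which is why the selection of informative triplets matters during training.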
Faces Recognition FaceNet
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 1 (NN1): ZF
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016
Objects Recognition Retrieval
Image Database, Visual Query ("A dog"), Expected outcome
Image Database, Visual Query ("This dog"), Expected outcome
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
[Table: each visual word mapped to the IDs of the images that contain it]
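A minimal sketch of an inverted file over bags of visual words: each visual word points to the images that contain it, so a query only touches images sharing at least one word. The image names, word IDs and the simple vote-counting score are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """Map each visual word id to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()


# Hypothetical database: image id -> visual words found in that image.
database = {"img_a": [1, 2, 2, 7], "img_b": [2, 3], "img_c": [5, 6]}
index = build_inverted_file(database)
ranking = query(index, [1, 2, 9])
```

Because the bag-of-words vectors are highly sparse, most images are never visited for a given query, which is what makes the structure scale.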
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max pooled conv features as global representation
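A minimal sketch of sum and max pooling of convolutional feature maps into a single global descriptor, assuming the activations are stored as a list of C channels, each an H x W grid. The L2 normalization step and the toy values are assumptions for illustration.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each H x W channel into one number, giving a C-dim descriptor,
    then L2-normalize it so descriptors can be compared with a dot product."""
    reduce_fn = sum if mode == "sum" else max
    pooled = [reduce_fn(v for row in channel for v in row) for channel in feature_maps]
    norm = sum(v * v for v in pooled) ** 0.5
    return [v / norm for v in pooled] if norm > 0 else pooled


# Hypothetical activations: 2 channels of size 2x2.
feature_maps = [
    [[0.0, 3.0], [1.0, 0.0]],
    [[0.0, 0.0], [2.0, 1.0]],
]
sum_descriptor = pool_conv_features(feature_maps, "sum")
max_descriptor = pool_conv_features(feature_maps, "max")
```

The resulting vector has one dimension per channel, regardless of the spatial size of the input image.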
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
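A minimal sketch of VLAD encoding: each local descriptor is assigned to its nearest centroid, the residuals (descriptor minus centroid) are accumulated per centroid, and the concatenation is L2-normalized. The centroids and descriptors below are toy values; in practice the centroids would come from k-means on training descriptors.

```python
def vlad(descriptors, centroids):
    """Return the L2-normalized concatenation of per-centroid residual sums."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Index of the centroid closest to this descriptor.
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, centroids[k])),
        )
        for i in range(dim):
            residuals[nearest][i] += d[i] - centroids[nearest][i]
    flat = [v for r in residuals for v in r]
    norm = sum(v * v for v in flat) ** 0.5
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical 2-D local features and 2 centroids -> 4-D encoding.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [2.0, 0.0], [10.0, 14.0]]
encoding = vlad(descriptors, centroids)
```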
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids, 25K-D vector
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
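A minimal sketch of a pixel-wise soft-max classifier: every pixel has one score per class, the scores are turned into probabilities, and the most probable class becomes the label. The score values and the three classes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of per-class scores; return the label map."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


# Hypothetical 2x2 image with 3 classes per pixel.
score_map = [
    [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]],
    [[0.0, 1.0, 0.0], [1.5, 1.0, 0.2]],
]
labels = label_pixels(score_map)
```

Since each pixel is classified on its own, neighbouring pixels can receive inconsistent labels.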
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
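A minimal sketch of the optimal-cover rule stated above: for each leaf, walk up the segmentation tree to the root and keep the component with the lowest cost S. The tree, the node names and the cost values are hypothetical and only illustrate the search.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: pixels p1, p2 under C2; pixel p3 under C3; C5 is the root.
parent = {"p1": "C2", "p2": "C2", "p3": "C3", "C2": "C5", "C3": "C5", "C5": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "C2": 0.6, "C3": 0.2, "C5": 0.4}
cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

In this toy tree the pixels under C2 end up covered by C5, whose cost is lower than that of C2, while p3 is covered by C3.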
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
vector A: BBOX CNN applied to the bounding box (feature vector 1) and to the region with the background masked out with the mean image (feature vector 2), concatenated as [1 2]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
vector B: BBOX CNN for feature vector 1 and REGION CNN for feature vector 2, concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer, followed by an SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
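A minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with the category in both maps, divided by the number labelled with it in either map. The label maps below are toy values.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels assigned to `category`."""
    inter = union = 0
    for p_row, g_row in zip(predicted, ground_truth):
        for p, g in zip(p_row, g_row):
            in_p, in_g = p == category, g == category
            inter += in_p and in_g
            union += in_p or in_g
    return inter / union if union else 0.0


# Hypothetical 2x3 label maps with background (0) and one category (1).
predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
score = pixel_iu(predicted, ground_truth, category=1)
```

Two pixels agree on the category and four pixels carry it in at least one map, so the score is 0.5.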
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
Same as SPP [3], but single scale
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015
Slide credit: Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
RoI pooling on CONV5: a region of interest of size h x w is divided into H' x W' bins
Size of pooling bins: h/H' x w/W'
Max pooling within each bin
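A minimal sketch of RoI max pooling as described on this slide: an h x w region of the feature map is split into H' x W' bins of roughly h/H' x w/W' cells, and the maximum of each bin is kept. The integer rounding of the bin boundaries and the (top, left, h, w) RoI format are assumptions, and the sketch assumes h >= H' and w >= W'.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """roi = (top, left, h, w) in feature-map cells; returns an out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        y0, y1 = top + (i * h) // out_h, top + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            x0, x1 = left + (j * w) // out_w, left + ceil_div((j + 1) * w, out_w)
            row.append(
                max(feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1))
            )
        pooled.append(row)
    return pooled


# Hypothetical 4x5 single-channel feature map and a 4x4 RoI pooled to 2x2.
feature_map = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
]
pooled = roi_max_pool(feature_map, roi=(0, 1, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the proposal, the output grid has a fixed size, so it can be fed to fully-connected layers.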
Detection Objects Fast R-CNN
Slide credit: Amaia Salvador
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99) [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In IEEE International Conference on Computer Vision (ICCV) 2011 (pp. 1879-1886). IEEE
CPMC: Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2010 (pp. 3241-3248). IEEE
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: Faster R-CNN pipeline: conv layers up to Conv Layer 5, RPN producing RPN proposals, RoI pooling layer, FC layers, class scores and class probabilities]
RPN outputs: objectness scores (object / no object) and bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
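A minimal sketch of how k = 9 anchor boxes can be generated at one sliding position by combining 3 scales with 3 aspect ratios. The concrete scale values (128, 256, 512 pixels), the ratios (0.5, 1, 2) and the centre coordinates are assumptions chosen for illustration.

```python
def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors; each anchor has area scale**2 and
    height / width equal to ratio."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / ratio ** 0.5
            height = scale * ratio ** 0.5
            anchors.append(
                (
                    center_x - width / 2,
                    center_y - height / 2,
                    center_x + width / 2,
                    center_y + height / 2,
                )
            )
    return anchors


# Anchors centred on a hypothetical image position (300, 200).
anchors = make_anchors(300, 200)
```

For every anchor the RPN predicts an objectness score and a bounding box refinement, so each position yields 2k scores and 4k box coordinates.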
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: the same pipeline, with the RoI pooling layer, FC layers and class scores labelled as Fast R-CNN]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize RPN, which is trained again. Conv layers: weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement L
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Míriam Bellver]
Object is localized based on visual features from AlexNet FC6
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
Reward function R, computed with respect to the ground-truth bounding box
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Trigger reward: 3, with a minimum IoU of 0.6
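A minimal sketch of the two rewards. The trigger values (reward 3, minimum IoU 0.6) are from the slide; the rule for transformation actions (+1 if the overlap with the ground truth improves, -1 otherwise) and the function names are assumptions made for illustration.

```python
def movement_reward(iou_before, iou_after):
    """Assumed rule: +1 if the transformation improved the overlap with the
    ground-truth box, -1 otherwise."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, -eta otherwise."""
    return eta if final_iou >= tau else -eta


# Hypothetical episode: two transformations followed by the trigger.
rewards = [movement_reward(0.30, 0.50), movement_reward(0.50, 0.45), trigger_reward(0.65)]
```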
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
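A minimal sketch of the greedy policy and of the standard Q-learning target used to train an action-value estimator. The Q-values are passed in as plain lists standing in for the outputs of the network, and the discount factor of 0.9 is an assumption.

```python
def select_action(q_values):
    """Greedy policy: index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Target for Q(s, a): reward plus the discounted best value of the next state.
    After a terminal action (the trigger) there is no next state."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for three actions in the current and next state.
action = select_action([0.1, 0.7, 0.3])
target = q_learning_target(1.0, [0.2, 0.5], gamma=0.9)
final_target = q_learning_target(3.0, [0.2, 0.5], terminal=True)
```

With a replay memory, tuples of (state, action, reward, next state) are stored and sampled later to compute such targets.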
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
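A minimal sketch of the labelling rule for training samples: a sub-window is positive when its IoU with some annotated face exceeds 50%. The (x1, y1, x2, y2) box format and the example coordinates are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, face_boxes, threshold=0.5):
    """Return 1 (positive) if the window overlaps any face with IoU > threshold."""
    return int(any(iou(window, face) > threshold for face in face_boxes))


# Hypothetical annotated face and two sampled sub-windows.
faces = [(50, 50, 150, 150)]
positive_label = label_window((60, 60, 160, 160), faces)
negative_label = label_window((140, 140, 240, 240), faces)
```

Random sampling naturally produces many more negatives than positives, which is consistent with the 200K versus 20M totals above.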
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Objects Recognition Retrieval
Babenko A & Lempitsky V (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias G, Sicre R & Jégou H (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis Y, Mellina C & Osindero S (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
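A minimal sketch of sum and max pooling of convolutional feature maps into a single global descriptor, one value per channel. The toy activations and the final L2 normalisation are assumptions for illustration; the cited papers add further steps (weighting, whitening, regions) that are not shown here.

```python
import math

def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps (each H x W) into one C-dimensional descriptor."""
    reduce_fn = sum if mode == "sum" else max
    vec = [reduce_fn(v for row in fmap for v in row) for fmap in feature_maps]
    # L2-normalise so descriptors can be compared with a dot product (assumption).
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# Hypothetical toy activations: 2 channels, 2x2 spatial positions each.
maps = [[[1.0, 2.0], [0.0, 1.0]],
        [[0.0, 3.0], [0.0, 0.0]]]
sum_desc = pool_conv_features(maps, "sum")
max_desc = pool_conv_features(maps, "max")
```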
Objects Recognition Retrieval
Ng J, Yang F & Davis L (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Objects Recognition Retrieval
[Figure: image at 336x256 resolution, conv5_1 from VGG16 [1] (42x32), 25K centroids, 25K-D vector]
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
One lecture organized in four parts
[Figure: local analysis for Proposals, Detection, Recognition and Segmentation]
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets:
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
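The optimal-cover rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost S. The tree, the costs and the class labels below are hypothetical values chosen so that C2 ends up with the class of C5, as in the slide; how the cost S is computed is not covered here.

```python
def optimal_cover(parent, cost, label, leaves):
    """For each leaf, pick the component on its path to the root with minimal cost S."""
    result = {}
    for leaf in leaves:
        best = node = leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent.get(node)  # the root maps to None, which ends the walk
        result[leaf] = (best, label[best])
    return result

# Hypothetical tree: C7 is the root; C5 and C6 are its children; C1, C2, C3 are leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C5": 0.3, "C6": 0.5, "C7": 0.9}
label = {"C1": "car", "C2": "sky", "C3": "tree", "C5": "road", "C6": "grass", "C7": "building"}
cover = optimal_cover(parent, cost, label, ["C1", "C2", "C3"])
```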
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
[Figure: vector A. The bounding box and the region foreground (background masked out with the mean image) each go through the BBOX CNN; feature vectors 1 and 2 are concatenated as [1 2]]
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Training: 2 networks trained in isolation
Testing: results are combined
[Figure: vector B. The BBOX CNN gives feature vector 1 and the REGION CNN gives feature vector 2; they are concatenated as [1 2]]
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
[Figure: vector C]
Objects Segmentation SDS
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al)
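The pixel IU (Jaccard index) for one category can be sketched as the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The label maps below are hypothetical toy data, and averaging over categories is left out.

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels assigned to `category`."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_p, in_t = p == category, t == category
            inter += in_p and in_t
            union += in_p or in_t
    return inter / union if union else float("nan")

# Hypothetical 2x3 label maps (0 = background, 1 = object).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iu = pixel_iu(pred, truth, category=1)
```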
Objects Segmentation SDS
Slide credit Eduard Fontdevila
One lecture organized in four parts
[Figure: local analysis for Proposals, Detection, Recognition and Segmentation]
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
Same as SPP [3] but single scale
Detection Objects Fast R-CNN
He, Kaiming, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.
Slide credit Joost van de Weijer
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
[Figure: RoI max pooling on the CONV5 feature map: an h x w region is divided into H' x W' bins]
Size of pooling bins: h/H' x w/W'
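A minimal sketch of RoI max pooling on a single-channel feature map: the h x w region is split into H' x W' bins of roughly h/H' x w/W' cells and the maximum of each bin is kept. The floor/ceil rounding of bin borders and the toy feature map are assumptions; a real layer works on all channels and on regions projected from image coordinates.

```python
def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the region roi = (top, left, h, w) of a 2D map into out_h x out_w bins."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [floor(i*h/out_h), ceil((i+1)*h/out_h)) of the region.
        r0 = top + (i * h) // out_h
        r1 = max(top + ((i + 1) * h + out_h - 1) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = max(left + ((j + 1) * w + out_w - 1) // out_w, c0 + 1)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map with values 0..15, pooled to a fixed 2x2 output.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
```

Whatever the size of the region, the output has the same fixed size, which is what lets the fully-connected layers that follow accept any proposal.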
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
Multi-task loss
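The slide only names the loss. As a reminder, the form below follows the Fast R-CNN paper as best recalled: a classification term plus a box-regression term that is active only for non-background classes. The symbols (p for the predicted class distribution, u for the true class, t^u for the predicted box offsets of class u, v for the regression target, lambda for the balancing weight) are not in the slide.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v)

L_{\mathrm{cls}}(p, u) = -\log p_{u}
\qquad
L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t^{u}_{i} - v_{i}\right)

\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```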
Detection Objects Faster R-CNN
Ren S, He K, Girshick R and Sun J (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Object proposal computation is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande K E, Uijlings J R, Gevers T & Smeulders A W (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira J & Sminchisescu C (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez P, Pont-Tuset J, Barron J, Marques F & Malik J (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Selective Search, CPMC, MCG
Replace the usage of external object proposals with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN pipeline: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN pipeline: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
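To make k = 9 concrete, the sketch below builds the anchor boxes for one sliding-window position by combining 3 scales with 3 aspect ratios. The specific scale and ratio values are assumptions recalled from the Faster R-CNN paper (the slide gives only the counts), and `make_anchors` is a hypothetical helper.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width; each anchor keeps an area of s * s
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)
```

For every anchor the RPN predicts an objectness score and a box regression, so each position yields k scores and k refined boxes.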
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN pipeline: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN pipeline: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 1: Train RPN initialized with an ImageNet pre-trained model
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals. ImageNet weights (fine tuned)]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 2: Train Fast R-CNN with learned RPN proposals
[Figure: conv layers, Conv Layer 5, RPN proposals (learned in 1), class probabilities. ImageNet weights (fine tuned)]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 3: The model trained in 2 is used to initialize RPN and train again
[Figure: conv layers, Conv Layer 5, RPN, RPN proposals. Weights from Step 2 (fixed)]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 4: Fine tune FC layers of Fast R-CNN using the same shared convolutional layers as in 3
[Figure: conv layers, Conv Layer 5, RPN proposals (learned in 3), class probabilities. Weights from Steps 2 & 3 (fixed)]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Detection Objects Reinforcement L
Caicedo, Juan C. and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]
Detection Objects Reinforcement L
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R
[Figure: ground-truth bounding box]
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
[Equation: trigger reward of magnitude 3, with a minimum IoU of 0.6]
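A hedged sketch of the two rewards, under the assumption (recalled from the cited paper, not stated on the slide) that a transformation action earns +1 when it increases the IoU with the ground-truth box and -1 otherwise, and that the trigger earns +3 when the final IoU reaches the minimum of 0.6 and -3 otherwise. The function and parameter names `move_reward`, `trigger_reward`, `eta` and `tau` are hypothetical.

```python
def move_reward(iou_before, iou_after):
    """+1 if a transformation action improved the IoU with the ground truth, else -1."""
    return 1 if iou_after > iou_before else -1

def trigger_reward(iou, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth by at least tau, else -eta."""
    return eta if iou >= tau else -eta

# Hypothetical episode: two moves followed by the trigger.
rewards = [move_reward(0.30, 0.45), move_reward(0.45, 0.40), trigger_reward(0.65)]
```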
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
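The greedy policy and the replay memory can be sketched with plain Python containers. The discount factor, the buffer size, the toy transitions and the one-step target `q_target` are assumptions standing in for the Q-network training that the slide only summarises.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """One-step Q-learning target for a stored transition."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

# Replay memory: a bounded buffer of (state, action, reward, next_state, terminal).
replay_memory = deque(maxlen=1000)
replay_memory.append(("s0", 1, 1.0, "s1", False))
replay_memory.append(("s1", 8, 3.0, None, True))
batch = random.sample(list(replay_memory), 2)  # experiences drawn for one update
```

With one output unit per action, a single forward pass of the network gives the whole `q_values` list for a state.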
Detection Objects Reinforcement
Slide credit Míriam Bellver
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement
Slide credit Míriam Bellver
Detection Objects Reinforcement
Slide credit Míriam Bellver
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Detection Faces DDFD Train
Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
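One possible reading of "3 times per octave" is that three consecutive rescaling steps double (or halve) the image size, so each step multiplies the scale by 2^(1/3). The sketch below lists the scale factors under that assumption; the range limits and the helper name `pyramid_scales` are hypothetical.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors spaced so that per_octave steps double the image size."""
    scales, n = [], 0
    while True:
        s = min_scale * 2.0 ** (n / per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for floating-point rounding
            return scales
        scales.append(s)
        n += 1

# Two octaves, from half size to double size.
scales = pyramid_scales(0.5, 2.0)
```

Running the fixed-size detector on every rescaled copy is what lets one window size respond to faces of many sizes.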
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
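A toy sketch of why the conversion works: a fully-connected unit that scores a k x k window is the same computation as a k x k filter, so sliding that filter over a larger input yields one score per window position, that is, a heat-map. The single-channel input, the weights and the function names are illustrative assumptions.

```python
def fc_score(window, weights, bias):
    """A fully-connected unit applied to a k x k window (weights kept in 2D form)."""
    return bias + sum(w * x for w_row, x_row in zip(weights, window)
                      for w, x in zip(w_row, x_row))

def conv_heatmap(image, weights, bias):
    """The same unit applied as a k x k filter at every position (stride 1, no padding)."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[fc_score([r[j:j + k] for r in image[i:i + k]], weights, bias)
             for j in range(cols)] for i in range(rows)]

# Hypothetical 2x2 unit applied to a 3x3 input: the output is a 2x2 heat-map.
weights = [[1, 0], [0, 1]]
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
heat = conv_heatmap(image, weights, bias=0)
```

On an input of exactly k x k the heat-map has a single value, equal to the original fully-connected output.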
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
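A minimal greedy NMS sketch: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap an already kept one by more than a threshold. The IoU threshold of 0.3 and the toy boxes are assumptions; this is not the implementation from the cited source.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the boxes kept, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```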
Detection Faces DDFD Results
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
[Figure: local analysis for Proposals, Detection, Recognition and Segmentation]
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Faces Recognition FaceNet
[Figure: faces are mapped by FaceNet to a Euclidean space where distances correspond to face similarity]
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q. and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
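The triplet-based embedding learning named on these slides can be sketched with a hinge on squared Euclidean distances: an anchor face should be closer to a positive (same identity) than to a negative (other identity) by at least a margin. The margin value of 0.2 and the 2-D toy embeddings are assumptions for illustration, and the selection of well chosen triplets is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer than the negative by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

# Hypothetical 2-D embeddings.
easy = triplet_loss((1.0, 0.0), (1.0, 0.0), (0.0, 1.0))   # already satisfied
hard = triplet_loss((1.0, 0.0), (0.0, 1.0), (0.6, 0.8))   # negative closer than positive
```

Only triplets with a non-zero loss contribute a gradient, which is one way to see why the choice of triplets matters.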
Faces Recognition FaceNet
Faces Recognition FaceNet
Zeiler, Matthew D. and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 1 (NN1): ZF
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer, followed by an SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
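As a reminder of what the pixel IU (Jaccard index) measures, here is a small sketch for one category: pixels labelled with that category in both maps, divided by pixels labelled with it in either map. The label maps below are toy data, and `pixel_iu` is a hypothetical helper name, not code from the SDS release.

```python
def pixel_iu(pred, gt, label):
    """Intersection over union of the pixels carrying `label` in two label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = (p == label), (g == label)
            inter += in_pred and in_gt
            union += in_pred or in_gt
    return inter / union if union else float("nan")


# Toy 2 x 3 label maps (0 = background, 1 = some category).
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

print(pixel_iu(pred, gt, 1))
```

Averaging this value over all categories gives a mean IU for the dataset.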
One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Fast R-CNN
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015
Slide credit: Joost van de Weijer
Slide credit: Amaia Salvador
RoI pooling on CONV5: an h x w region of interest is divided into an H' x W' grid of pooling bins
Size of pooling bins: h/H' x w/W'
Max pooling within each bin
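The bin arithmetic can be made concrete with a short pure-Python sketch. It pools a single-channel toy feature map; the function name `roi_max_pool`, the (top, left, h, w) RoI convention and the floor/ceil bin boundaries are assumptions made for illustration, not the Fast R-CNN source code.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an RoI of a 2-D feature map into an out_h x out_w grid.

    roi = (top, left, h, w) in feature-map coordinates.
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i spans roughly h / out_h rows of the RoI.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 single-channel "CONV5" map holding the values 1..24.
conv5 = [[6 * r + c + 1 for c in range(6)] for r in range(4)]
print(roi_max_pool(conv5, (0, 0, 4, 6), 2, 2))
print(roi_max_pool(conv5, (1, 1, 3, 4), 2, 2))
```

Whatever the size of the RoI, the output grid has the same fixed size, which is what lets the fully connected layers that follow accept regions of any shape.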
AlexNet [4], VGG16 [5], VGG_1024 [6]
Multi-task loss
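The slide only names the multi-task loss. As a hedged reminder, the Fast R-CNN paper defines it as a classification log loss plus, for non-background RoIs, a smooth L1 loss on the four box regression targets. The sketch below follows that definition; the function names, the argument layout and the default weight `lam=1.0` are illustrative choices.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def multi_task_loss(probs, true_class, pred_box, target_box, lam=1.0):
    """Classification loss plus (for non-background RoIs) localization loss.

    probs: class probabilities, with index 0 standing for background.
    pred_box / target_box: 4 regression values each (x, y, w, h offsets).
    """
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:  # background RoIs have no box to regress
        return cls_loss
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


loss = multi_task_loss([0.1, 0.9], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
print(loss)
```

Training both terms together lets the classifier and the bounding-box regressor share one network and one backward pass.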
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99) [Python code] [Matlab code]
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE
[CPMC] Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN)
Architecture diagram: the conv layers produce the Conv Layer 5 feature map, which feeds the RPN; the RPN proposals and the Conv Layer 5 features go through the RoI pooling layer and the FC layers to produce class scores and class probabilities
Objectness scores (object / no object)
Bounding box regression
In practice, k = 9 anchors per location (3 different scales and 3 aspect ratios)
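A minimal sketch of how the k = 9 reference boxes (anchors) at one location can be enumerated. The scales 128, 256, 512 and the aspect ratios 1:2, 1:1, 2:1 are the defaults I recall from the Faster R-CNN paper and should be read as assumptions here, as should the function name and the (x0, y0, x1, y1) box layout.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x0, y0, x1, y1) centred on (cx, cy).

    Each box has area scale**2 and height / width equal to the ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0, 0)
print(len(anchors))
print(anchors[1])  # the square anchor of the smallest scale
```

The RPN predicts one objectness score pair and one box refinement per anchor, so each location yields 2k scores and 4k regression values.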
The detection branch of the diagram (RoI pooling layer, FC layers, class scores) is Fast R-CNN
4-step training to share features for RPN and Fast R-CNN
Step 1: Train the RPN, initialized with an ImageNet pre-trained model
Conv layers: ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1
Conv layers: ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
Conv layers: weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3
Conv layers: weights from Steps 2 & 3 (fixed)
Detection accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Set of states S: (o, h)
o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Reward function R, measured against the ground-truth bounding box
Reward function R for the trigger action: reward magnitude 3, minimum IoU 0.6
The reward function considers the number of steps as a cost
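A sketch of the two rewards, assuming the form I recall from the Caicedo and Lazebnik paper: a transformation action earns +1 if it increases the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 if the final IoU reaches 0.6 and -3 otherwise. The function names, the (x0, y0, x1, y1) box layout and the handling of an unchanged IoU as -1 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the box is good enough when the search stops, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta


gt = (0, 0, 10, 10)
far, near, tight = (5, 5, 15, 15), (2, 2, 12, 12), (1, 0, 10, 10)
print(move_reward(far, near, gt), trigger_reward(near, gt), trigger_reward(tight, gt))
```

Because the trigger reward is larger in magnitude than a single move reward, the agent is encouraged to stop only once the box is a good fit.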
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
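The pieces named above (greedy policy, replay memory, Q-learning target) can be sketched without any deep learning library. The Q-network itself is left out: `q_values` is assumed to be the list of its outputs for one state, and the class name, capacity and discount factor `gamma=0.9` are illustrative values, not settings taken from the paper.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Regression target for the output unit of the action that was taken."""
    return reward if terminal else reward + gamma * max(next_q_values)


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push("s%d" % step, step, 1, "s%d" % (step + 1))
print(greedy_action([0.1, 0.7, 0.3]), len(memory.buffer), q_target(1, [0.5, 2.0]))
```

Sampling random mini-batches from the memory breaks the correlation between consecutive steps of the same search sequence.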
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
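Three scales per octave means that consecutive levels of the image pyramid differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A small sketch of the resulting scale factors follows; the function name, the one-octave range in each direction and the 640x480 example image are assumptions chosen for illustration.

```python
def pyramid_scales(octaves_down=1, octaves_up=1, steps_per_octave=3):
    """Scale factors from 2**-octaves_down to 2**octaves_up, in equal log steps."""
    lo = -octaves_down * steps_per_octave
    hi = octaves_up * steps_per_octave
    return [2 ** (k / steps_per_octave) for k in range(lo, hi + 1)]


scales = pyramid_scales()
# Sizes of a hypothetical 640 x 480 test image at every pyramid level.
sizes = [(round(640 * s), round(480 * s)) for s in scales]
print([round(s, 3) for s in scales])
print(sizes)
```

A detector with a fixed 227x227 input then sees small faces in the enlarged levels and large faces in the reduced ones.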
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
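Why the conversion yields a heat-map can be seen on a toy case: a fully connected unit that looks at a k x k input is the same set of weights as a k x k convolution kernel, so sliding it over a larger image produces one score per window position. The sketch below uses a single unit and a single channel; `fc_as_conv` and the toy values are hypothetical.

```python
def fc_as_conv(image, weights, bias=0.0):
    """Apply one k x k 'fully connected' unit at every position of a 2-D image."""
    k = len(weights)
    height, width = len(image), len(image[0])
    heat_map = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            row.append(bias + sum(weights[i][j] * image[y + i][x + j]
                                  for i in range(k) for j in range(k)))
        heat_map.append(row)
    return heat_map


weights = [[1, 1],
           [1, 1]]
patch = [[1, 2],
         [3, 4]]                      # input of the original size: one score
larger = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]            # larger input: a 2 x 3 heat-map
print(fc_as_conv(patch, weights))
print(fc_as_conv(larger, weights))
```

On an input of the original size the result is the single score the fully connected unit would give; on a larger input the shared computation replaces an explicit loop over overlapping crops.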
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
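A minimal greedy NMS sketch: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. This is the common IoU-based variant, not the code of the cited blog post or of DDFD; the threshold of 0.5 and the toy boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The second box overlaps the first one heavily and is suppressed, while the third box, far away, is kept as a separate detection.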
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)...
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
...by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
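The triplet idea can be written down compactly: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below computes that hinge on squared Euclidean distances, as I recall it from the FaceNet paper; the margin value 0.2, the function names and the 2-D toy embeddings are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor, positive = (0.0, 0.0), (0.0, 1.0)
easy_negative = (2.0, 0.0)   # already far away: contributes no loss
hard_negative = (1.0, 0.0)   # as close as the positive: violates the margin
print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

Easy triplets give a zero loss and no gradient, which is why the choice of triplets matters so much during training.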
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016
Visual query "A dog" over an image database, with its expected outcome
Visual query "This dog" over an image database, with its expected outcome
Instance retrieval (instance: object, building, person, place…)
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
Inverted file: a table from each visual word to the IDs of the images that contain it
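Because the bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for every visual word, the images in which it occurs. A minimal sketch with made-up image IDs and word IDs follows; the function names and the simple vote-count ranking are illustrative assumptions, not the scoring of any cited system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map every visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_candidates(inverted, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
inverted = build_inverted_file(image_words)
print(sorted(inverted[1]))
print(rank_candidates(inverted, [1, 3]))
```

Only the images that share at least one word with the query are ever touched, which is what makes the search scale to large databases.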
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max pooled conv features as global representation
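Sum or max pooling turns a stack of convolutional feature maps into one value per channel, giving a compact global descriptor. A small sketch on toy 2x2 maps follows; the L2 normalization step and all names are assumptions for illustration and do not reproduce any of the cited methods exactly.

```python
import math


def global_pool(feature_maps, mode="sum"):
    """Collapse each H x W channel into one value by sum or max pooling."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale a descriptor to unit length so that dot products compare directions."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Two toy channels of a conv layer, each 2 x 2.
feature_maps = [[[1, 2], [3, 4]],
                [[0, 5], [1, 0]]]
print(global_pool(feature_maps, "sum"))
print(global_pool(feature_maps, "max"))
print(l2_normalize([3, 4]))
```

The descriptor length equals the number of channels, regardless of the spatial size of the input image.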
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Image resolution: 336x256
Features: conv5_1 from VGG16 [1], a 42x32 map of local features
Bag of words: 25K centroids, so each image becomes a 25K-D vector
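The encoding step can be sketched as nearest-centroid assignment followed by a histogram: every location of the conv feature map is a local feature, it is assigned to its closest centroid (visual word), and the image is described by the word counts. The toy below uses 3 centroids and 2-D features instead of 25K centroids and real conv5_1 features; names and data are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((f - m) ** 2 for f, m in zip(feature, centroids[c])))


def bag_of_words(local_features, centroids):
    """Histogram of visual-word assignments: one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


centroids = [(0, 0), (10, 10), (0, 10)]
local_features = [(1, 1), (9, 9), (0, 0), (1, 9)]
print(bag_of_words(local_features, centroids))
```

With 42x32 locations and 25K centroids, at most 1344 of the 25K bins can be non-zero, so the vector is sparse enough for an inverted file.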
Query representation:
Global Search (GS)
Local Search (LS)
One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
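A pixel-wise soft-max classifier turns the score vector of every pixel into class probabilities and labels the pixel with the most probable class, independently of its neighbours. A minimal sketch on a toy 1x2 score map with 3 classes follows; the function names and scores are made up.

```python
import math


def softmax(scores):
    """Numerically stable soft-max of a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] holds per-class scores; return the most probable class."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


score_map = [[[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]]]
print(label_pixels(score_map))
```

Since each pixel is decided on its own, neighbouring pixels can receive inconsistent labels.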
Detection Objects Fast R-CNN
26
Slide credit Amaia Salvador
H
h
w
h
w
Size of pooling binsh Hrsquo x w Wrsquo
wWrsquo
hHrsquomax pooling
CONV5
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
27
Slide credit Amaia Salvador
AlexNet [4] VGG16 [5] VGG_1024 [6]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Fast R-CNN
28
Slide credit Amaia Salvador
Multi-task loss
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
29
Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice, k = 9 anchors (3 different scales and 3 aspect ratios)
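The k = 9 boxes at one location can be sketched as below. The slide only states 3 scales and 3 aspect ratios; the concrete values used here (128, 256 and 512 pixels; ratios 0.5, 1 and 2) are assumptions, and make_anchors is a hypothetical helper name.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area equal to s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)
print(len(anchors))  # 9
```

In a full RPN, such a set would be placed at every position of the last convolutional feature map, with one objectness score pair and one box regression per anchor.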
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
4-step training to share features between the RPN and Fast R-CNN:
Step 1: Train the RPN, initialized with an ImageNet pre-trained model (conv layers: ImageNet weights, fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1 (conv layers: ImageNet weights, fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again (conv layers: weights from Step 2, fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3 (conv layers: weights from Steps 2 & 3, fixed)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Tables: Detection accuracy (Pascal VOC); Timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
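A minimal sketch of how such a state could be assembled. The split of the 90 history dimensions into 10 past actions of 9 possible actions each is an assumption taken from my reading of the cited paper, not from the slide; the helper names are hypothetical.

```python
NUM_ACTIONS = 9    # assumed: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumed: 10 past actions, so 9 * 10 = 90 dimensions

def encode_history(past_actions):
    """One-hot encode the most recent actions into a flat binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    recent = past_actions[-HISTORY_LEN:]
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h

def make_state(o, past_actions):
    """Concatenate the CNN descriptor o with the action history h."""
    return list(o) + encode_history(past_actions)

state = make_state([0.0] * 4096, [2, 8])
print(len(state))  # 4186
```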
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward function R, computed against the ground-truth bounding box
Reward function R for the trigger action: magnitude 3, minimum IoU 0.6
The reward function considers the number of steps as a cost
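The two rewards can be sketched as follows. The slide only preserves the values 3 and 0.6; the exact form used here (plus or minus 1 for a transformation depending on whether the IoU with the ground truth improves, plus or minus 3 for the trigger depending on the 0.6 threshold) is an assumption based on my reading of the cited paper, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(box_before, box_after, gt):
    """Assumed form: +1 if the transformation improved the IoU, -1 otherwise."""
    return 1.0 if iou(box_after, gt) > iou(box_before, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """Assumed form: +eta if the final box reaches the minimum IoU tau, -eta otherwise."""
    return eta if iou(box, gt) >= tau else -eta
```

Because every non-improving step is penalised, short search sequences are favoured, which matches the remark that the number of steps acts as a cost.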
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
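A minimal sketch of the pieces named above: greedy action selection over the network outputs, the standard Q-learning target, and a replay memory. The discount factor, the memory size and the epsilon-greedy exploration helper are assumptions for illustration; the Q-network itself is left out and only its output vector is used.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Policy: pick the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily (training time)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

# Replay memory: transitions (state, action, reward, next_state, terminal).
replay_memory = deque(maxlen=1000)
replay_memory.append(("state", 1, 1.0, "next_state", False))
batch = random.sample(list(replay_memory), k=1)
```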
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative
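The labelling rule can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) tuples; label_window is a hypothetical helper name.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, annotated_faces, threshold=0.5):
    """Return 1 (positive) if the window overlaps any annotated face with IoU > threshold."""
    return 1 if any(iou(window, face) > threshold for face in annotated_faces) else 0

print(label_window((0, 0, 10, 10), [(0, 0, 10, 12)]))   # 1, IoU = 100 / 120
print(label_window((0, 0, 10, 10), [(8, 8, 20, 20)]))   # 0, IoU = 4 / 240
```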
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
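Three scales per octave means consecutive scale factors differ by 2^(1/3). A small sketch of the resulting list of factors; the number of octaves up and down is an assumption, since the slide does not state it.

```python
def pyramid_scales(octaves_up=1, octaves_down=2, steps_per_octave=3):
    """Scale factors 2^(i / steps_per_octave), from the smallest to the largest."""
    lo = -octaves_down * steps_per_octave
    hi = octaves_up * steps_per_octave
    return [2.0 ** (i / steps_per_octave) for i in range(lo, hi + 1)]

scales = pyramid_scales(octaves_up=1, octaves_down=1)
print(len(scales))  # 7
```

Each factor would be applied to the test image before running the detector, so that a fixed-size window covers faces of different sizes.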
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
This makes it possible to efficiently run the convnet on images of any size and to obtain a heat-map of the face detector
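A sketch of the arithmetic behind the heat-map: each heat-map cell corresponds to one 227x227 window of the input. The effective stride of 32 pixels is an assumption (typical for an AlexNet-style network, but not stated on the slide), and both function names are hypothetical.

```python
def heatmap_size(height, width, window=227, stride=32):
    """Number of window positions (rows, cols) covered by the fully convolutional net."""
    if height < window or width < window:
        return (0, 0)
    return ((height - window) // stride + 1, (width - window) // stride + 1)

def heatmap_to_box(row, col, window=227, stride=32):
    """Map a heat-map cell back to its (x1, y1, x2, y2) window in the input image."""
    x1, y1 = col * stride, row * stride
    return (x1, y1, x1 + window, y1 + window)

print(heatmap_size(227, 227))  # (1, 1)
print(heatmap_size(259, 291))  # (2, 3)
```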
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
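A minimal sketch of greedy NMS: keep the highest-scoring box and drop any remaining box that overlaps a kept one too much. This is the generic greedy variant, not necessarily the exact strategy used in DDFD, and the 0.3 overlap threshold is an assumption.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the boxes kept, in decreasing score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```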
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure labels: person, bag, me, my bag]
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
... by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
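The triplet idea can be sketched as a loss on three embeddings: an anchor face, a positive (same identity) and a negative (different identity). The loss is zero once the negative is farther from the anchor than the positive by at least a margin. The margin value of 0.2 is an assumption, and the selection of well-chosen triplets is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distances."""
    return max(0.0, squared_distance(anchor, positive)
                    - squared_distance(anchor, negative) + margin)

# The negative is already far enough from the anchor, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0]))  # 0.0
```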
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
Image database, visual query "A dog", expected outcome
Image database, visual query "This dog", expected outcome
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: word | Image ID (table rows flattened in extraction: 1 1 12 2 1 30 1023 10 124 23 6 10)
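Because the bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. A minimal sketch with invented image IDs and words (the values in the slide table could not be recovered); the voting scheme is a simplification for illustration.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def candidates(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

index = build_inverted_file({1: [3, 7], 2: [7, 9], 3: [4]})
print(candidates(index, [7, 9]))  # [2, 1]
```

Only images that share at least one word with the query are ever touched, which is what makes the search scale to large databases.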
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Babenko, A. & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R. & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C. & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max-pooled conv features as global representation
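Sum or max pooling collapses each channel of a convolutional feature map into a single number, giving one global descriptor per image. A sketch on nested lists, assuming the feature map is stored as channels of H x W values; the L2 normalisation step is a common companion and an assumption here.

```python
def pool_conv_features(feature_map, mode="sum"):
    """feature_map: list of channels, each an H x W list of lists. One value per channel."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_map]

def l2_normalize(vec):
    """Scale the descriptor to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)

fmap = [[[1, 2], [3, 4]], [[0, 5], [0, 0]]]   # 2 channels of 2 x 2
print(pool_conv_features(fmap, "sum"))  # [10, 5]
print(pool_conv_features(fmap, "max"))  # [4, 5]
```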
Objects Recognition Retrieval
Ng, J., Yang, F. & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids: 25K-D vector
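With a 42x32 feature map, each image yields 1344 local convolutional features; assigning each to its nearest centroid gives a sparse histogram over the 25K visual words. A toy sketch with two invented 2-D centroids; the brute-force nearest-neighbour search is for illustration only.

```python
from collections import Counter

def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])))

def bag_of_words(local_features, centroids):
    """Sparse histogram: visual word index -> number of local features assigned to it."""
    return Counter(nearest_centroid(f, centroids) for f in local_features)

centroids = [[0, 0], [10, 10]]                 # toy vocabulary of 2 words
features = [[1, 0], [9, 9], [10, 11]]          # toy local features
print(bag_of_words(features, centroids))       # Counter({1: 2, 0: 1})
print(42 * 32)                                 # 1344 local features per image
```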
Objects Recognition Retrieval
Query representation: Global Search (GS) and Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure labels: person, bag, me, my bag]
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Pixel-wise soft-max classifier
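A pixel-wise soft-max classifier applies the same soft-max to the class scores of every pixel independently. A small sketch on a nested-list score map; the layout score_map[y][x][class] is an assumption for illustration.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_pixels(score_map):
    """score_map[y][x] is a list of class scores; returns the most likely class per pixel."""
    labels = []
    for row in score_map:
        out = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            out.append(probs.index(max(probs)))
        labels.append(out)
    return labels

print(label_pixels([[[0.1, 2.0], [3.0, 0.0]]]))  # [[1, 0]]
```

Since each pixel is classified on its own, nothing forces neighbouring pixels to agree.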
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network
Solution 2: Superpixels + CRF. Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
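The selection rule can be sketched as a walk from a leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree and the cost values below are invented for illustration (the figure from the slide is not available); they are chosen so that the leaf C2 ends up with its ancestor C5.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from leaf to root."""
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)   # the root maps to None, which ends the walk
    return best

# Hypothetical tree: C1 and C2 merge into C5, C3 and C4 into C6, C5 and C6 into C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.3, "C4": 0.1,
        "C5": 0.3, "C6": 0.4, "C7": 0.6}

print(optimal_component("C2", parent, cost))  # C5
print(optimal_component("C1", parent, cost))  # C1
```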
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector A: feature vector 1 comes from the bounding box and feature vector 2 from the region with the background masked out with the mean image, both through the same BBOX CNN; the two are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: a BBOX CNN (feature vector 1) and a REGION CNN (feature vector 2). Training: the 2 networks are trained in isolation. Testing: results are combined as [1 2].
Vector C: Training: as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
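Pixel IU for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on label maps stored as lists of rows; returning None when the class is absent from both maps is an assumption of this sketch.

```python
def pixel_iu(pred, gt, cls):
    """Jaccard index of class cls between two label maps of equal size."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(p == cls and g == cls)
            union += int(p == cls or g == cls)
    return inter / union if union else None

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(pixel_iu(pred, gt, 1))  # 0.5
```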
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure labels: person, bag, me, my bag]
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
30
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Object Proposal computation is the bottleneck in current state of the art object detection systems
Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Detection Objects Reinforcement Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement Slide credit Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015) [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
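Rescaling 3 times per octave means consecutive pyramid levels differ by a factor of 2^(1/3). A minimal sketch of the resulting scale list follows; the function name and the minimum and maximum scales are hypothetical.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scales between min_scale and max_scale, with per_octave levels per doubling."""
    scales = []
    k = 0
    while True:
        s = min_scale * 2.0 ** (k / per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for floating-point error
            break
        scales.append(s)
        k += 1
    return scales
```

For example, `pyramid_scales(0.5, 2.0)` covers two octaves (0.5 to 1 and 1 to 2) with three steps each.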
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
Detection Faces DDFD Test
This makes it possible to:
Efficiently run the convnet on images of any size
Obtain a heat-map of the face detector
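To give a feel for the heat-map, the sketch below estimates its size for a given input. The 227x227 window comes from the slides; the total stride of 32 pixels is an assumption (typical of AlexNet-style networks) and is not stated in the slides, and the function name is hypothetical.

```python
def heatmap_size(height, width, window=227, stride=32):
    """Number of window positions scored in one fully convolutional pass.

    window=227 comes from the slides; stride=32 is an assumed total stride.
    """
    if height < window or width < window:
        return (0, 0)
    return ((height - window) // stride + 1, (width - window) // stride + 1)
```

Under these assumptions a 227x227 input yields a single score, while a 451x451 input yields an 8x8 heat-map, all computed in one forward pass instead of 64 separate window evaluations.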
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
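A minimal sketch of greedy NMS, assuming boxes as (x1, y1, x2, y2) with one score each. The 0.5 IoU threshold and the function names are illustrative choices, not values from the slides or from the cited tutorial.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The function returns the indices of the surviving detections, ordered by decreasing score.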
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
The embedding is learnt by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
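The triplet idea can be sketched as a loss over an anchor, a positive (same identity) and a negative (different identity) embedding. This is a minimal sketch assuming squared Euclidean distances and a margin of 0.2; the margin value and the function names are assumptions, since the slides do not give them.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

Triplets that already satisfy the margin contribute no loss, which is why the choice of triplets matters for training.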
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference (BMVC), 2015 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016
Objects Recognition Retrieval
Image Database
Visual Query: “A dog”
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: “This dog”
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
Bag of Visual Words: N-dimensional feature space, high-dimensional, highly sparse
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
INVERTED FILE
word | Image ID: 1 1 12 2 1 30 1023 10 124 23 6 10
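Because bag-of-visual-words vectors are highly sparse, an inverted file can map each visual word to the images that contain it. The sketch below is a minimal illustration with made-up image IDs and word IDs; the function names and the simple vote-count ranking are assumptions, not the scoring used in the cited work.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by how many distinct query words they share (ties broken by ID)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only the posting lists of the words present in the query are visited, so images that share no word with the query are never touched.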
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max pooled conv features as global representation
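Sum or max pooling collapses the spatial grid of a convolutional layer into one value per channel. A minimal sketch follows, assuming the feature map is given as a list of local descriptors (one per spatial location). The final L2 normalization and the function name are assumptions for illustration.

```python
def pool_conv_features(feature_map, mode="sum"):
    """Pool H*W local descriptors (each with C channels) into one C-dim global vector."""
    reducer = sum if mode == "sum" else max
    pooled = [reducer(channel) for channel in zip(*feature_map)]
    norm = sum(x * x for x in pooled) ** 0.5
    # Assumed post-processing: L2-normalize so that images can be compared by dot product.
    return [x / norm for x in pooled] if norm > 0 else pooled
```

The output length equals the number of channels, regardless of the image resolution.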
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Image resolution (336x256)
conv5_1 from VGG16 [1] (42x32)
25K centroids, 25K-D vector
Objects Recognition Retrieval
Query representation: Global Search (GS) and Local Search (LS)
One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
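The optimal-cover rule above amounts to walking from each leaf to the root of the segmentation tree and keeping the node with the lowest cost. A minimal sketch follows, assuming the tree is given as a parent map and the costs S as a dictionary. The function name and this representation are hypothetical.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the minimal-cost node on its path to the root.

    parent maps each node to its parent (the root maps to None);
    cost maps each node to its cost S.
    """
    cover = {}
    for leaf in leaves:
        best, node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        cover[leaf] = best
    return cover
```

For instance, with a hypothetical tree where C5 is the parent of C2 and has a lower cost than both C2 and the root, the leaf C2 is covered by C5 and takes its class.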
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the two are concatenated as [1 2]
The network is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Vector B: feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN, concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Penultimate fully connected layer, followed by an SVM
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
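Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch follows, assuming label maps stored as nested lists of integer category IDs. The function name is hypothetical.

```python
def pixel_iu(pred, gt, category):
    """Jaccard index (intersection over union) of one category between two label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt
            union += in_pred or in_gt
    return inter / union if union else 0.0
```

Averaging this value over categories gives a single score for a semantic segmentation.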
One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Fast R-CNN
Slide credit Amaia Salvador
Multi-task loss
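For reference, a sketch of the multi-task loss, assuming the formulation of the Fast R-CNN paper. None of the symbols appear on the slide: p is the predicted class distribution, u the ground-truth class (0 for background), t^u the predicted box offsets for class u, v the target offsets, and lambda a balancing weight.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_{u}

L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^{u}_{i} - v_{i}),
\qquad
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```

The indicator [u >= 1] switches off the localization term for background regions, so a single network is trained for classification and box regression at once.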
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99) [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)
Detection Objects Faster R-CNN
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
[Figure: Faster R-CNN pipeline. Labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]
Detection Objects Faster R-CNN
Objectness scores (object / no object)
Bounding box regression
In practice k = 9 (3 different scales and 3 aspect ratios)
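The k = 9 reference boxes (anchors) at one location can be sketched as every combination of 3 scales and 3 aspect ratios. The specific scale and ratio values below and the function name are assumptions, since the slide only gives the counts.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2 and height/width equal to the ratio.
    The default scales and ratios are assumed values.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

For each such anchor the RPN would predict one objectness score pair and one set of box regression offsets.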
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Figure: Faster R-CNN pipeline. Labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities, Fast R-CNN]
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model
Conv layers: ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1
Conv layers: ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
Conv layers: weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3
Conv layers: weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Detection accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Slide credit Míriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet: Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet: Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query
"A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query
"This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
INVERTED FILE (columns: word, Image ID)
1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
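The inverted file on this slide can be illustrated with a small sketch. Everything here is assumed for the example: the function names, the toy image IDs and visual-word IDs, and the simple vote count used for ranking. The point is that a sparse bag of visual words only needs to touch the posting lists of the words present in the query.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def rank_candidates(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for determinism.
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 10: [2, 3], 12: [1]}
index = build_inverted_file(database)
print(rank_candidates(index, [1, 2]))
```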
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max-pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
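As a rough illustration of sum and max pooling of convolutional features into a global descriptor, the sketch below collapses each channel of a feature map to one number and L2-normalizes the result. The function name, the nested-list layout (channels, rows, columns) and the final normalization are assumptions for this example, not the exact pipelines of the cited papers.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel (a 2-D grid) to one value, then L2-normalize."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    pooled = [reduce_fn(v for row in channel for v in row)
              for channel in feature_maps]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


# Toy feature map with 2 channels of size 2x2 (hypothetical activations).
features = [[[3.0, 0.0], [0.0, 0.0]],
            [[1.0, 1.0], [1.0, 1.0]]]
print(pool_conv_features(features, "sum"))
print(pool_conv_features(features, "max"))
```

The descriptor length equals the number of channels, regardless of the spatial size of the input image.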
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Resolution (336x256)
conv5_1 from VGG16 [1]
(42x32)
25K centroids, 25K-D vector
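A bag of local convolutional features can be sketched as follows: every local feature on the grid is assigned to its nearest centroid, and the image is described by the histogram of assignments. The function names and the toy 2-D features are assumptions for illustration; with the 25K centroids mentioned on the slide the histogram would be a 25K-dimensional, mostly empty vector.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def distance(k):
        return sum((f - c) ** 2 for f, c in zip(feature, centroids[k]))
    return min(range(len(centroids)), key=distance)


def bag_of_words(local_features, centroids):
    """Histogram of visual-word assignments for one image."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Toy codebook with 2 centroids and 3 local features (hypothetical values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
local_features = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
print(bag_of_words(local_features, centroids))
```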
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are shared by the three convnets:
theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
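A pixel-wise soft-max classifier can be sketched as below: each pixel carries one score per category, the scores are turned into probabilities, and the pixel receives the most probable label. The function names and the toy score map are assumptions for illustration; in the system described on these slides the scores would come from the multiscale convnet features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Assign to every pixel the category with the highest probability."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            label_row.append(max(range(len(probs)), key=lambda c: probs[c]))
        labels.append(label_row)
    return labels


# Toy 1x2 image with 2 categories per pixel (hypothetical scores).
score_map = [[[2.0, 1.0], [0.0, 3.0]]]
print(label_pixels(score_map))
```

Because each pixel is classified on its own, neighbouring pixels can receive inconsistent labels.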
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Binary Partition Tree (BPT) [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
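The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the component with the smallest cost. The tree, the costs and the function name below are invented for illustration; only the selection rule itself comes from the slide. The costs are chosen so that leaf C2 ends up covered by its parent C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost node on its path to the root."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = choice
    return best


# Hypothetical segmentation tree: leaves C1, C2, C3; root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.1, "C2": 0.6, "C3": 0.3, "C5": 0.2, "C6": 0.5, "C7": 0.9}
print(optimal_cover(parent, cost, ["C1", "C2", "C3"]))
```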
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: bounding box → BBOX CNN → feature vector 1; region → BBOX CNN → feature vector 2; concatenation [1 2]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
For the region, the background is masked out with the mean image
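Masking the background with the mean image can be sketched in plain Python. The function name and the single-channel toy image are assumptions for illustration. After the usual mean subtraction, the masked pixels become zero, so only the region foreground contributes to the features.

```python
def mask_background(image, mask, mean_image):
    """Keep foreground pixels; replace background pixels by the mean image."""
    return [
        [pixel if keep else mean for pixel, keep, mean in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]


# Toy 2x2 single-channel crop (hypothetical values).
image = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]           # 1 = region foreground, 0 = background
mean_image = [[5, 5], [5, 5]]     # dataset mean, same size as the crop
masked = mask_background(image, mask, mean_image)
print(masked)
```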
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: bounding box → BBOX CNN → feature vector 1; region → REGION CNN → feature vector 2; concatenation [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
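The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. The sketch below uses an assumed function name and toy label maps; a benchmark would typically average this value over categories.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels assigned to one category."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred, in_gt = pred == category, gt == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return intersection / union if union else 0.0


# Toy 2x2 label maps with categories 0 and 1 (hypothetical values).
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
print(pixel_iu(predicted, ground_truth, 1))
print(pixel_iu(predicted, ground_truth, 0))
```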
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems
Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram] Conv layers → Conv Layer 5 → RPN → RPN proposals; Conv Layer 5 + RPN proposals → RoI pooling layer → FC layers → class scores (class probabilities)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice, k = 9 anchors per location (3 different scales and 3 aspect ratios)
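The k = 9 anchors can be enumerated with a short sketch. The scale values (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper rather than on this slide, and the function name is an assumption. Each anchor keeps an area of roughly scale squared while its shape changes with the ratio.

```python
import math


def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs, one per scale and aspect ratio."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio is height / width
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


anchors = make_anchors()
print(len(anchors))
for width, height in anchors:
    print(round(width), round(height))
```

At every position of the conv feature map, the RPN predicts an objectness score and a box regression for each of these k shapes.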
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
4-step training to share features between the RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model
[Diagram] Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5 → RPN → RPN proposals
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 2: Train Fast R-CNN with the learned RPN proposals
[Diagram] Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5; RPN proposals (learned in Step 1) → class probabilities
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
[Diagram] Conv layers (weights from Step 2, fixed) → Conv Layer 5 → RPN → RPN proposals
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
[Diagram] Conv layers (weights from Steps 2 & 3, fixed) → Conv Layer 5; RPN proposals (learned in Step 3) → class probabilities
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R, computed with respect to the ground-truth bounding box
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Reward magnitude for the trigger: 3
Minimum IoU: 0.6
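The two rewards can be sketched as follows, under the reading that a transformation action is rewarded by the sign of the change in IoU with the ground-truth box, and that the trigger earns +3 when the final IoU reaches 0.6 and -3 otherwise. The function names and the (x1, y1, x2, y2) box format are assumptions for this example, and a move that leaves the IoU unchanged is treated here as negative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, ground_truth):
    """+1 if the transformation improved the IoU, -1 otherwise."""
    return 1.0 if iou(new_box, ground_truth) > iou(old_box, ground_truth) else -1.0


def trigger_reward(box, ground_truth, eta=3.0, tau=0.6):
    """+eta if the final box overlaps enough with the ground truth, else -eta."""
    return eta if iou(box, ground_truth) >= tau else -eta


ground_truth = (1, 1, 3, 3)
print(move_reward((0, 0, 2, 2), (1, 1, 3, 3), ground_truth))
print(trigger_reward((0, 0, 2, 2), ground_truth))
```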
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
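The ingredients named on this slide can be sketched without a neural network: a replay memory that stores experiences, a greedy policy over the estimated action values, and the standard Q-learning target used to train the estimator. Class and function names, the discount factor and the toy values are assumptions for illustration; in the described system the Q-values would be the outputs of the Q-network.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, terminal)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)


def greedy_action(q_values):
    """Index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r if terminal, else r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.add(step, 0, 1.0, step + 1, False)  # oldest entry is dropped
print(len(memory.buffer), greedy_action([0.1, 0.9, 0.3]))
```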
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
25k annotated faces on images downloaded from Flickr
380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down, 3 times per octave, to find faces of different sizes
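Rescaling 3 times per octave means that consecutive scale factors differ by a factor of 2^(1/3), so three steps double (or halve) the image size. The sketch below lists the factors; the function name and the choice of one octave down and one octave up are assumptions for illustration.

```python
def pyramid_scales(octaves_down=1, octaves_up=1, steps_per_octave=3):
    """Scale factors 2^(i / steps_per_octave) covering the requested octaves."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2.0 ** (i / steps_per_octave) for i in range(first, last + 1)]


scales = pyramid_scales()
print([round(s, 3) for s in scales])
```

Running a fixed-size detector on each rescaled image lets it respond to faces that are larger or smaller than its input window.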
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
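A sliding window can be enumerated as below. The window size of 227 comes from the slide; the stride of 32 pixels and the function name are assumptions for illustration. Windows that would fall outside the image are skipped.

```python
def sliding_windows(image_width, image_height, window=227, stride=32):
    """Return (left, top, right, bottom) boxes that fit inside the image."""
    boxes = []
    for top in range(0, image_height - window + 1, stride):
        for left in range(0, image_width - window + 1, stride):
            boxes.append((left, top, left + window, top + window))
    return boxes


print(len(sliding_windows(227, 227)))
print(len(sliding_windows(451, 451)))
```

Classifying each window independently repeats most of the convolutional computation on the overlapping areas.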
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
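A common greedy form of non-maximum suppression keeps the highest-scoring detection and discards any remaining box that overlaps a kept box too much. The sketch below is a generic version with assumed names, box format (x1, y1, x2, y2) and an IoU threshold of 0.5; it is not necessarily the exact NMS variant used by DDFD.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Two overlapping detections of the same face and one separate detection.
detections = [(0.8, (1, 1, 11, 11)), (0.9, (0, 0, 10, 10)), (0.7, (50, 50, 60, 60))]
print(non_maximum_suppression(detections))
```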
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Problem: no spatial consistency among labels.
Three explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing.
Problem with Solutions 1 & 2: the observation level.
BPT (Binary Partition Tree) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
Example: C2 will be labelled with the class of C5.
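The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the component with the lowest cost. The tree, the costs and the class names below are invented for illustration (they are not the ones in the slide figure); only the selection rule follows the text above.

```python
def optimal_cover(parent, cost, label):
    """For every leaf, return (best component, its class) along the leaf-to-root path."""
    internal = {p for p in parent.values() if p is not None}
    leaves = [n for n in parent if n not in internal]
    cover = {}
    for leaf in leaves:
        best, node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best]:   # keep the component with minimal cost S
                best = node
            node = parent[node]
        cover[leaf] = (best, label[best])
    return cover

# Hypothetical segmentation tree: C1 and C2 merge into C5; C3 and C5 merge into the root C6.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C5": 0.4, "C6": 0.7}
label = {"C1": "sky", "C2": "road", "C3": "tree", "C5": "building", "C6": "grass"}

cover = optimal_cover(parent, cost, label)
```

With these made-up costs, C2 is covered by C5 and therefore takes the class of C5, while C1 and C3 keep their own labels.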
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes.
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Vector A: the BBOX CNN is applied to both inputs.
Feature vector 1 comes from the bounding box; feature vector 2 comes from the region foreground, with the background masked out with the mean image. Both are concatenated as [1 2].
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: BBOX CNN for feature vector 1 and a separate REGION CNN for feature vector 2, concatenated as [1 2].
Training: the 2 networks are trained in isolation.
Testing: results are combined.
Vector C:
Training: as a whole (using segmentation overlap).
Testing: results are combined (using the output of the penultimate layer).
Penultimate fully connected layer -> SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation.
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al).
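Pixel IU is the intersection over union (Jaccard index) between the predicted and ground-truth pixels of a category. A minimal sketch on toy label maps follows; the two maps are invented, and averaging over categories is an assumption about how a single score is reported.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index between predicted and ground-truth pixels of one category."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else float("nan")

# Toy label maps (hypothetical): 0 = background, 1 = object.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

iu_object = pixel_iu(pred, truth, 1)
iu_background = pixel_iu(pred, truth, 0)
mean_iu = (iu_object + iu_background) / 2
```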
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Object proposal computation is the bottleneck in current state-of-the-art object detection systems.
Proposal methods: Selective Search, CPMC, MCG
[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
[CPMC] Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).
Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).
[Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class probabilities]
RPN outputs: objectness scores (object / no object) and bounding box regression.
In practice, k = 9 (3 different scales and 3 aspect ratios).
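The k = 9 reference boxes can be sketched as every combination of 3 scales and 3 aspect ratios centred on one location. The concrete scale and ratio values below are assumptions (the slide does not list them), and the function name is hypothetical.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box keeps an area of scale**2; ratio is height / width.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0.0, 0.0)   # k = 9 boxes for a single location
```

For every such box the RPN predicts an objectness score and a bounding box regression.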
Fast R-CNN: RoI pooling layer, FC layers and class scores, computed on the RPN proposals.
4-step training to share features for RPN and Fast R-CNN:
Step 1: Train RPN, initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize RPN, which is trained again. Weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Weights from Steps 2 & 3 (fixed).
[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Slide credit: Míriam Bellver
The object is localized based on visual features from AlexNet FC6.
Set of actions A:
Transformation actions.
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark.
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
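A state (o, h) can be sketched as the concatenation of the 4096-dimensional feature vector and a 90-dimensional binary history. How the 90 dimensions are laid out is an assumption here: 9 actions one-hot encoded over the last 10 steps. The feature vector is a dummy placeholder.

```python
NUM_ACTIONS = 9     # assumption: transformation actions plus the trigger
HISTORY_LEN = 10    # assumption: 9 actions x 10 steps = 90 dimensions

def encode_history(past_actions):
    """One-hot encode the most recent actions into a 90-dimensional binary vector."""
    recent = past_actions[-HISTORY_LEN:]
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h

def make_state(o, past_actions):
    """State = CNN feature vector o concatenated with the action history h."""
    return list(o) + encode_history(past_actions)

fc6_features = [0.0] * 4096                  # placeholder for real fc6 activations
state = make_state(fc6_features, [3, 3, 8])  # three actions taken so far
```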
Reward function R: defined with respect to the ground-truth bounding box.
Reward function R for the trigger action: reward magnitude 3, minimum IoU 0.6.
The reward function considers the number of steps as a cost.
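The two rewards can be sketched as follows. The values 3 and 0.6 come from the slide; the exact form of the rewards (plus or minus 1 depending on whether a move improves the IoU with the ground truth, plus or minus 3 for the trigger) is an assumption recalled from the cited paper, not something the slide text states.

```python
def move_reward(iou_before, iou_after):
    """+1 if a transformation action improved the IoU with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0

def trigger_reward(iou_value, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth with IoU >= tau, else -eta."""
    return eta if iou_value >= tau else -eta

# Hypothetical episode: IoU after each transformation, then the trigger.
ious = [0.20, 0.35, 0.30, 0.65]
rewards = [move_reward(a, b) for a, b in zip(ious, ious[1:])]
rewards.append(trigger_reward(ious[-1]))
```

Because every step that does not improve the overlap is penalised, long searches accumulate a cost.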
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
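A minimal sketch of these three pieces, assuming the Q-network is available as a list of Q-values (one per action): the greedy policy, a replay memory, and the standard Q-learning target. The class and function names, the memory capacity and the discount factor are hypothetical.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=q_values.__getitem__)

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target for one experience."""
    return reward if terminal else reward + gamma * max(next_q_values)

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1)
batch = memory.sample(2)
```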
Datasets for training and testing: PASCAL VOC.
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Training randomly samples sub-windows (blocks). A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative.
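The labelling rule for sampled sub-windows can be sketched with a box IoU and a 50% threshold. Boxes are assumed to be (x1, y1, x2, y2) tuples and the example coordinates are invented.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_positive(window, faces, threshold=0.5):
    """A sub-window is positive if it overlaps any annotated face with IoU > threshold."""
    return any(box_iou(window, face) > threshold for face in faces)

faces = [(1, 1, 11, 11)]                              # hypothetical annotation
positive = is_positive((0, 0, 10, 10), faces)         # IoU = 81 / 119
negative = is_positive((6, 0, 16, 10), faces)         # IoU = 45 / 155
```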
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
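Three scales per octave means that consecutive scale factors differ by 2^(1/3), so the image size doubles every three steps. A small sketch; the scale range from 0.5 to 2.0 is an arbitrary assumption.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    scales = []
    k = 0
    while True:
        s = min_scale * 2 ** (k / steps_per_octave)
        if s > max_scale * (1 + 1e-9):     # small tolerance for float rounding
            break
        scales.append(s)
        k += 1
    return scales

scales = pyramid_scales(0.5, 2.0)          # two octaves -> 7 scale factors
```

The detector is then run on each rescaled copy of the test image.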
Sliding window of 227x227 over the test image.
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to efficiently run the convnet on images of any size and to obtain a heat-map of the face detector.
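The conversion can be illustrated on a toy case: a fully-connected unit trained on a k x k window gives the same scores as the same weights slid over a larger image as a convolution, and the grid of scores is the heat-map. The sketch below uses a single unit, a 2x2 window and made-up numbers, whereas the detector on the slides works on 227x227 windows with a full network.

```python
def fc_score(window, weights, bias):
    """Fully-connected unit applied to one k x k window."""
    return sum(w * x
               for wrow, xrow in zip(weights, window)
               for w, x in zip(wrow, xrow)) + bias

def conv_heatmap(image, weights, bias, stride=1):
    """The same weights used as a convolution filter: one score per window position."""
    k = len(weights)
    height, width = len(image), len(image[0])
    heat = []
    for y in range(0, height - k + 1, stride):
        row = []
        for x in range(0, width - k + 1, stride):
            window = [r[x:x + k] for r in image[y:y + k]]
            row.append(fc_score(window, weights, bias))
        heat.append(row)
    return heat

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
weights = [[1, 0],
           [0, 1]]
heat = conv_heatmap(image, weights, bias=0)
```

The input can be any size of at least k x k; the output heat-map grows with it.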
Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
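A generic greedy NMS can be sketched as: sort detections by score, keep the best one, and drop every remaining box that overlaps a kept box too much. This is a plain IoU-based version written for illustration, not the implementation from the cited tutorial; the 0.5 threshold and the boxes are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the detections kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```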
Detection Faces DDFD Results
Precision vs Recall curves:
- DPM corresponds to Deformable Part-based Models.
- OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giro on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)...
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
...by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
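Learning from triplets can be sketched with a triplet loss: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The formula and the margin value of 0.2 are recalled from the FaceNet paper rather than taken from the slides, and the 2-D embeddings are toy values.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin) for one triplet of embeddings."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)

anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy_loss = triplet_loss(anchor, positive, negative=[2.0, 0.0])   # negative already far
hard_loss = triplet_loss(anchor, positive, negative=[1.0, 0.0])   # negative as close as positive
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters.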
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
[Figure: image database, visual query "A dog", expected outcome]
[Figure: image database, visual query "This dog", expected outcome]
Instance retrieval (instance: object, building, person, place...)
Local hand-crafted features (e.g. SIFT): v1 = (v11, ..., v1n) ... vk = (vk1, ..., vkn), in an N-dimensional feature space.
Bag of Visual Words: high-dimensional, highly sparse.
Inverted file: for each word, the IDs of the images that contain it.
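Because bag-of-visual-words vectors are highly sparse, an inverted file only has to store, for each word, the images that contain it. A minimal sketch follows; the image IDs and word IDs are invented, and ranking by the number of shared words is a simplifying assumption (real systems typically weight words, for example with tf-idf).

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def search(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

image_words = {1: [12, 30], 2: [12, 124], 3: [7]}   # hypothetical database
index = build_inverted_file(image_words)
ranking = search(index, [12, 30])
```

Only the images that share at least one word with the query are ever touched.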
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
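Sum or max pooling collapses an H x W grid of C-dimensional conv features into a single C-dimensional global descriptor. A toy sketch; the feature values are invented, and the final L2 normalisation is an assumption that the slide does not mention.

```python
import math

def pool_conv_features(fmap, mode="sum"):
    """fmap: H x W grid of C-dimensional local features -> one C-dimensional vector."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reducer = sum if mode == "sum" else max
    local_features = [f for row in fmap for f in row]
    pooled = [reducer(channel) for channel in zip(*local_features)]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled

# Toy 2x2 feature map with C = 2 channels.
fmap = [[[1.0, 0.0], [2.0, 0.0]],
        [[0.0, 4.0], [0.0, 0.0]]]
sum_descriptor = pool_conv_features(fmap, "sum")   # from [3, 4]
max_descriptor = pool_conv_features(fmap, "max")   # from [2, 4]
```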
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Bag of local convolutional features: image at 336x256 resolution, conv5_1 from VGG16 [1] gives a 42x32 feature map.
25K centroids, 25K-D vector.
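The encoding can be sketched as: assign every local conv feature to its nearest centroid (visual word), then count the words to get a sparse histogram with one dimension per centroid. The sketch uses a 2x2 map and 3 centroids instead of 42x32 and 25K, and it assumes the centroids have already been learnt (for example with k-means).

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])))

def assignment_map(fmap, centroids):
    """Replace every local feature by the ID of its visual word."""
    return [[nearest_centroid(f, centroids) for f in row] for row in fmap]

def bow_histogram(assignments, vocab_size):
    """Count how many locations fall on each visual word."""
    hist = [0] * vocab_size
    for row in assignments:
        for word in row:
            hist[word] += 1
    return hist

centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]          # hypothetical codebook
fmap = [[[0.1, 0.0], [0.9, 1.2]],
        [[1.0, 1.0], [4.0, 6.0]]]
assignments = assignment_map(fmap, centroids)
histogram = bow_histogram(assignments, len(centroids))
```

Keeping the assignment map, and not only the histogram, lets a region of the image be encoded without recomputing features.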
Query representation: Global Search (GS) and Local Search (LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
118
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
31
Slide credit Amaia Salvador
Selective Search CPMC
MCG
Replace the usage of external Object Proposals with a Region Proposal Network (RPN)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
[Figure labels: Conv layers, Conv Layer 5, RPN Proposals (learned in Step 1), Class probabilities]
Step 2: Train Fast R-CNN with the learned RPN proposals
ImageNet weights (fine-tuned)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
[Figure labels: Conv layers, Conv Layer 5, RPN, RPN Proposals]
Step 3: The model trained in Step 2 is used to initialize the RPN, which is then trained again
Weights from Step 2 (fixed)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
[Figure labels: Conv layers, Conv Layer 5, RPN Proposals (learned in Step 3), Class probabilities]
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
Weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Detection Objects Reinforcement Learning
46
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
47
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Learning
48
Slide credit Míriam Bellver
Set of actions A
Transformation actions
Detection Objects Reinforcement Learning
49
Slide credit Míriam Bellver
Set of actions A
Trigger action:
- Terminates the sequence of the current search
- Marks the region with inhibition-of-return (IoR)
Detection Objects Reinforcement Learning
50
Slide credit Míriam Bellver
Set of states S
State: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
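A minimal sketch of how such a state (o, h) could be assembled. The slide only gives the sizes (4096 and 90); splitting the 90 history bits into 10 past actions one-hot encoded over 9 actions is an assumption made here for illustration, as are all names in the code.

```python
NUM_ACTIONS = 9    # assumed: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumed: 9 * 10 = 90 matches the history size on the slide
FEATURE_DIM = 4096  # fc6 feature size from the slide


def encode_history(past_actions):
    """One-hot encode the most recent actions into a 90-dimensional binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    recent = past_actions[-HISTORY_LEN:]
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action-history vector h."""
    if len(o) != FEATURE_DIM:
        raise ValueError("expected a 4096-dimensional feature vector")
    return list(o) + encode_history(past_actions)


state = build_state([0.0] * FEATURE_DIM, [0, 3, 8])
print(len(state))
```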
Detection Objects Reinforcement Learning
51
Slide credit Míriam Bellver
Reward function R
[Figure labels: ground truth, bounding box]
Detection Objects Reinforcement Learning
52
Slide credit Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
[Equation annotations: 3; minimum IoU 0.6]
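The reward equations themselves did not survive in this transcript, so the following is only a sketch under stated assumptions: a transformation action is rewarded +1 if it increases the IoU with the ground-truth box and -1 otherwise, and the trigger action is rewarded +3 if the final IoU reaches the minimum of 0.6 and -3 otherwise. The values 3 and 0.6 come from the slide annotations; the exact form of both functions and all names are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, ground_truth):
    """Assumed form: +1 if the transformation improved the IoU, -1 otherwise."""
    return 1.0 if iou(new_box, ground_truth) > iou(old_box, ground_truth) else -1.0


def trigger_reward(box, ground_truth, eta=3.0, min_iou=0.6):
    """Assumed form: +eta if the final box overlaps enough with the ground truth, -eta otherwise."""
    return eta if iou(box, ground_truth) >= min_iou else -eta


gt = (0, 0, 10, 10)
print(move_reward((0, 0, 5, 10), (0, 0, 8, 10), gt), trigger_reward((0, 0, 8, 10), gt))
```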
Detection Objects Reinforcement Learning
53
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement Learning
54
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions.
The algorithm incorporates a replay memory to collect experiences.
Category-specific Q-network.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
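The pieces named on these slides (a greedy policy over the estimated action values, a replay memory, and the Q-learning target) can be sketched without any neural network. Everything below is illustrative: the class and function names, the memory capacity and the discount factor `gamma=0.9` are assumptions, and the Q-values are passed in as plain lists rather than produced by a network.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target for one experience."""
    return reward if terminal else reward + gamma * max(next_q_values)


print(greedy_action([0.1, 0.7, 0.3]))
```

During training, an exploration strategy (for example epsilon-greedy) would normally replace the purely greedy choice; that part is left out here.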
Detection Objects Reinforcement Learning
55
Slide credit Míriam Bellver
Detection Objects Reinforcement Learning
56
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement Learning
57
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement Learning
58
Slide credit Míriam Bellver
Detection Objects Reinforcement Learning
59
Slide credit Míriam Bellver
Detection Faces
60
Detection Faces DDFD
61
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
62
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces in images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
63
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
64
Test images are rescaled up/down 3 times per octave to find faces of different sizes
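Rescaling "3 times per octave" means that consecutive pyramid levels differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch of the resulting scale factors follows; the function name and the scale range (0.5 to 2.0) are assumptions for illustration.

```python
import math


def pyramid_scales(min_scale=0.5, max_scale=2.0, steps_per_octave=3):
    """Scale factors 2**(i / steps_per_octave) that fall inside [min_scale, max_scale]."""
    first = math.ceil(round(math.log2(min_scale) * steps_per_octave, 9))
    last = math.floor(round(math.log2(max_scale) * steps_per_octave, 9))
    return [2.0 ** (i / steps_per_octave) for i in range(first, last + 1)]


for scale in pyramid_scales():
    print(round(scale, 3))
```

Each rescaled image would then be scanned with the fixed-size detector window, so that small scale factors find large faces and large scale factors find small ones.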
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
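The conversion can be illustrated on a toy, single-channel case: the weights of a fully-connected unit trained on a kh x kw input are reused as a kh x kw convolution kernel, so a larger image yields a map of outputs instead of a single score. This is a pure-Python sketch with hypothetical names and toy numbers, not the DDFD implementation.

```python
def fc_forward(weights, bias, patch):
    """Fully-connected unit applied to one flattened kh x kw patch."""
    flat = [v for row in patch for v in row]
    return sum(w * x for w, x in zip(weights, flat)) + bias


def conv_forward(weights, bias, image, kh, kw, stride=1):
    """The same weights used as a kh x kw kernel slid over an image of any size."""
    rows, cols = len(image), len(image[0])
    heat_map = []
    for y in range(0, rows - kh + 1, stride):
        out_row = []
        for x in range(0, cols - kw + 1, stride):
            acc = bias
            for dy in range(kh):
                for dx in range(kw):
                    acc += weights[dy * kw + dx] * image[y + dy][x + dx]
            out_row.append(acc)
        heat_map.append(out_row)
    return heat_map


weights, bias = [1, 2, 3, 4], 0.5
image = [[1, 0, 2],
         [0, 1, 0],
         [3, 0, 1]]
print(conv_forward(weights, bias, image, 2, 2))
```

The top-left entry of the map equals what the fully-connected unit would output on the top-left 2x2 patch, which is the equivalence the slide relies on.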
Detection Faces DDFD Test
67
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat map of the face detector
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
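A minimal sketch of greedy non-maximum suppression: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. The IoU threshold of 0.5 and the function names are assumptions; this is the generic greedy variant, not necessarily the one used in the cited source.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept after greedy suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```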
Detection Faces DDFD Results
69
Detection Faces DDFD Results
70
Precision vs. recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
71
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure labels: person, bag, me, my bag]
Faces Recognition FaceNet
72
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
73
[Figure: Faces → FaceNet → Euclidean space where distances correspond to face similarity]
Faces Recognition FaceNet
74
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
75
... by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
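The slides mention triplets without giving the loss, so here is a minimal sketch of a triplet loss on embedding vectors: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The margin value 0.2 and the "semi-hard" selection test are assumptions used for illustration, and all names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive by the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Negative is farther than the positive, but still inside the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return d_ap < d_an < d_ap + margin


a, p = (0.0, 0.0), (0.0, 1.0)
print(triplet_loss(a, p, (0.0, 2.0)), is_semi_hard(a, p, (0.0, 1.05)))
```

Choosing which triplets to train on matters because most random triplets already satisfy the margin and contribute zero loss.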
Faces Recognition FaceNet
76
Faces Recognition FaceNet
77
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
78
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet
79
Faces Recognition FaceNet Test
80
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
81
Software implementation: OpenFace
Faces Recognition VGG Face
82
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
83
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.
Objects Recognition Retrieval
84
Image Database
Visual Query: “A dog”
Expected outcome
Objects Recognition Retrieval
85
Image Database
Visual Query: “This dog”
Expected outcome
Objects Recognition Retrieval
86
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
87
Local hand-crafted features (e.g. SIFT)
v_1 = (v_11, …, v_1n)
v_k = (v_k1, …, v_kn)
Bag of Visual Words: N-dimensional feature space (high-dimensional, highly sparse)
INVERTED FILE (columns: word, image ID)
1 1 12 2 1 30 1023 10 124 23 6 10
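Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the IDs of the images that contain it. The sketch below builds such an index and ranks images by the number of visual words shared with a query. The image IDs, word IDs and the simple vote-count ranking are hypothetical; they are not the entries of the table on the slide.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Hypothetical database: image ID -> visual words detected in that image.
database = {1: [1, 2], 10: [3, 6], 12: [1, 3]}
index = build_inverted_file(database)
print(query(index, [1, 3]))
```

Only images sharing at least one word with the query are touched, which is what makes the structure scale to large databases.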
Objects Recognition Retrieval
88
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
89
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
90
Convolutional Neural Networks: sum/max-pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
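Sum or max pooling turns a stack of convolutional feature maps (C channels of H x W activations) into a single C-dimensional descriptor by reducing each channel over all spatial positions. A minimal sketch with toy numbers follows; the function names and the optional L2 normalization step are assumptions, and the cited methods add further weighting or region schemes that are not shown.

```python
def pool_features(feature_maps, mode="sum"):
    """Reduce each H x W channel to one value; returns a C-dimensional descriptor."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vector):
    """Scale a descriptor to unit length so that dot products act as cosine similarity."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy activations: 2 channels of size 2 x 2.
maps = [[[1, 2], [3, 4]],
        [[0, 0], [0, 5]]]
print(pool_features(maps, "sum"), pool_features(maps, "max"))
```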
Objects Recognition Retrieval
91
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
92
Objects Recognition Retrieval
93
Resolution (336x256)
conv5_1 from VGG16 [1] (42x32)
25K centroids, 25K-D vector
Objects Recognition Retrieval
94
Query representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
95
One lecture organized in four parts
96
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure labels: person, bag, me, my bag]
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
98
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
101
Upsampling and concatenation
Objects Segmentation Farabet
102
Pixel-wise softmax classifier
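A pixel-wise softmax classifier turns the per-class scores at each pixel into probabilities and labels the pixel with the most probable class. The sketch below uses plain nested lists and toy scores; the names and the data layout (`score_map[y][x]` holding one score per class) are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Return a label map holding the most probable class index at every pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


# Toy 1 x 2 image with 3 classes per pixel.
score_map = [[[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]]]
print(label_pixels(score_map))
```

Since every pixel is labelled independently, neighbouring pixels can receive inconsistent labels, which is the problem the next slides address.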
Objects Segmentation Farabet
103
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
106
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
107
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
108
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
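The selection rule on the last slide (for each leaf, take the component with minimal cost on the path to the root) can be sketched as a simple walk up the segmentation tree. The tree and the cost values below are hypothetical; they are chosen only so that leaf C2 ends up covered by C5, mirroring the example on the slide.

```python
def optimal_component(parent, cost, leaf):
    """Walk from a leaf to the root and return the node with minimal cost S."""
    best = leaf
    node = leaf
    while node in parent:  # the root has no entry in `parent`
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: C7 is the root, C5 and C6 are its children, C1..C4 are leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.05, "C2": 0.6, "C3": 0.3, "C4": 0.4, "C5": 0.1, "C6": 0.5, "C7": 0.9}

for leaf in ("C1", "C2", "C3", "C4"):
    print(leaf, "->", optimal_component(parent, cost, leaf))
```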
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
[Figure, vector A: the BBOX CNN produces feature vector 1 and feature vector 2 (background masked out with the mean image), concatenated as [1 2]]
Fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training: 2 networks trained in isolation
Testing: results are combined
[Figure, vector B: the BBOX CNN produces feature vector 1 and the REGION CNN produces feature vector 2, concatenated as [1 2]]
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
[Figure: vector C]
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
[Figure labels: penultimate fully connected layer, SVM]
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
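Pixel IU (the Jaccard index) for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch on toy label maps; the function name and the choice to return `None` when the class is absent from both maps are assumptions.

```python
def pixel_iu(prediction, ground_truth, cls):
    """Intersection over union of the pixels labelled `cls` in two label maps."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if p == cls and g == cls:
                intersection += 1
            if p == cls or g == cls:
                union += 1
    return intersection / union if union else None


prediction = [[0, 1],
              [1, 1]]
ground_truth = [[0, 1],
                [0, 1]]
print(pixel_iu(prediction, ground_truth, 1), pixel_iu(prediction, ground_truth, 0))
```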
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
One lecture organized in four parts
118
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure labels: person, bag, me, my bag]
Thank you
119
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
32
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
33
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
34
Slide credit Amaia Salvador
Objectness scores(objectno object)
Bounding Box Regression
In practice k = 9 (3 different scales and 3 aspect ratios)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces -> FaceNet -> Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning) by means of well-chosen triplets, using curriculum learning.
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
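A minimal sketch of a triplet loss on embeddings, to make "learning from triplets" concrete. It follows the standard hinge form (anchor-positive distance should be smaller than anchor-negative distance by a margin). The margin value 0.2 and the plain-list embeddings are assumptions made for the example, and triplet selection is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive
    by at least the margin, positive otherwise."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

A triplet whose negative is already far away contributes no loss, which is why the choice of triplets matters for training.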
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Visual query "A dog": expected outcome from the image database
Visual query "This dog": expected outcome from the image database
Instance Retrieval (instance: object, building, person, place…)
Local hand-crafted features (e.g. SIFT): v_1 = (v_11, …, v_1n), …, v_k = (v_k1, …, v_kn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
Inverted file: table from each visual word to the IDs of the images that contain it
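A small sketch of an inverted file for sparse bag-of-visual-words vectors. The image IDs, visual word IDs and the vote-counting ranking are made up for the illustration. A real system would typically weight the votes (e.g. with tf-idf) rather than count shared words.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidates(index, query_words):
    """Rank images by the number of visual words shared with the query.
    Only images sharing at least one word are ever visited."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words present in the image.
image_words = {1: [1, 2], 10: [3, 6], 12: [1, 3]}
index = build_inverted_file(image_words)
```

Because the vectors are highly sparse, a query touches only the few index rows of its own visual words instead of comparing against every image.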
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
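A minimal sketch of sum and max pooling of convolutional feature maps into one global descriptor: each of the C channels (an H x W map) is reduced to a single number, giving a C-dimensional vector. The final L2 normalisation is an assumption added for the example. The cited papers each add their own weighting and post-processing, which is not shown.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of lists.
    Returns an L2-normalised C-dimensional descriptor."""
    reduce_fn = sum if mode == "sum" else max
    vec = [reduce_fn(v for row in channel for v in row)
           for channel in feature_maps]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```

Two images can then be compared with a dot product between their descriptors.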
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Image resolution (336x256) -> conv5_1 from VGG16 [1] (42x32) -> 25K centroids -> 25K-D vector
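A toy sketch of the last step of that pipeline: each local convolutional feature (one per spatial position of the feature map) is assigned to its nearest centroid, and the assignments are counted into a histogram with one bin per centroid. Two centroids stand in for the 25K of the slide, and all values and names are invented for the example.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bow_histogram(local_features, centroids):
    """Bag-of-words vector: number of local features assigned to each centroid."""
    hist = [0] * len(centroids)
    for feature in local_features:
        hist[nearest_centroid(feature, centroids)] += 1
    return hist
```

With 25K centroids and 42x32 local features, at most 1344 bins can be non-zero, which is what makes the representation sparse.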
Query Representation
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
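A minimal sketch of a pixel-wise soft-max classifier: every pixel carries a vector of class scores, the soft-max turns it into class probabilities, and the label is the most probable class. The score map is assumed to be already upsampled to image resolution. The nested-list layout and function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_labels(score_map):
    """score_map[r][c] is the list of class scores of pixel (r, c).
    Returns the map of most probable class indices."""
    probs = [[softmax(px) for px in row] for row in score_map]
    return [[max(range(len(p)), key=p.__getitem__) for p in row]
            for row in probs]
```

Each pixel is classified independently of its neighbours, so nothing in this step enforces spatially consistent labels.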
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
C2 will be labelled with the class of C5.
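The selection rule can be sketched on a small segmentation tree. The tree shape, the component names and the cost values below are invented for the illustration. Only the rule itself (walk from the leaf to the root and keep the component with minimal cost) comes from the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the component on the leaf-to-root path
    that has the minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root maps to None
        best[leaf] = best_node
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 into C6, C5 and C6 into C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.5, "C2": 0.6, "C3": 0.2, "C5": 0.1, "C6": 0.4, "C7": 0.3}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

With these made-up costs, `cover["C2"]` is `"C5"`, so the pixels of C2 take the class predicted for C5, while C3 keeps its own prediction.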
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)
Interest in obtaining segments, not just bounding boxes.
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Vector A: the bounding box and the region foreground (background masked out with the mean image) are both fed to the BBOX CNN, and feature vectors 1 and 2 are concatenated as [1 2].
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from the REGION CNN, concatenated as [1 2].
Training: the 2 networks are trained in isolation.
Testing: results are combined.
Vector C:
Training: as a whole (using segmentation overlap).
Testing: results are combined (using the output of the penultimate layer).
Penultimate fully connected layer -> SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation.
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
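A minimal sketch of the pixel IU (Jaccard index) for one category, computed between a predicted label map and a ground-truth label map of the same size. The nested-list label maps are an assumption of the example. Benchmarks usually average this value over categories.

```python
def pixel_iu(pred, truth, category):
    """Pixels labelled `category` in both maps, divided by pixels
    labelled `category` in at least one of them."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else 0.0
```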
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Figure: Faster R-CNN pipeline. Labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]
Objectness scores (object / no object)
Bounding Box Regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
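A small sketch of how k = 9 reference boxes (called anchors in the Faster R-CNN paper) can be generated around one position by crossing 3 scales with 3 aspect ratios. The concrete scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are taken from memory of that paper, not from the slide, so treat them as assumptions. The function name is illustrative.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred
    on (cx, cy). Each box has area scale**2 and height/width = ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

The objectness scores and the bounding box regression are then predicted for each of the k boxes at every position.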
Fast R-CNN
4-step training to share features for RPN and Fast R-CNN
Step 1: Train RPN initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize RPN, which is trained again. Weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Weights from Steps 2 & 3 (fixed).
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement L
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Object is localized based on visual features from AlexNet FC6.
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
Reward Function R (figure label: ground-truth bounding box)
Reward Function R for trigger action: reward value 3, minimum IoU 0.6
The reward function considers the number of steps as a cost.
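A hedged sketch of the two rewards. The trigger reward uses the values on the slide (3 and a minimum IoU of 0.6). The reward for transformation actions (+1 if the IoU with the ground-truth box improves, -1 otherwise) is reconstructed from memory of the cited paper and is an assumption here, as are the function names and the box format (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the transformation improved the IoU with the
    ground-truth box, -1 otherwise."""
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, -eta otherwise."""
    return eta if iou(box, gt_box) >= tau else -eta
```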
Policy function: if the current state is S, which should be the next action A?
Reinforcement Learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and uses a category-specific Q-network.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
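A minimal sketch of the two ingredients named above, with the Q-network replaced by a plain list of per-action values. `greedy_action` is the policy (pick the action with the highest estimated value) and `q_target` is the standard Q-learning regression target. The discount factor 0.9 is an arbitrary value for the example, not one taken from the slides.

```python
def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: reward plus the discounted best value of the
    next state, or the reward alone if the episode has ended."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)
```

In training, transitions sampled from the replay memory would supply `reward` and the network outputs for the next state.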
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max.
Upsampling and concatenation
Pixel-wise soft-max classifier
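The last two steps (upsampling and concatenation, then a pixel-wise soft-max) can be sketched as below. This is only an illustration: the nearest-neighbour upsampling, the scale factors (1, 1/2, 1/4), the channel counts and the linear classifier `W`, `b` are assumptions, not details taken from the slides.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map by an integer factor."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)

def pixelwise_softmax(scores):
    """Softmax over the last (class) axis, independently for every pixel."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Toy feature maps from three scales (full, 1/2 and 1/4 resolution), 4 channels each.
f1 = rng.standard_normal((8, 8, 4))
f2 = rng.standard_normal((4, 4, 4))
f3 = rng.standard_normal((2, 2, 4))
features = np.concatenate(
    [f1, upsample_nearest(f2, 2), upsample_nearest(f3, 4)], axis=-1)  # (8, 8, 12)

n_classes = 5
W = rng.standard_normal((features.shape[-1], n_classes))  # hypothetical linear classifier
b = np.zeros(n_classes)
probs = pixelwise_softmax(features @ W + b)  # (8, 8, n_classes)
labels = probs.argmax(axis=-1)               # one category per pixel
```

Because every pixel is classified on its own, neighbouring pixels can receive different labels, which is the spatial-consistency problem raised on the next slide.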
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels (prediction with a 2-layer network)
Solution 2: Superpixels + CRF (prediction with a 2-layer network)
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
Example: C2 will be labelled with the class of C5.
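The selection rule above can be sketched on a toy tree. The tree layout, the node names beyond C2 and C5, and all cost values are hypothetical and chosen only so that the leaf C2 ends up covered by C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best

# Toy tree (hypothetical): C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.3, "C4": 0.4,
        "C5": 0.5, "C6": 0.6, "C7": 0.8}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Here C2 has a high cost of its own, so the cheapest component on its path to the root is its parent C5, and C2 takes C5's class.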
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A = [1 2]: feature vector 1 from the BBOX CNN on the bounding box, feature vector 2 from the BBOX CNN on the region with the background masked out with the mean image
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B = [1 2]: feature vector 1 from the BBOX CNN, feature vector 2 from the REGION CNN
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer, then SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
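As a reminder of what the pixel IU (Jaccard index) measures, here is a small sketch that compares a predicted label map with a ground-truth one. The function name `pixel_iu` and the toy label maps are illustrative assumptions.

```python
import numpy as np

def pixel_iu(pred, gt, n_classes):
    """Per-class intersection-over-union between two integer label maps."""
    ious = []
    for c in range(n_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])
ious = pixel_iu(pred, gt, n_classes=2)  # class 0: 4/5, class 1: 3/4
```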
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Objectness scores (object / no object)
Bounding box regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
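A minimal sketch of how the k = 9 reference boxes (anchors) at one sliding position could be enumerated. The concrete scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumptions based on the Faster R-CNN paper, not on this slide, and `make_anchors` is a hypothetical helper name.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes as (x1, y1, x2, y2) centred at 0."""
    anchors = []
    for s in scales:
        for r in ratios:          # r = height / width
            w = s / np.sqrt(r)    # keeps the area equal to s * s
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()  # shape (9, 4)
```

For each of these boxes the RPN predicts an objectness score and a bounding box regression, so a position yields 2k scores and 4k coordinates.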
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: conv layers up to Conv Layer 5 feed the RPN, which outputs RPN proposals; the RoI pooling layer and FC layers turn each proposal into class scores (class probabilities)]
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Fast R-CNN
4-step training to share features between RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Object is localized based on visual features from AlexNet FC6
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Reward function R: computed from the ground truth and the bounding box
Reward function R for the trigger action (values on the slide: 3; minimum IoU 0.6)
The reward function considers the number of steps as a cost
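The two reward functions can be sketched as below. The formulas are not legible in the transcript, so this follows the definitions in the Caicedo and Lazebnik paper as an assumption: a transformation is rewarded by the sign of the IoU change, and the trigger earns +3 when the IoU with the ground truth reaches 0.6 and -3 otherwise. Treating an unchanged IoU as -1 is a simplification made here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth by at least tau, else -eta."""
    return eta if iou(box, gt) >= tau else -eta

gt = (0, 0, 10, 10)
r_move = move_reward((0, 0, 4, 4), (0, 0, 8, 8), gt)  # the box grew toward the object
r_hit = trigger_reward((0, 0, 10, 8), gt)             # IoU = 0.8
r_miss = trigger_reward((0, 0, 5, 5), gt)             # IoU = 0.25
```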
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions.
The algorithm incorporates a replay memory to collect experiences.
The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
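A minimal sketch of the state construction and the greedy policy described above. The split of the 90-dimensional history into 10 past actions of 9 one-hot entries each (8 transformations plus the trigger) is an assumption based on the paper, and the linear `q_net` is only a toy stand-in for the trained Q-network.

```python
import numpy as np

N_ACTIONS = 9       # assumption: 8 transformation actions + 1 trigger
HISTORY_STEPS = 10  # 10 past actions x 9 one-hot entries = 90-dim history

def make_state(o, past_actions):
    """Concatenate the 4096-d descriptor o with the 90-d binary action history h."""
    h = np.zeros((HISTORY_STEPS, N_ACTIONS))
    for slot, a in enumerate(past_actions[-HISTORY_STEPS:]):
        h[slot, a] = 1.0
    return np.concatenate([o, h.ravel()])

def greedy_action(q_network, state):
    """Policy: choose the action with the highest estimated action value."""
    return int(np.argmax(q_network(state)))

rng = np.random.default_rng(0)
W = rng.standard_normal((N_ACTIONS, 4096 + 90)) * 0.01  # toy linear Q-function
q_net = lambda s: W @ s
state = make_state(rng.standard_normal(4096), past_actions=[0, 3, 3])
action = greedy_action(q_net, state)
```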
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.
Sub-windows (blocks) are sampled randomly. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.
Total samples: 200K positive and 20M negative
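The labelling rule for training samples can be sketched as follows. The window size, the image size, the sampling scheme and the helper names are illustrative assumptions; only the 50% IoU threshold comes from the slide.

```python
import random

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, faces, threshold=0.5):
    """1 if the window overlaps any annotated face with IoU > threshold, else 0."""
    return int(any(iou(window, f) > threshold for f in faces))

def sample_windows(img_w, img_h, size, n, seed=0):
    """Draw n square windows of a fixed size at random positions."""
    rng = random.Random(seed)
    windows = []
    for _ in range(n):
        x = rng.randint(0, img_w - size)
        y = rng.randint(0, img_h - size)
        windows.append((x, y, x + size, y + size))
    return windows

faces = [(100, 100, 200, 200)]  # toy annotation
windows = sample_windows(640, 480, size=100, n=1000)
labels = [label_window(w, faces) for w in windows]
```

Random windows rarely overlap a face this much, which is consistent with the strong imbalance between positive and negative samples reported on the slide.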
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
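"3 times per octave" means consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A small sketch, where the scale range 0.5 to 2.0 and the function name are assumptions:

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced by 2 ** (1 / steps_per_octave) from min to max."""
    n = int(math.floor(steps_per_octave * math.log2(max_scale / min_scale) + 1e-9))
    return [min_scale * 2.0 ** (k / steps_per_octave) for k in range(n + 1)]

scales = pyramid_scales(0.5, 2.0)  # two octaves, 7 scales in total
```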
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
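A generic greedy NMS can be sketched as below. This is not the code from the cited tutorial or from DDFD; the 0.5 overlap threshold and the toy boxes are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as x1, y1, x2, y2; returns the kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]  # drop boxes that overlap the winner
    return keep

boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 300, 300]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box is suppressed by the first
```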
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
... by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
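The triplet objective behind this embedding can be sketched as follows: an anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The margin value 0.2 and the 2-D toy embeddings are assumptions; the slides do not show the loss explicitly.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss on squared distances: max(d(a,p) - d(a,n) + margin, 0)."""
    d_pos = ((anchor - positive) ** 2).sum(axis=-1)
    d_neg = ((anchor - negative) ** 2).sum(axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

a = l2_normalize(np.array([[1.0, 0.0]]))
p = l2_normalize(np.array([[0.9, 0.1]]))
n = l2_normalize(np.array([[0.0, 1.0]]))
easy = triplet_loss(a, p, n)  # negative already far away: no loss
hard = triplet_loss(a, n, p)  # roles swapped: the constraint is violated
```

Triplets that already satisfy the margin contribute nothing to the gradient, which is why the choice of triplets matters.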
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image database, visual query "A dog", expected outcome
Image database, visual query "This dog", expected outcome
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
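Because bag-of-visual-words vectors are highly sparse, they are usually stored in an inverted file that maps each visual word to the images containing it. A minimal sketch with hypothetical toy data (the image IDs and words below are not the ones from the slide's table):

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return {w: sorted(ids) for w, ids in index.items()}

def candidates(index, query_words):
    """Images sharing at least one visual word with the query."""
    found = set()
    for w in query_words:
        found.update(index.get(w, []))
    return sorted(found)

image_words = {1: [1, 2], 12: [1, 3], 30: [2]}  # toy data: image ID -> visual words
index = build_inverted_file(image_words)
hits = candidates(index, [3])
```

Only the images listed under the query's words need to be scored, which keeps search time low on large databases.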
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision, CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
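Sum or max pooling collapses a conv feature map into a single vector per image, which can then be compared with a dot product. The sketch below uses random toy maps and plain L2 normalisation; the cited papers add steps (weighting, whitening, regions) that are not reproduced here, and `pooled_descriptor` is a hypothetical helper name.

```python
import numpy as np

def pooled_descriptor(feature_map, mode="sum"):
    """Collapse an (H, W, D) conv feature map into one L2-normalised D-dim vector."""
    flat = feature_map.reshape(-1, feature_map.shape[-1])
    v = flat.sum(axis=0) if mode == "sum" else flat.max(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
db = [np.abs(rng.standard_normal((32, 42, 16))) for _ in range(5)]  # toy post-ReLU maps
query = db[3] + 0.01 * np.abs(rng.standard_normal((32, 42, 16)))    # near-copy of image 3
db_vecs = np.stack([pooled_descriptor(f) for f in db])
ranking = np.argsort(-(db_vecs @ pooled_descriptor(query)))         # cosine similarity
```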
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
35
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
36
Slide credit Amaia Salvador
Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
37
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN RPN Proposals
RPN Proposals
Class probabilities
RoI pooling layerFC layersClass scores
4-step training to share features for RPN and Fast R-CNN
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
38
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 1 Train RPN initialized with an ImageNet pre-trained model
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A
Transformation actions (figure)
Trigger action:
- Terminates the sequence of the current search
- Marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from the fc6 layer of a pre-trained CNN (4096 dimensions)
h = history of taken actions, encoded as a binary vector (90 dimensions)
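A minimal sketch of how such a state could be assembled. The split of the 90 history dimensions into 10 past actions of 9 one-hot entries each is an assumption (the slide only gives the total), and all names are hypothetical.

```python
FEATURE_DIM = 4096   # size of o (fc6 features), from the slide
NUM_ACTIONS = 9      # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10     # assumption: 9 * 10 = 90 history dimensions


def encode_history(past_actions):
    """One-hot encode the most recent actions into a 90-dim binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    recent = past_actions[-HISTORY_LEN:]
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """Concatenate the visual features o and the action history h."""
    if len(o) != FEATURE_DIM:
        raise ValueError("o must have %d dimensions" % FEATURE_DIM)
    return list(o) + encode_history(past_actions)
```

For example, `build_state([0.0] * 4096, [0, 8, 3])` gives a 4186-dimensional state with three history bits set.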
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R (figure: ground-truth bounding box)
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Trigger reward magnitude: 3
Minimum IoU: 0.6
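The two rewards can be sketched as follows. The box format (x1, y1, x2, y2), the function names and the +1/-1 values for transformation actions are assumptions for illustration. The trigger magnitude 3 and the IoU threshold 0.6 come from the slide.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """+1 if a transformation action improved the IoU with the ground truth, else -1."""
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta
```

Because every non-improving step is penalized, longer searches accumulate less reward, which is how the number of steps acts as a cost.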
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
The policy is learned with reinforcement learning, using Q-learning
The action-value function is estimated with a neural network that has as many output units as actions
- The algorithm incorporates a replay memory to collect experiences
- Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
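A minimal sketch of the pieces named above: a replay memory, the greedy policy over the Q-network outputs, and the standard Q-learning target. The network itself is omitted. The class and function names and the discount factor `gamma` are hypothetical choices.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch (smaller if the buffer is not full yet)."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Pick the action whose estimated action value is highest."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target for one experience."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)
```

Here `q_values` stands for the output units of the Q-network, one per action, evaluated at the current state.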
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Dataset for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Result: best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Sub-windows (blocks) are sampled randomly: a sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise
Total samples: 200K positive and 20M negative
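The sampling rule can be sketched as below, assuming boxes in (x1, y1, x2, y2) format and square windows. The function names and the window sampler are hypothetical. Only the 50% IoU rule comes from the slide.

```python
import random


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def sample_windows(img_w, img_h, size, n, rng):
    """Randomly sample n square sub-windows that lie inside the image."""
    windows = []
    for _ in range(n):
        x = rng.randint(0, img_w - size)
        y = rng.randint(0, img_h - size)
        windows.append((x, y, x + size, y + size))
    return windows


def label_window(window, faces, threshold=0.5):
    """1 (positive) if the window overlaps any annotated face with IoU > threshold, else 0."""
    return int(any(iou(window, face) > threshold for face in faces))
```

With this rule, most random windows are negatives, which is consistent with the large imbalance between positive and negative samples reported on the slide.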
Detection Faces DDFD Test
Test images are rescaled up/down at 3 scales per octave to find faces of different sizes
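A short sketch of the scale factors such a pyramid would use, assuming "3 times per octave" means consecutive scales differ by a factor of 2^(1/3). The scale range and the function name are hypothetical.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """List rescaling factors from min_scale to max_scale, per_octave steps per doubling."""
    step = 2 ** (1.0 / per_octave)
    scales = []
    k = 0
    # Small tolerance so that floating-point rounding does not drop the last scale.
    while min_scale * step ** k <= max_scale * (1 + 1e-9):
        scales.append(min_scale * step ** k)
        k += 1
    return scales
```

For instance, `pyramid_scales(0.5, 2.0)` covers two octaves with seven scales, the middle one being the original resolution.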
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
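The idea can be illustrated on a single-channel feature map: a fully-connected unit with k x k weights, applied at every position, behaves like a convolution filter and yields a heat-map instead of one score. This is a toy sketch with hypothetical names, not the DDFD implementation.

```python
def fc_score(patch, weights, bias):
    """Output of one fully-connected unit on a k x k patch."""
    total = bias
    for patch_row, weight_row in zip(patch, weights):
        for p, w in zip(patch_row, weight_row):
            total += p * w
    return total


def heat_map(feature_map, weights, bias):
    """Apply the same unit at every valid position (stride 1), as a convolution would."""
    k = len(weights)  # assumes square k x k weights
    height, width = len(feature_map), len(feature_map[0])
    out = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            patch = [r[x:x + k] for r in feature_map[y:y + k]]
            row.append(fc_score(patch, weights, bias))
        out.append(row)
    return out
```

On a k x k input the heat-map has a single cell equal to the original fully-connected output. On a larger input it grows to (H - k + 1) x (W - k + 1) cells.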
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
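A minimal sketch of greedy NMS, assuming boxes in (x1, y1, x2, y2) format with one score each. The 0.5 overlap threshold is an illustrative default, not a value taken from the slides.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Usage: `nms(boxes, scores)` returns the indices of the surviving detections, so overlapping responses around the same face collapse to the strongest one.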
Detection Faces DDFD Results
Precision vs. recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." Journal of Machine Learning Research 10 (2009): 207-244.
The embedding is learned by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
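Learning from triplets can be sketched with the triplet loss on (anchor, positive, negative) embeddings. The formula and the margin value 0.2 follow the FaceNet paper rather than these slides, and the semi-hard selection rule is one illustrative reading of "well-chosen" triplets. Function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """True if the negative is farther than the positive but still inside the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return d_ap < d_an < d_ap + margin
```

Triplets whose loss is already zero carry no gradient, which is why the choice of triplets matters during training.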
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." British Machine Vision Conference (BMVC) 2015. [software]
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Visual query "A dog" against an image database (figure: expected outcome)
Visual query "This dog" against an image database (figure: expected outcome)
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (figure: table mapping each visual word to the IDs of the images that contain it)
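Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. The sketch below is a toy version with hypothetical names and a simple vote count as the ranking score.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only images sharing at least one word with the query are ever touched, which is what makes the search scale to large databases.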
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
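Sum or max pooling reduces a conv feature map to one value per channel, giving a compact global descriptor. A minimal sketch, assuming the feature map is stored as nested lists indexed [channel][y][x]. The names and the L2 normalization step are illustrative choices, not the exact pipelines of the cited papers.

```python
def pool_conv_features(feature_map, mode="sum"):
    """Pool each channel over all spatial positions ('sum' or 'max')."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_map]


def l2_normalize(vec):
    """Scale a descriptor to unit Euclidean norm (unchanged if it is all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)
```

The descriptor length equals the number of channels, regardless of the input image size.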
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Figure:
- Input image resolution: 336x256
- conv5_1 features from VGG16 [1]: 42x32 spatial positions
- 25K centroids, giving a 25K-D vector
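The encoding step can be sketched as follows: each local conv feature (one per spatial position) is assigned to its nearest centroid, and the image becomes a sparse histogram over centroid indices. The toy centroids and function names are hypothetical. A real codebook would have 25K centroids.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )


def bag_of_words(local_features, centroids):
    """Sparse histogram: visual word index -> number of local features assigned to it."""
    return Counter(nearest_centroid(f, centroids) for f in local_features)
```

Only non-zero bins are stored, which matches the sparse, high-dimensional representation that inverted files rely on.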
Objects Recognition Retrieval
Query representation:
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are shared by the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
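A toy sketch of the last stages: coarse feature maps are upsampled to the finest resolution, the per-pixel feature vectors are concatenated across scales, and class scores are normalized with a softmax. Nearest-neighbour upsampling and all names are assumptions, and the classifier that turns features into scores is omitted.

```python
import math


def upsample_nearest(feature_map, factor):
    """Repeat each cell of a [y][x] grid of feature vectors factor times in both directions."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [feature_map[y // factor][x // factor] for x in range(width * factor)]
        for y in range(height * factor)
    ]


def concat_scales(maps):
    """Concatenate, pixel by pixel, the feature vectors of same-sized maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [[v for m in maps for v in m[y][x]] for x in range(width)]
        for y in range(height)
    ]


def softmax(scores):
    """Turn the class scores of one pixel into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Each pixel thus sees features computed at all scales before its class probabilities are estimated.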
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels (prediction with a 2-layer network)
Solution 2: Superpixels + CRF (prediction with a 2-layer network)
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example (figure): C2 will be labelled with the class of C5
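The selection rule can be sketched on a small tree: for every leaf, walk up to the root and keep the component with the lowest cost. The tree, the costs and the names below are hypothetical, chosen so that leaf C2 ends up covered by C5 as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the minimal-cost component on its path to the root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root has parent None
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are the leaves.
PARENT = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
COST = {"C1": 0.2, "C2": 0.5, "C3": 0.1, "C4": 0.4,
        "C5": 0.3, "C6": 0.6, "C7": 0.9}
```

With these toy costs, `optimal_cover(PARENT, COST, ["C1", "C2", "C3", "C4"])` keeps C1, C3 and C4 at their own level and assigns C2 to its parent C5.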
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Figure: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); their concatenation [1 2] is vector A
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground with it is suboptimal
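A toy sketch of the two operations in the figure: masking the background of a crop with the mean image value, and concatenating the two feature vectors. Images are nested lists of scalar pixels here, and the names and the scalar mean are simplifying assumptions. The CNN feature extraction itself is omitted.

```python
def mask_background(crop, mask, mean_pixel):
    """Replace every pixel outside the region mask with the mean image value."""
    return [
        [pixel if inside else mean_pixel for pixel, inside in zip(pixel_row, mask_row)]
        for pixel_row, mask_row in zip(crop, mask)
    ]


def concat_features(box_features, region_features):
    """Concatenate feature vector 1 (box) and feature vector 2 (region foreground)."""
    return list(box_features) + list(region_features)
```

Masking with the mean image means the background contributes roughly zero after mean subtraction, so the second feature vector focuses on the region itself.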
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Figure: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; their concatenation [1 2] is vector B
- Training: the 2 networks are trained in isolation
- Testing: results are combined
Figure: network producing vector C
- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)
Figure: the output of the penultimate fully connected layer is fed to an SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
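Pixel IU for one class is the Jaccard index between the predicted and ground-truth pixel sets of that class. A minimal sketch, assuming label maps stored as nested lists of integer class IDs. The function names are hypothetical.

```python
def pixel_iu(pred, gt, cls):
    """Jaccard index of class cls between two label maps; None if the class is absent in both."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(p == cls and g == cls)
            union += int(p == cls or g == cls)
    return inter / union if union else None


def mean_pixel_iu(pred, gt, classes):
    """Average the per-class IU over the classes that appear in either map."""
    values = [pixel_iu(pred, gt, c) for c in classes]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

Usage: `mean_pixel_iu(pred, gt, classes)` gives a single score per image or dataset, ignoring classes that never occur.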
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image Database. Visual Query: "A dog". Expected outcome.
Image Database. Visual Query: "This dog". Expected outcome.
Instance Retrieval (Instance: Object, Building, Person, Place...)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, ..., v1n), ..., vk = (vk1, ..., vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: table mapping each visual word to the IDs of the images that contain it
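A small sketch of how an inverted file supports bag-of-visual-words retrieval: each visual word points to the images that contain it, so a query only touches images sharing at least one word. The toy database, the vote-count ranking and the function names are assumptions for illustration, not the scoring used in the cited work.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def query(index, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical database: image ID -> visual words found in that image.
database = {1: [3, 7, 7], 2: [7, 9], 3: [1, 3, 9]}
index = build_inverted_file(database)
ranking = query(index, [7, 9])
```

Because the representation is highly sparse, most words in the vocabulary never need to be visited for a given query.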
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
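A minimal sketch of sum and max pooling of a convolutional feature map into one global descriptor. The (height, width, channels) layout, the random feature map and the final L2 normalization are assumptions for illustration; the cited papers add their own weighting and post-processing.

```python
import numpy as np

def global_descriptor(conv_features, mode="sum"):
    """Pool an (H, W, C) feature map over its spatial axes into a C-dim vector."""
    if mode == "sum":
        pooled = conv_features.sum(axis=(0, 1))
    elif mode == "max":
        pooled = conv_features.max(axis=(0, 1))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Stand-in for the activations of a conv layer (random values, not real features).
rng = np.random.default_rng(0)
feature_map = rng.random((32, 42, 512))

desc_sum = global_descriptor(feature_map, "sum")
desc_max = global_descriptor(feature_map, "max")
```

Either descriptor has one value per channel, so images can be compared with a dot product regardless of their spatial size.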
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Resolution: (336x256)
conv5_1 from VGG16 [1]: (42x32)
25K centroids: 25K-D vector
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
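One way to picture the superpixel solution: average the per-pixel class distributions inside each superpixel and give every pixel of that superpixel the winning class. This is a sketch under that assumption; the toy probabilities, the superpixel map and the function name are made up for illustration.

```python
import numpy as np

def label_by_superpixel(pixel_probs, superpixels):
    """Assign to each superpixel the class with the highest mean probability.

    pixel_probs: (H, W, K) per-pixel class distributions.
    superpixels: (H, W) integer superpixel IDs.
    """
    labels = np.empty(superpixels.shape, dtype=int)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        mean_distribution = pixel_probs[mask].mean(axis=0)
        labels[mask] = mean_distribution.argmax()
    return labels

# 2x2 image, 2 classes. Pixel (0, 1) alone would be labelled class 1.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.3, 0.7]]])
superpixels = np.array([[0, 0],
                        [1, 1]])
labels = label_by_superpixel(probs, superpixels)
```

In the toy example the isolated class-1 prediction in the top row is overruled by its superpixel, which is the spatial consistency the pixel-wise classifier lacks.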
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example from the figure: C2 will be labelled with the class of C5
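A sketch of the optimal-cover rule stated on the slide: for every leaf, walk up the segmentation tree to the root and keep the component with the lowest cost S. The tree, the cost values and the node names are hypothetical and do not reproduce the figure.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the minimal-cost node on its path to the root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best

# Hypothetical tree: leaves p1, p2 under C2; leaf p3 under C3; C2, C3 under root C5.
parent = {"p1": "C2", "p2": "C2", "p3": "C3", "C2": "C5", "C3": "C5", "C5": None}
# Hypothetical costs S (lower means a better observation level).
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "C2": 0.6, "C3": 0.5, "C5": 0.3}

cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

Each pixel then takes the class predicted for its selected component, so different pixels can be observed at different levels of the tree.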
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A
The BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image. Both are concatenated: [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
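A minimal sketch of masking the background with the mean image before feeding a region to the network. The array shapes, pixel values and function name are assumptions for illustration.

```python
import numpy as np

def mask_background(crop, region_mask, mean_image):
    """Keep pixels inside the region; replace the rest with the mean image.

    crop, mean_image: (H, W, 3) arrays. region_mask: (H, W) boolean array.
    """
    return np.where(region_mask[..., None], crop, mean_image)

crop = np.full((2, 2, 3), 200.0)        # toy bounding-box crop
mean_image = np.full((2, 2, 3), 120.0)  # toy dataset mean image
region_mask = np.array([[True, False],
                        [False, True]])  # True = region foreground

masked = mask_background(crop, region_mask, mean_image)
```

If the network subtracts the mean image from its input, the masked background becomes zero and only the foreground carries signal.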
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B
Training: 2 networks (BBOX CNN and REGION CNN) trained in isolation
Testing: results are combined. Feature vector 1 (BBOX CNN) and feature vector 2 (REGION CNN) are concatenated: [1 2]
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer -> SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
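A short sketch of the pixel IU (Jaccard index) measure: per class, the number of pixels in the intersection of prediction and ground truth divided by the number in their union, averaged over classes. Skipping classes absent from both maps is an assumption of this sketch, and the label maps are toy data.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # ignore classes that appear in neither map
            ious.append(intersection / union)
    return float(np.mean(ious))

gt = np.array([[0, 0],
               [1, 1]])
pred = np.array([[0, 1],
                 [1, 1]])
score = pixel_iu(pred, gt, num_classes=2)
```

In the toy case class 0 scores 1/2 and class 1 scores 2/3, so a single mislabelled pixel penalizes both classes.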
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores / Class probabilities
4-step training to share features for RPN and Fast R-CNN
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned)
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned)
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed)
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Tables: Detection Accuracy (Pascal VOC), Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A
Transformation actions
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of states S: (o, h)
o = feature vector from the pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
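A sketch of how the state (o, h) could be assembled. It assumes the 90-dimensional history is the concatenation of one-hot codes for the last 10 actions over 9 possible actions, which is how I read the cited paper; the slide only gives the total dimension. Names and the zero feature vector are placeholders.

```python
import numpy as np

NUM_ACTIONS = 9    # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumption: 10 past actions, 10 * 9 = 90 dimensions

def encode_state(fc6_features, past_actions):
    """Concatenate CNN features with a one-hot history of recent actions."""
    history = np.zeros((HISTORY_LEN, NUM_ACTIONS))
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        history[slot, action] = 1.0
    return np.concatenate([fc6_features, history.ravel()])

# Placeholder features (zeros) and a short action history.
state = encode_state(np.zeros(4096), past_actions=[0, 3, 8])
```

Unused history slots stay at zero, so the state has a fixed length from the first step of a search.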
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward function R: measured against the ground-truth bounding box
Reward function R for the trigger action: reward magnitude 3, minimum IoU 0.6
The reward function considers the number of steps as a cost
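A sketch of the two rewards, assuming the form I recall from the cited paper: +1 or -1 for a transformation action depending on whether the IoU with the ground truth improved, and +3 or -3 for the trigger depending on whether the final IoU reaches 0.6. The exact formulas are not in the slide text, so treat them as assumptions; boxes are (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0

def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the search stops on a good enough box, else -eta."""
    return eta if iou(box, gt_box) >= min_iou else -eta

gt = (0, 0, 10, 10)
r_move = move_reward((5, 5, 15, 15), (2, 2, 12, 12), gt)  # box moved closer
r_stop_early = trigger_reward((2, 2, 12, 12), gt)          # IoU below 0.6
r_stop_good = trigger_reward((0, 0, 10, 9), gt)            # IoU of 0.9
```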
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative
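The labelling rule above can be sketched as follows, with boxes given as (x1, y1, x2, y2). The coordinates and function names are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_window(window, annotated_faces, threshold=0.5):
    """Return 1 (face) if the window overlaps any annotation with IoU > threshold."""
    return int(any(iou(window, face) > threshold for face in annotated_faces))

faces = [(10, 10, 50, 50)]                         # one annotated face
label_near = label_window((12, 12, 52, 52), faces)  # heavy overlap
label_far = label_window((40, 40, 80, 80), faces)   # only a corner overlaps
```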
Detection Faces DDFD Test
Test images are rescaled up and down, 3 times per octave, to find faces of different sizes
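A sketch of the scale schedule, reading "3 times per octave" as three scales per doubling of image size, i.e. consecutive scales differ by a factor of 2^(1/3). The scale range and the function name are assumptions for illustration.

```python
import math

def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scales of the form 2**(k / per_octave) that lie in [min_scale, max_scale]."""
    k_min = math.ceil(per_octave * math.log2(min_scale))
    k_max = math.floor(per_octave * math.log2(max_scale))
    return [2 ** (k / per_octave) for k in range(k_min, k_max + 1)]

# One octave down and one octave up around the original resolution.
scales = pyramid_scales(0.5, 2.0)
```

The detector is then run on each rescaled copy, so a fixed-size window can match faces of different sizes.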
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
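A minimal sketch of greedy non-maximum suppression: keep the highest-scoring detection, drop every remaining box that overlaps it too much, and repeat. The 0.5 IoU threshold and the toy boxes are assumptions; this is the generic greedy variant, not necessarily the exact one used by DDFD.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, best score first."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection of the kept box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou <= iou_threshold]  # discard heavily overlapping boxes
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU above the threshold and is suppressed, while the distant third box survives.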
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces -> FaceNet -> Euclidean space where distances correspond to face similarity
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation.
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
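The pixel IU (Jaccard index) named above can be sketched for one category as follows. This is only an illustration: the function name `pixel_iu`, the toy label maps and the convention for an empty union are assumptions, not taken from the SDS evaluation code.

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index for one category: |P and G| / |P or G|, counted over pixels."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred = (p == category)
            in_gt = (g == category)
            if in_pred and in_gt:
                intersection += 1
            if in_pred or in_gt:
                union += 1
    # Assumed convention: a category absent from both maps scores 0.
    return intersection / union if union else 0.0


# Hypothetical 2x3 label maps: 1 = "person", 0 = background.
prediction = [[1, 1, 0],
              [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
iu_person = pixel_iu(prediction, ground_truth, category=1)
```

Here two pixels are labelled "person" in both maps and four in at least one of them, so the IU for that category is 2/4.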
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Step 1: Train the RPN, initialized with an ImageNet pre-trained model (ImageNet weights, fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1, starting again from ImageNet weights (fine-tuned). The output is class probabilities.
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again (weights from Step 2 kept fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3 (weights from Steps 2 & 3 kept fixed) and the RPN proposals learned in Step 3.
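The four training steps can be summarised as a small schedule table. The sketch below only records what each step trains and whether the convolutional layers are fixed, as read from the slides; the `TrainingStep` class and its field names are hypothetical and do not correspond to any real Faster R-CNN code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStep:
    number: int
    trains: str            # network being optimised in this step
    conv_init: str         # where the conv-layer weights come from
    conv_trainable: bool   # False means the shared conv layers stay fixed
    proposals_from: str    # source of the region proposals, "-" if none needed


SCHEDULE = [
    TrainingStep(1, "RPN", "ImageNet", True, "-"),
    TrainingStep(2, "Fast R-CNN", "ImageNet", True, "RPN from step 1"),
    TrainingStep(3, "RPN", "step 2", False, "-"),
    TrainingStep(4, "Fast R-CNN FC layers", "steps 2 & 3", False, "RPN from step 3"),
]


def steps_with_fixed_conv(schedule):
    """Steps in which the conv layers are frozen, i.e. shared by both networks."""
    return [step.number for step in schedule if not step.conv_trainable]


shared_steps = steps_with_fixed_conv(SCHEDULE)
```

Only steps 3 and 4 keep the convolutional layers fixed, which is what lets the RPN and Fast R-CNN share them at test time.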
[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]
Slide credit: Míriam Bellver
The object is localized based on visual features from AlexNet FC6.
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Set of states S: each state is a pair (o, h)
o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
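A state (o, h) can be assembled roughly as below. The slide only gives the dimensions (4096 and 90); the split of the 90-dimensional history into the last 10 actions, each one-hot coded over 9 actions, is an assumption chosen to match that size, and all names are illustrative.

```python
from collections import deque

NUM_ACTIONS = 9        # assumed: 9 actions, so that 9 * 10 = 90
HISTORY_LENGTH = 10    # assumed number of remembered actions
FEATURE_DIM = 4096     # fc6 feature size given on the slide


def one_hot(action, size=NUM_ACTIONS):
    vec = [0] * size
    vec[action] = 1
    return vec


def build_state(features, past_actions):
    """Concatenate the CNN features o with the binary action history h."""
    history = []
    for action in list(past_actions)[-HISTORY_LENGTH:]:
        history.extend(one_hot(action))
    # Pad with zeros while fewer than HISTORY_LENGTH actions have been taken.
    history.extend([0] * (NUM_ACTIONS * HISTORY_LENGTH - len(history)))
    return list(features) + history


past = deque([2, 5, 8], maxlen=HISTORY_LENGTH)
state = build_state([0.0] * FEATURE_DIM, past)
```

The resulting state has 4096 + 90 entries, with one bit set in the history part for each action taken so far.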
Reward function R for transformation actions: based on the overlap between the current box and the ground-truth bounding box.
Reward function R for the trigger action: a reward of magnitude 3, positive when the box reaches a minimum IoU of 0.6 with the ground truth and negative otherwise.
The reward function considers the number of steps as a cost.
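A minimal sketch of the two rewards follows. The trigger values (3 and a minimum IoU of 0.6) come from the slide; the transformation reward shown here (+1 when the IoU with the ground truth improves, -1 otherwise) is an assumption based on the cited paper, since the formula did not survive in this transcript.

```python
ETA = 3.0   # magnitude of the trigger reward (the "3" on the slide)
TAU = 0.6   # minimum IoU for a successful trigger (from the slide)


def transformation_reward(iou_before, iou_after):
    """Assumed form: +1 if the move increased the IoU with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou, eta=ETA, tau=TAU):
    """+eta if the final box overlaps the ground truth enough, -eta otherwise."""
    return eta if iou >= tau else -eta


improved = transformation_reward(0.30, 0.45)
worsened = transformation_reward(0.45, 0.30)
good_stop = trigger_reward(0.72)
early_stop = trigger_reward(0.40)
```

Because every non-improving move is penalised, long searches accumulate negative reward, which is one way to read the remark that the number of steps acts as a cost.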
Policy function: if the current state is S, which should be the next action A?
The policy is learned with reinforcement learning, using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and the Q-network is category-specific.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
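The greedy policy and the replay memory can be sketched in a few lines. The Q-values below are made-up numbers standing in for the network output, and `ReplayMemory` is a hypothetical helper, not the authors' implementation.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        items = list(self.buffer)
        return rng.sample(items, min(batch_size, len(items)))


# One output unit per action (hypothetical values for 9 actions).
q_values = [0.1, -0.4, 2.3, 0.0, 1.9, -1.0, 0.5, 0.2, 0.7]
best = greedy_action(q_values)

memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(step, best, 1.0, step + 1)
batch = memory.sample(2)
```

With a capacity of 3, only the three most recent experiences remain after five pushes; training batches are then drawn at random from what is stored.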
Datasets for training and testing: PASCAL VOC.
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.
Total samples: 200K positive and 20M negative.
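The labelling rule for sampled sub-windows can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples in pixels; the function names and the example coordinates are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_window(window, faces, threshold=0.5):
    """Positive if the window overlaps any annotated face by more than the threshold."""
    if any(iou(window, face) > threshold for face in faces):
        return "positive"
    return "negative"


faces = [(10, 10, 110, 110)]                               # one annotated face
label_near = label_window((20, 20, 120, 120), faces)       # shifted by 10 px
label_far = label_window((100, 100, 200, 200), faces)      # barely touching
```

The shifted window shares 8100 of 11900 pixels with the face (IoU of about 0.68) and is kept as a positive; the second window overlaps by only a 10x10 corner and becomes a negative.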
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
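Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3). A possible way to list the scales is sketched below; the stopping rules (upscale at most 2x, downscale until the shorter side no longer fits a 227-pixel window) are assumptions made for the example.

```python
def pyramid_scales(height, width, window=227, scales_per_octave=3, max_scale=2.0):
    """Scale factors 2**(k / scales_per_octave) to apply to the test image."""
    scales = []
    # Original size (k = 0) and upscaled levels, up to the assumed max_scale.
    k = 0
    while 2.0 ** (k / scales_per_octave) <= max_scale:
        scales.append(2.0 ** (k / scales_per_octave))
        k += 1
    # Downscaled levels, while the shorter side still fits one window.
    k = -1
    while min(height, width) * 2.0 ** (k / scales_per_octave) >= window:
        scales.append(2.0 ** (k / scales_per_octave))
        k -= 1
    return sorted(scales)


scales = pyramid_scales(500, 600)
```

For a 500x600 image this gives seven scales between 0.5 and 2.0, i.e. two full octaves sampled at three levels each.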
A sliding window of 227x227 is run over the test image.
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
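To see why a larger image yields a heat-map rather than a single score, the sketch below estimates the heat-map size and maps a heat-map cell back to the image window it scores. It assumes an effective stride of 32 pixels for an AlexNet-style network with a 227x227 input and ignores per-layer rounding, so the numbers are approximate.

```python
def heatmap_shape(height, width, window=227, stride=32):
    """Approximate (rows, cols) of the score map for an image of the given size."""
    if height < window or width < window:
        raise ValueError("image smaller than the network input window")
    return ((height - window) // stride + 1, (width - window) // stride + 1)


def cell_to_window(row, col, window=227, stride=32):
    """Image-space box (x1, y1, x2, y2) scored by heat-map cell (row, col)."""
    x1, y1 = col * stride, row * stride
    return (x1, y1, x1 + window, y1 + window)


single = heatmap_shape(227, 227)     # the original fixed-size input
larger = heatmap_shape(419, 547)     # a bigger test image
box = cell_to_window(1, 2)
```

Each heat-map cell corresponds to one position of the 227x227 sliding window, but the convolutional computation is shared between overlapping windows.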
Non-Maximum Suppression (NMS) is applied to avoid overlapped detections.
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
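A common greedy variant of NMS is sketched below: detections are visited from highest to lowest score, and a detection is dropped when it overlaps an already kept one too much. This is a generic illustration, not the exact procedure of the DDFD paper or of the cited blog post, and the 0.3 threshold is an arbitrary assumption.

```python
def nms(detections, iou_threshold=0.3):
    """detections: list of (score, (x1, y1, x2, y2)); returns the kept detections."""

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep the box only if it does not overlap a stronger kept detection.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.8, (20, 20, 120, 120)),      # overlaps the next one heavily
    (0.9, (10, 10, 110, 110)),
    (0.7, (200, 200, 300, 300)),    # a separate face
]
kept = nms(detections)
```

The 0.8 detection is suppressed by the stronger 0.9 detection covering almost the same area, while the isolated 0.7 detection survives.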
Detection Faces DDFD Results
Precision vs Recall curves:
- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity.
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
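Learning from triplets can be illustrated with the triplet loss on squared Euclidean distances: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by some margin. The 2-D embeddings and the margin value of 0.2 below are assumptions chosen for the example.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D embeddings.
anchor = [0.0, 1.0]
positive = [0.0, 0.9]     # same identity, already close to the anchor
negative = [1.0, 0.0]     # different identity, far away

easy = triplet_loss(anchor, positive, negative)
violating = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
```

A triplet that already satisfies the margin contributes zero loss, which is why the choice of informative triplets matters for training.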
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.
[Figure: image database, visual query (“A dog”) and expected outcome]
[Figure: image database, visual query (“This dog”) and expected outcome]
Instance Retrieval (Instance: Object, Building, Person, Place…)
Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v_1 = (v_11, …, v_1n), …, v_k = (v_k1, …, v_kn)
Bag of Visual Words: high-dimensional, highly sparse
Inverted file: [Table: each visual word mapped to the IDs of the images that contain it]
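Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the images that contain it. A minimal sketch follows; the image IDs, the word IDs and the vote-counting score are made up for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: {image_id: iterable of visual-word ids} -> {word: set of image ids}."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_candidates(inverted, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical database: image id -> visual words present in the image.
database = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
index = build_inverted_file(database)
ranking = rank_candidates(index, query_words=[1, 3])
```

Only the images listed under the query's words are touched, so the cost of a query depends on the length of those lists rather than on the size of the database.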
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
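Sum or max pooling of convolutional features collapses each channel of a conv layer into one number, giving a global descriptor with one dimension per channel. The sketch below uses plain nested lists and a tiny made-up activation volume; the L2 normalisation step is a common companion added here as an assumption.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of lists -> C-dim vector."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce = sum if mode == "sum" else max
    return [reduce(value for row in channel for value in row)
            for channel in feature_maps]


def l2_normalize(vector):
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm else list(vector)


# Hypothetical activations: 2 channels of size 2x2.
maps = [[[1.0, 2.0], [0.0, 1.0]],
        [[0.0, 3.0], [0.0, 0.0]]]
sum_descriptor = pool_conv_features(maps, "sum")
max_descriptor = pool_conv_features(maps, "max")
unit_descriptor = l2_normalize(sum_descriptor)
```

Two images can then be compared with a dot product between their normalised descriptors, regardless of the spatial size of the feature maps.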
Pipeline: image at resolution 336x256; local features from conv5_1 of VGG16 [1] on a 42x32 grid; assignment to 25K centroids; 25K-D vector.
Query representation:
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Pyramid of three spatial scales.
The same parameters are used in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l).
Non-linearity: tanh. Pooling: max.
Upsampling and concatenation.
Pixel-wise soft-max classifier.
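The multiscale idea (same parameters at three scales, then upsampling and concatenation) can be mimicked with a toy stand-in. In the sketch below the "convnet" is a single tanh unit applied per pixel, and the pyramid is built by plain subsampling; both are simplifying assumptions, not the architecture of the paper.

```python
import math


def downsample(image, factor):
    return [row[::factor] for row in image[::factor]]


def upsample(image, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def shared_convnet(image, weight, bias):
    """Stand-in for the convnet: the same (weight, bias) is used at every scale."""
    return [[math.tanh(weight * value + bias) for value in row] for row in image]


def multiscale_features(image, weight=1.0, bias=0.0, factors=(1, 2, 4)):
    height, width = len(image), len(image[0])
    per_scale = []
    for factor in factors:
        coarse = shared_convnet(downsample(image, factor), weight, bias)
        full = upsample(coarse, factor)
        per_scale.append([row[:width] for row in full[:height]])
    # Concatenate the responses of the three scales into one vector per pixel.
    return [[[scale[y][x] for scale in per_scale] for x in range(width)]
            for y in range(height)]


image = [[float(x + y) for x in range(4)] for y in range(4)]
features = multiscale_features(image)
```

Every pixel ends up with one feature per scale, so a pixel-wise classifier can use both fine detail and wider context.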
Problem: no spatial consistency among labels.
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing.
Problems with Solutions 1 & 2: observation level.
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
C2 will be labelled with the class of C5.
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
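The optimal cover rule can be sketched on a small tree: for every leaf, walk up to the root and keep the component with the lowest cost S. The tree, the node names and the cost values below are hypothetical, chosen so that the component C2 falls under C5 as in the example above.

```python
def optimal_cover(parent, cost, leaves):
    """parent: {node: parent, or None for the root}; cost: {node: S}.

    Returns {leaf: component with minimal cost on the leaf-to-root path}.
    """
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent[node]
        best[leaf] = choice
    return best


# Hypothetical segmentation tree: leaves p1..p3, components C2 and C5, root C7.
parent = {"p1": "C2", "p2": "C2", "p3": "C7", "C2": "C5", "C5": "C7", "C7": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.4, "C2": 0.6, "C5": 0.2, "C7": 0.7}
cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

The leaves under C2 select C5 because it has the lowest cost on their path, so they would take the class predicted for C5, while p3 keeps its own level.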
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
39
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 1)
Class probabilities
Step 2 Train Fast R-CNN with learned RPN proposals
ImageNet weights(fine tuned)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
40
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rsRPN RPN Proposals
Step 3 The model trained in 2 is used to initialize RPN and train again
Weights from Step 2(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
41
Slide credit Amaia Salvador
Conv Layer 5
Co
nv
laye
rs
RPN Proposals (learned in 3)
Class probabilities
Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3
Weights from Step 2amp3(fixed)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
...by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
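A minimal sketch of a triplet loss on embeddings, in the form used in the cited FaceNet paper: the anchor should be closer to the positive (same identity) than to the negative (different identity) by a margin. The slides do not spell out the formula, so the margin value of 0.2 and the function names are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


easy = triplet_loss((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))  # negative already far away
hard = triplet_loss((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))  # negative as close as positive
print(easy, hard)
```

Only triplets with a non-zero loss contribute to learning, which is why the choice of triplets matters.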
Faces Recognition FaceNet
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 1 (NN1): ZF
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016
Objects Recognition Retrieval
[Figure: visual query “A dog” → image database → expected outcome]
Objects Recognition Retrieval
[Figure: visual query “This dog” → image database → expected outcome]
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word | image IDs
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
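A minimal sketch of an inverted file over visual words, using the toy word and image IDs from the table above. The function names and the simple voting score are assumptions for illustration; real systems usually weight words, for example with tf-idf.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return votes.most_common()


# Toy database consistent with the table: image ID -> visual words it contains.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 102: [2], 10: [3, 4], 23: [4], 6: [4]}
index = build_inverted_file(image_words)
ranking = search(index, [1, 2])
print(index, ranking)
```

Because the bag-of-words vectors are highly sparse, a query only touches the lists of the words it contains rather than every image in the database.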
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max pooled conv features as global representation
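A minimal sketch of sum and max pooling of convolutional feature maps into one global descriptor with one value per channel. The final L2 normalization and the function names are assumptions; the cited papers add their own weighting and post-processing.

```python
import math


def pool_channels(feature_maps, mode="sum"):
    """feature_maps[c][y][x] -> one pooled value per channel c."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vector):
    """Scale the descriptor to unit length so descriptors can be compared."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy activations: 2 channels of size 2x2.
maps = [[[1, 2], [3, 4]], [[0, 5], [1, 0]]]
sum_descriptor = pool_channels(maps, "sum")
max_descriptor = pool_channels(maps, "max")
print(sum_descriptor, max_descriptor, l2_normalize(sum_descriptor))
```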
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Resolution (336x256) → conv5_1 from VGG16 [1] → (42x32) → 25K centroids → 25K-D vector
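A minimal sketch of how local features could be turned into a bag-of-words vector: each local feature is assigned to its nearest centroid (visual word) and the assignments are counted. The toy 2-D features, the two centroids, and the function names are hypothetical; the slide uses 25K centroids learned on conv5_1 features.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )


def bag_of_words(features, centroids):
    """Histogram with one bin per centroid, counting the assigned features."""
    histogram = [0] * len(centroids)
    for feature in features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


centroids = [(0.0, 0.0), (10.0, 10.0)]
features = [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)]
words = [nearest_centroid(f, centroids) for f in features]
histogram = bag_of_words(features, centroids)
print(words, histogram)
```

With 25K centroids the histogram has 25K dimensions and most bins are zero, which is what makes the inverted file efficient.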
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are used in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
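A minimal sketch of a pixel-wise soft-max: every pixel has one score per class, the soft-max turns the scores into probabilities, and the pixel takes the most probable class. The nested-list layout and the function names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_map(score_map):
    """score_map[y][x] is the list of class scores of one pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probabilities = softmax(pixel_scores)
            label_row.append(probabilities.index(max(probabilities)))
        labels.append(label_row)
    return labels


# Toy 1x2 image with 2 classes.
labels = label_map([[[2.0, 1.0], [0.0, 3.0]]])
print(labels)
```

Each pixel is classified independently of its neighbours, which leads to the lack of spatial consistency discussed next.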
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example: C2 will be labelled with the class of C5
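The selection of the optimal component can be sketched as a walk from a leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the node names, and the cost values below are hypothetical; how the cost S is computed is defined in the cited paper and is not reproduced here.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from a leaf to the root."""
    best = leaf
    node = leaf
    while node in parent:  # the root has no parent entry
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: pixels p1 and p2 are leaves, C7 is the root.
parent = {"p1": "C2", "p2": "C5", "C2": "C5", "C5": "C7"}
cost = {"p1": 0.9, "p2": 0.4, "C2": 0.6, "C5": 0.2, "C7": 0.5}
print(optimal_component("p1", parent, cost), optimal_component("p2", parent, cost))
```

In this toy tree the pixels under C2 end up covered by C5, so they take the class predicted for C5.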
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: the same BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; A = [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; B = [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
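A minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with the category in both the prediction and the ground truth, divided by the number labelled with it in either. The label maps and the function name are hypothetical.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled with one category."""
    intersection = 0
    union = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, g in zip(predicted_row, truth_row):
            in_prediction = p == category
            in_truth = g == category
            intersection += int(in_prediction and in_truth)
            union += int(in_prediction or in_truth)
    return intersection / union if union else None  # None: category absent


predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
print(pixel_iu(predicted, ground_truth, 1), pixel_iu(predicted, ground_truth, 0))
```

Averaging this value over all categories gives a single score for the whole labeling.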
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: Conv layers (up to Conv Layer 5) → RPN → RPN Proposals]
Step 3: The model trained in Step 2 is used to initialize the RPN, which is then trained again
Weights from Step 2 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
[Diagram: Conv layers (up to Conv Layer 5) + RPN Proposals (learned in Step 3) → Class probabilities]
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
Weights from Steps 2 & 3 (fixed)
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Míriam Bellver]
Detection Objects Reinforcement Learning
The object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Set of states S: each state is a pair (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
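A minimal sketch of how the state (o, h) could be assembled. The split of the 90 dimensions into 10 past actions with a 9-way one-hot code each follows the cited paper and is not stated on the slide, so treat it as an assumption; the names are hypothetical.

```python
from collections import deque

NUM_ACTIONS = 9      # assumption: 8 transformation actions + 1 trigger
HISTORY_LENGTH = 10  # assumption: 10 past actions, so 9 * 10 = 90 dimensions


def encode_history(past_actions):
    """One-hot encode the most recent actions into a 90-dimensional binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LENGTH)
    for slot, action in enumerate(past_actions):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def make_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action history h."""
    return list(o) + encode_history(past_actions)


history = deque(maxlen=HISTORY_LENGTH)  # old actions drop out automatically
history.append(2)
history.append(8)
state = make_state([0.0] * 4096, history)
print(len(state))
```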
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward Function R
[Figure: ground-truth bounding box]
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Reward Function R for the trigger action
The reward function considers the number of steps as a cost
Trigger reward magnitude: 3
Minimum IoU: 0.6
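A minimal sketch of the trigger reward with the two values from the slide: the agent receives +3 when the final box overlaps the ground truth with an IoU of at least 0.6, and -3 otherwise. The symmetric negative reward follows the cited paper; the function and parameter names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def trigger_reward(box, ground_truth, magnitude=3.0, min_iou=0.6):
    """Reward for stopping the search at the current box."""
    return magnitude if iou(box, ground_truth) >= min_iou else -magnitude


good = trigger_reward((0, 0, 10, 10), (0, 0, 10, 10))
bad = trigger_reward((0, 0, 10, 10), (5, 5, 15, 15))
print(good, bad)
```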
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
- The action-value function is estimated using a neural network that has as many output units as actions
- The algorithm incorporates a replay memory to collect experiences
- Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
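A minimal sketch of the three ingredients named above: the greedy policy over the network outputs, the Q-learning target used to train the network, and a replay memory. The discount factor of 0.9, the memory capacity, and all names are assumptions; the network itself is left out and its outputs are given as plain lists.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Target value for the taken action: r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.add(step, 0, 1.0, step + 1, False)
print(greedy_action([0.1, 0.9, 0.3]), q_learning_target(1.0, [0.5, 2.0]))
```

Sampling past experiences at random from the memory breaks the correlation between consecutive steps of one search sequence.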
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Sub-windows (blocks) are randomly sampled
A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise
Total samples: 200K positive and 20M negative
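A minimal sketch of the labelling rule for sampled sub-windows, assuming boxes given as (x1, y1, x2, y2). The 50% threshold comes from the slide; the function names and the toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, annotated_faces, threshold=0.5):
    """1 (positive) if the window overlaps any annotated face with IoU > threshold."""
    return 1 if any(iou(window, face) > threshold for face in annotated_faces) else 0


faces = [(10, 10, 60, 60)]
positive = label_window((12, 12, 62, 62), faces)  # almost the same box
negative = label_window((50, 50, 100, 100), faces)  # small overlap only
print(positive, negative)
```

Most random windows fall on background, which is consistent with the much larger number of negative samples.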
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
vector A
[Diagram: BBOX CNN → feature vector 1; BBOX CNN (background masked out with the mean image) → feature vector 2; combined as [1 2]]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit Eduard Fontdevila
vector B
[Diagram: BBOX CNN → feature vector 1; REGION CNN → feature vector 2; combined as [1 2]]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit Eduard Fontdevila
vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit Eduard Fontdevila
[Diagram: penultimate fully connected layer → SVM]
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
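For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. The sketch below assumes label maps stored as nested lists of integer category ids; the function name is illustrative.

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels assigned to `category`."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == category
            in_truth = t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # The category may be absent from both maps, in which case IU is undefined.
    return inter / union if union else float("nan")
```

Averaging this value over the categories gives a single score per dataset.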
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3
[Diagram labels: Conv layers (weights from Steps 2 & 3, fixed); Conv Layer 5; RPN Proposals (learned in Step 3); Class probabilities]
Detection Objects Faster R-CNN
Slide credit Amaia Salvador
[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]
Detection Objects Reinforcement L
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Miriam Bellver]
Detection Objects Reinforcement L
Object is localized based on visual features from AlexNet FC6
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark
Detection Objects Reinforcement
Slide credit Míriam Bellver
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
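A possible way to assemble the state (o, h) is sketched below. The split of the 90-dimensional history into 10 past steps of 9 one-hot action slots is an assumption chosen to match the stated dimension, and the function names are illustrative.

```python
NUM_ACTIONS = 9       # assumed number of actions (one-hot slot size)
HISTORY_LENGTH = 10   # assumed number of remembered steps; 9 * 10 = 90


def encode_history(past_actions):
    """Binary vector with one one-hot block per remembered action."""
    recent = past_actions[-HISTORY_LENGTH:]
    h = [0] * (NUM_ACTIONS * HISTORY_LENGTH)
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action history h."""
    return list(o) + encode_history(past_actions)
```

With a 4096-dimensional fc6 vector this gives a state of 4096 + 90 = 4186 values.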
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward Function R
[Figure labels: ground-truth, bounding box]
Detection Objects Reinforcement
Slide credit Míriam Bellver
Reward Function R for trigger action
The reward function considers the number of steps as a cost
[Equation labels: 3; minimum IoU 0.6]
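The two rewards can be sketched as follows, under stated assumptions: a transformation action is rewarded with the sign of the change in IoU with the ground-truth box, and the trigger earns +3 when the final IoU reaches the 0.6 minimum and -3 otherwise. The sign rule and the reading of "3" as the reward magnitude are assumptions, since the formulas themselves did not survive in this transcript. Boxes are (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def transformation_reward(old_box, new_box, gt_box):
    """+1 if the move improved the IoU, -1 if it got worse, 0 if unchanged."""
    diff = iou(new_box, gt_box) - iou(old_box, gt_box)
    return (diff > 0) - (diff < 0)


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the search stops on a good enough box, -eta otherwise."""
    return eta if iou(box, gt_box) >= min_iou else -eta
```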
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
- The algorithm incorporates a replay memory to collect experiences
- Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
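The pieces named above can be sketched without any deep learning framework: a greedy policy over the Q-network outputs, the standard Q-learning target, and a bounded replay memory. The discount factor `gamma`, the memory capacity and all names are hypothetical; the Q-network itself is represented only by its list of output values.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_target(reward, next_q_values, terminal, gamma=0.9):
    """Q-learning target: r for terminal steps, r + gamma * max_a' Q(s', a') otherwise."""
    return reward if terminal else reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state, terminal) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        """Random mini-batch, used to decorrelate consecutive experiences."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```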
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015) [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
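Three scales per octave means consecutive scales differ by a factor of 2^(1/3), so every third step doubles the image size. The sketch below lists the resulting scale factors; the scale range and the function name are hypothetical, as the range used at test time is not given here.

```python
def pyramid_scales(min_scale, max_scale, scales_per_octave=3):
    """Scale factors from min_scale to max_scale, spaced 2**(1/scales_per_octave) apart."""
    scales = []
    k = 0
    while True:
        s = min_scale * 2.0 ** (k / scales_per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for floating-point rounding
            break
        scales.append(s)
        k += 1
    return scales


# Two octaves, from half size to double size.
scales = pyramid_scales(0.5, 2.0)
```

Each scale produces one resized copy of the test image, which is then scanned by the detector.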
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
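The conversion rests on a simple equivalence: a fully-connected unit over an H x W feature map is a dot product, which is the same as sliding an H x W kernel over that map and getting a single output. On a larger map the same kernel gives one output per position. The single-channel toy below illustrates this in plain Python; the sizes and values are made up.

```python
def fc_as_conv(feature_map, kernel):
    """Slide `kernel` over `feature_map` (valid positions only, stride 1).

    When both have the same size the result is a 1x1 map holding the
    fully-connected output; larger inputs give a map of outputs.
    """
    kh, kw = len(kernel), len(kernel[0])
    height, width = len(feature_map), len(feature_map[0])
    out = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            row.append(sum(kernel[i][j] * feature_map[y + i][x + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


weights = [[1, 2], [3, 4]]                    # FC weights reshaped to the map size
same_size = fc_as_conv([[1, 1], [1, 1]], weights)
larger = fc_as_conv([[1, 1, 0], [1, 1, 0]], weights)
```

`same_size` holds the single fully-connected response, while `larger` holds one response per window position, which is what turns a classifier into a detector that outputs a map.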
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
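A greedy, IoU-based variant of NMS can be sketched as follows: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. The threshold value and the use of IoU as the overlap measure are assumptions for illustration; the cited tutorial and DDFD may use a different overlap criterion.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Indices of the detections kept, in decreasing score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it overlaps little with every box kept so far.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```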
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
[Diagram: Faces → FaceNet → Euclidean space where distances correspond to face similarity]
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
The embedding is learned by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
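A triplet consists of an anchor face, a positive (same identity) and a negative (different identity). A common form of the triplet loss, sketched below, asks the anchor to be closer to the positive than to the negative by at least a margin, using squared Euclidean distances between embeddings. The margin value and the toy two-dimensional embeddings are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + margin, 0): zero once the triplet is satisfied."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters during training.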
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016
Objects Recognition Retrieval
[Diagram labels: Visual Query “A dog”; Image Database; Expected outcome]
[Diagram labels: Visual Query “This dog”; Image Database; Expected outcome]
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
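The bag-of-visual-words pipeline with an inverted file can be sketched in a few lines: each local descriptor is assigned to its nearest codebook centroid (its visual word), and the inverted file maps every word to the images that contain it, so a query only touches images sharing at least one word. The codebook, the descriptors and the vote-counting ranking below are toy assumptions, not the configuration of any cited system.

```python
from collections import defaultdict


def nearest_word(descriptor, codebook):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])))


def build_inverted_file(image_descriptors, codebook):
    """Map each visual word to the set of image ids in which it occurs."""
    inverted = defaultdict(set)
    for image_id, descriptors in image_descriptors.items():
        for d in descriptors:
            inverted[nearest_word(d, codebook)].add(image_id)
    return inverted


def rank_candidates(query_descriptors, codebook, inverted):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in {nearest_word(d, codebook) for d in query_descriptors}:
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))


codebook = [[0, 0], [10, 10]]                              # two toy centroids
database = {1: [[0, 1]], 2: [[9, 9], [1, 0]], 3: [[10, 11]]}
inverted = build_inverted_file(database, codebook)
ranking = rank_candidates([[0, 0], [10, 10]], codebook, inverted)
```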
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max pooled conv features as global representation
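Sum or max pooling collapses each channel of a convolutional layer into a single number, so a layer with C channels yields a C-dimensional global descriptor regardless of the image size. The sketch below assumes feature maps stored as `feature_maps[channel][y][x]`; the L2 normalization step and the function names are illustrative additions.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """One value per channel: the sum or the max over all spatial positions."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale the descriptor to unit length so that dot products compare directions."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)
```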
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
[Diagram labels: (336x256) resolution; conv5_1 from VGG16 [1]; (42x32); 25K centroids; 25K-D vector]
Objects Recognition Retrieval
Query Representation:
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets:
theta_i = theta_0 = {filter weights H_l and biases b_l}
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
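Upsampling and concatenation can be sketched as follows: the feature maps computed at the coarser scales are enlarged to the resolution of the finest scale, and the channels of all scales are then stacked so that each pixel gets one multi-scale feature vector. Nearest-neighbour upsampling and the `(channels, factor)` input layout are assumptions made for illustration.

```python
def upsample_nearest(channel, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def concatenate_scales(features_by_scale):
    """features_by_scale: list of (channels, factor), with channels[c][y][x].

    Returns the channels of all scales stacked at the common resolution.
    """
    stacked = []
    for channels, factor in features_by_scale:
        stacked.extend(upsample_nearest(c, factor) for c in channels)
    return stacked


fine = [[[1, 2], [3, 4]]]     # one channel at full resolution
coarse = [[[5]]]              # one channel at half resolution
stacked = concatenate_scales([(fine, 1), (coarse, 2)])
```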
Detection Objects Faster R-CNN
42
Slide credit Amaia Salvador
Detection Accuracy (Pascal VOC)
Timing in ms (Pascal VOC)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
43
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
44
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)...
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
...by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
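To make the role of triplets concrete, here is a minimal sketch of a triplet loss on squared Euclidean distances. The margin value of 0.2 and the 2-D toy embeddings are assumptions for illustration; the slides do not give the margin, and real embeddings would be higher-dimensional.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor, positive = [0.0, 0.0], [0.0, 1.0]              # same identity (toy embeddings)
easy = triplet_loss(anchor, positive, [0.0, 2.0])      # negative already far away
hard = triplet_loss(anchor, positive, [1.0, 0.0])      # negative as close as the positive
```

The easy triplet contributes no loss, while the hard one does, which is why the choice of triplets matters for training.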
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision, ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016
Visual Query "A dog" over the Image Database, with its expected outcome
Visual Query "This dog" over the Image Database, with its expected outcome
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (columns: word, Image ID)
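The bag-of-visual-words pipeline with an inverted file can be sketched as follows. The 2-D features, the three-word vocabulary and the image IDs are toy values invented for this example; a real system would use descriptors such as SIFT and a vocabulary learnt by clustering.

```python
from collections import Counter, defaultdict


def nearest_word(feature, vocabulary):
    """Index of the closest visual word (centroid) to a local feature."""
    return min(range(len(vocabulary)),
               key=lambda w: sum((f - c) ** 2 for f, c in zip(feature, vocabulary[w])))


def bag_of_words(features, vocabulary):
    """Sparse histogram {word: count} for the local features of one image."""
    return Counter(nearest_word(f, vocabulary) for f in features)


def build_inverted_file(images, vocabulary):
    """Map each visual word to the sorted IDs of the images that contain it."""
    inverted = defaultdict(set)
    for image_id, features in images.items():
        for word in bag_of_words(features, vocabulary):
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]      # toy centroids
images = {1: [(0.1, 0.0), (0.9, 1.2)],
          2: [(4.8, 5.1)],
          3: [(1.1, 0.9), (5.2, 4.9)]}
inverted_file = build_inverted_file(images, vocabulary)
```

At query time only the images listed under the query's visual words need to be scored, which is what makes the sparse representation scalable.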
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max pooled conv features as global representation
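Sum and max pooling of convolutional features can be sketched on a toy feature map. The 2x2x2 map below is invented for illustration, and the L2 normalisation step is a common choice assumed here rather than something stated on the slide.

```python
def pool_conv_features(feature_map, mode="sum"):
    """Collapse an H x W x C conv feature map into one C-dimensional vector."""
    locations = [vec for row in feature_map for vec in row]
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(channel) for channel in zip(*locations)]


def l2_normalize(v):
    """Scale a vector to unit length so descriptors can be compared by dot product."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else v


feature_map = [[[1.0, 0.0], [2.0, 3.0]],
               [[0.0, 4.0], [1.0, 1.0]]]              # H=2, W=2, C=2 (toy values)
sum_desc = pool_conv_features(feature_map, "sum")
max_desc = pool_conv_features(feature_map, "max")
global_desc = l2_normalize(sum_desc)
```

Either pooling discards the spatial layout and leaves one value per channel, so the descriptor size equals the number of filters in the chosen layer.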
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Image resolution (336x256), conv5_1 from VGG16 [1] (42x32), 25K centroids, 25K-D vector
Query Representation: Global Search (GS) and Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
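A pixel-wise soft-max classifier can be sketched as a soft-max applied independently to the class scores of each pixel, followed by an argmax. The 2x2 score map with three classes is a toy input assumed for this example; in the actual system the scores would come from the concatenated multi-scale features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Return per-pixel class probabilities and the most likely label for each pixel."""
    probabilities = [[softmax(scores) for scores in row] for row in score_map]
    labels = [[p.index(max(p)) for p in row] for row in probabilities]
    return probabilities, labels


score_map = [[[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]],
             [[0.0, 1.5, 0.3], [1.0, 1.0, 4.0]]]       # H=2, W=2, 3 classes (toy scores)
probabilities, labels = label_pixels(score_map)
```

Because each pixel is labelled on its own, nothing in this step enforces that neighbouring pixels agree.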
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network
Solution 2: Superpixels + CRF. Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
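The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the component with the lowest cost. The tree, the component names and the cost values below are hypothetical and chosen only so that C2 ends up covered by C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the component on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)     # the root maps to None, which ends the walk
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.3, "C2": 0.5, "C3": 0.1, "C4": 0.4,
        "C5": 0.2, "C6": 0.6, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so different pixels can be labelled at different levels of the tree.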
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
vector A = [1 2]: feature vector 1 and feature vector 2 both come from the BBOX CNN; for feature vector 2 the background is masked out with the mean image
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
vector B = [1 2]: feature vector 1 comes from the BBOX CNN and feature vector 2 from the REGION CNN
- Training: 2 networks trained in isolation
- Testing: results are combined
vector C
- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer, followed by an SVM
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
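The pixel IU (Jaccard index) for one category can be sketched directly on two label maps. The 2x3 maps are toy data, and returning 0.0 when the category is absent from both maps is a convention assumed here.

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index between the pixel sets labelled `category` in two label maps."""
    pred = {(y, x) for y, row in enumerate(predicted)
            for x, c in enumerate(row) if c == category}
    gt = {(y, x) for y, row in enumerate(ground_truth)
          for x, c in enumerate(row) if c == category}
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = pixel_iu(predicted, ground_truth, category=1)   # 2 shared pixels out of 4 in the union
```

Averaging this score over categories gives a single number for comparing segmentation systems.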
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Míriam Bellver]
Slide credit: Míriam Bellver
Object is localized based on visual features from AlexNet FC6
Set of actions A
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S: (o, h)
- o = feature vector from pre-trained CNN (fc6, 4096 dim)
- h = history of taken actions (binary vector, dim 90)
Reward function R: defined with respect to the ground-truth bounding box
Reward function R for the trigger action: 3, with minimum IoU 0.6
The reward function considers the number of steps as a cost
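The two rewards can be sketched as small functions of the IoU with the ground-truth box. The value 3 and the minimum IoU of 0.6 come from the slide; the negative branch of the trigger reward and the +1/-1 reward for transformation actions follow my reading of the cited paper and should be treated as assumptions.

```python
def movement_reward(iou_before, iou_after):
    """+1 if a transformation action improved the IoU with the ground truth, else -1 (assumed)."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou, eta=3.0, min_iou=0.6):
    """+eta if the final box reaches the minimum IoU, else -eta (negative branch assumed)."""
    return eta if iou >= min_iou else -eta


good_stop = trigger_reward(0.7)          # box overlaps the object well enough
early_stop = trigger_reward(0.5)         # agent stopped too soon
step = movement_reward(0.2, 0.4)         # the move brought the box closer to the object
```

This sketch does not model the step cost explicitly; it only shows how each reward depends on the overlap with the ground truth.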
Detection Objects Reinforcement Learning
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences and uses a category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
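The greedy policy and the Q-learning target can be sketched without a neural network by treating the Q-network output as a plain list with one value per action. The discount factor of 0.9, the memory size and the example values are assumptions for illustration.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r for terminal steps, else r + gamma * max over next actions."""
    return reward if terminal else reward + gamma * max(next_q_values)


# Replay memory of (state, action, reward, next_state, terminal) experiences.
replay_memory = deque(maxlen=1000)
replay_memory.append(("s0", 1, 1.0, "s1", False))
replay_memory.append(("s1", 2, 3.0, None, True))
batch = random.sample(list(replay_memory), k=2)   # mini-batch used to update the network

action = greedy_action([0.1, 0.7, 0.3])
target = q_target(1.0, [0.5, 2.0])
```

In training, the network output for the taken action would be regressed towards the target computed for each sampled experience.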
Detection Objects Reinforcement Learning
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
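Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3). The sketch below lists the scale factors in a given range; the range from 0.5 to 2.0 is an assumption for this example, since the slide does not state how far the images are rescaled.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2 ** (k / steps_per_octave) that fall within [min_scale, max_scale]."""
    k_min = math.ceil(steps_per_octave * math.log2(min_scale))
    k_max = math.floor(steps_per_octave * math.log2(max_scale))
    return [2.0 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]


scales = pyramid_scales(0.5, 2.0)   # one octave down and one octave up from the original size
```

The detector with its fixed input window is run on each rescaled copy, so faces larger or smaller than the window can still be found.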
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; the two are concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Vector C:
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
The output of the penultimate fully connected layer is fed to an SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al)
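The pixel IU (Jaccard index) for one category can be sketched as below, assuming the prediction and the ground truth are label maps of the same size stored as nested lists. The function name and the toy maps are illustrative only.

```python
def pixel_iu(pred, gt, label):
    """Intersection over union of the pixels carrying `label` in both maps."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == label
            in_gt = g == label
            inter += in_pred and in_gt
            union += in_pred or in_gt
    # Undefined when the label appears in neither map.
    return inter / union if union else float("nan")


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
iu_class_1 = pixel_iu(pred, gt, label=1)
iu_class_0 = pixel_iu(pred, gt, label=0)
```

A dataset-level score would typically average this value over categories; that averaging step is left out here.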
One lecture organized in four parts
Local analysis for Proposals, Detection, Recognition and Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Faster R-CNN
Slide credit: Amaia Salvador
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015 [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR)
Set of states S: (o, h)
o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
h = history of taken actions (binary vector, 90 dimensions)
Reward function R: compares the bounding box with the ground truth
Reward function R for the trigger action: reward magnitude 3, minimum IoU 0.6
The reward function considers the number of steps as a cost
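The reward formulas were on figures that did not survive extraction. The sketch below is one plausible reading, assuming a movement reward of +1 when a transformation improves the IoU with the ground truth and -1 otherwise, and a trigger reward of +3 when the IoU reaches 0.6 and -3 otherwise. Only the constants 3 and 0.6 come from the slide; the function names and the exact form are assumptions to be checked against Caicedo and Lazebnik.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the transformation increased the IoU, else -1."""
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """Assumed form: +eta if the final box reaches IoU >= tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta


gt = (0, 0, 10, 10)
old_box = (5, 5, 15, 15)   # IoU with gt is 25 / 175
new_box = (2, 2, 12, 12)   # IoU with gt is 64 / 136
```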
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
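A generic Q-learning sketch of the pieces named on these slides: a greedy policy over the network outputs, the bootstrapped target used to train the network, and a replay memory. The Q-values are plain lists standing in for the network output, and the discount factor, the memory capacity and all names are hypothetical rather than the authors' settings.

```python
import random


def greedy_action(q_values):
    """Index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r for terminal steps, else r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, transition):
        self.items.append(transition)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # drop the oldest experience

    def sample(self, batch_size, rng=random):
        return rng.sample(self.items, min(batch_size, len(self.items)))


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(("s%d" % step, 0, 1, "s%d" % (step + 1)))
```

Sampling past transitions at random, instead of training on consecutive steps, is the usual motivation for a replay memory.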
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild by TU Graz, with 25k annotated faces on images downloaded from Flickr and 380k manually annotated facial landmarks
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
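A short sketch of what "3 times per octave" means for the scale factors, assuming the pyramid spans a whole number of octaves below and above the original size. The function name and the one-octave range are illustrative, not the DDFD settings.

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) covering the requested octaves."""
    lowest = -octaves_down * steps_per_octave
    highest = octaves_up * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(lowest, highest + 1)]


# One octave down and one octave up around the original size.
scales = pyramid_scales(octaves_down=1, octaves_up=1)
```

Consecutive scales differ by a factor of 2^(1/3), about 1.26, so every third step doubles the image size.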
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
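As an illustration, the window positions of a 227x227 sliding window can be enumerated as follows. The stride of 32 pixels is an assumption for the example; the slide does not state it.

```python
def sliding_windows(width, height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


windows = list(sliding_windows(width=291, height=227))
too_small = list(sliding_windows(width=200, height=200))
```

An image smaller than the window yields no positions, which is one reason the test image is also rescaled.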
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
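A toy sketch of why a fully-connected layer can be rewritten as a convolution: reshaping the FC weights into a kernel of the input size gives the same value on an input of that size, and a map of values (a heat-map) on a larger image. It uses a single channel, a single output unit and plain Python lists; the weights and images are made up.

```python
def fc(patch, weights, bias):
    """Fully-connected unit on a flattened patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def conv_valid(image, kernel, bias):
    """'Valid' cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - kh + 1):
        row = []
        for x in range(len(image[0]) - kw + 1):
            acc = sum(kernel[i][j] * image[y + i][x + j]
                      for i in range(kh) for j in range(kw))
            row.append(acc + bias)
        out.append(row)
    return out


weights = [1, 0, -1, 2]                 # FC weights for a 2x2 input
bias = 1
kernel = [weights[0:2], weights[2:4]]   # same weights reshaped as a 2x2 kernel

patch = [[1, 2], [3, 4]]
image = [[1, 2, 0], [3, 4, 1], [0, 1, 2]]

fc_out = fc(patch, weights, bias)
conv_out = conv_valid(patch, kernel, bias)
heatmap = conv_valid(image, kernel, bias)
```

Each entry of `heatmap` is what the FC unit would output for the corresponding 2x2 window, computed without cropping the windows one by one.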
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
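A generic greedy NMS sketch, assuming boxes as (x1, y1, x2, y2) tuples with one score each and an IoU threshold of 0.5. It is a common textbook variant and not necessarily the implementation from the cited source.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring boxes, dropping those that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of 81/119, so only the first and third detections survive.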
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for Proposals, Detection, Recognition and Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244
by means of well-chosen triplets using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
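The slides mention triplets, but the loss itself was on a figure that did not survive extraction. As a hedged sketch, a triplet loss in its usual form asks the anchor to be closer to the positive than to the negative by a margin, using squared Euclidean distances. The margin value 0.2 and the function names are assumptions, and the toy embeddings are not normalised.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin) for one triplet."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor = (0, 0)
positive = (0, 1)        # same identity as the anchor
easy_negative = (2, 0)   # far away: the constraint is already satisfied
hard_negative = (1, 0)   # as close as the positive: the constraint is violated

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

Easy triplets contribute zero loss, which is why the choice of triplets matters for training.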
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016
Image Database, Visual Query ("A dog"), Expected outcome
Image Database, Visual Query ("This dog"), Expected outcome
Instance Retrieval (Instance: Object, Building, Person, Place…)
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word  Image ID
1     1, 12
2     1, 30, 102
3     10, 12
4     23, 6, 10
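A minimal sketch of an inverted file for bag-of-visual-words retrieval, assuming every image has already been turned into a list of visual word IDs. The image IDs, the word IDs and the vote-counting score are illustrative; real systems usually add TF-IDF weighting.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of images that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in the image.
image_words = {1: [1, 2, 10], 2: [2, 23], 3: [10, 23, 6]}
index = build_inverted_file(image_words)
ranking = query(index, [2, 10])
```

Because the representation is highly sparse, a query only touches the posting lists of its own words instead of every image in the database.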
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
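A small sketch of turning a convolutional feature map into a global descriptor by sum or max pooling over the spatial positions of each channel. The final L2 normalisation is an assumption (it is common in retrieval pipelines), and the two-channel feature map is made up.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """feature_map is a list of channels, each an H x W list of lists."""
    reduce_fn = sum if mode == "sum" else max
    descriptor = [reduce_fn(v for row in channel for v in row)
                  for channel in feature_map]
    norm = math.sqrt(sum(d * d for d in descriptor))
    # L2-normalise so descriptors can be compared with a dot product.
    return [d / norm for d in descriptor] if norm > 0 else descriptor


feature_map = [
    [[1, 2], [0, 0]],   # channel 0
    [[0, 4], [0, 0]],   # channel 1
]
sum_descriptor = pool_conv_features(feature_map, mode="sum")
max_descriptor = pool_conv_features(feature_map, mode="max")
```

The descriptor has one value per channel, so its size does not depend on the image resolution.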
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Pipeline: image at 336x256 resolution; conv5_1 from VGG16 [1] gives a 42x32 feature map; 25K centroids; 25K-D vector
Query representation: Global Search (GS) and Local Search (LS)
One lecture organized in four parts
Local analysis for Proposals, Detection, Recognition and Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013
Pyramid of three spatial scales
The same parameters are shared by the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
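A sketch of the upsampling and concatenation step, assuming nearest-neighbour upsampling by an integer factor and feature maps stored as lists of channels. The slide does not say which interpolation is used, so that choice, the names and the toy maps are assumptions.

```python
def upsample_nearest(channel, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def concat_scales(maps_per_scale, factors):
    """Upsample each scale to the finest resolution and stack the channels."""
    channels = []
    for maps, factor in zip(maps_per_scale, factors):
        channels.extend(upsample_nearest(channel, factor) for channel in maps)
    return channels


fine = [[[1, 2], [3, 4]]]   # one channel at 2x2
coarse = [[[9]]]            # one channel at 1x1 (half resolution)
stacked = concat_scales([fine, coarse], factors=[1, 2])
```

After this step every pixel has one feature vector that mixes fine detail with wider context from the coarser scales.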
Pixel-wise soft-max classifier
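A minimal sketch of a pixel-wise soft-max labelling, assuming each pixel already has one score per class. The score values and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_map(score_map):
    """score_map[y][x] holds per-class scores; return the most likely class per pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            label_row.append(max(range(len(probs)), key=lambda c: probs[c]))
        labels.append(label_row)
    return labels


score_map = [[[1.0, 3.0], [2.0, 0.0]]]   # 1 x 2 image, 2 classes
labels = label_map(score_map)
```

Each pixel is classified independently here, which is why the resulting labels have no built-in spatial consistency.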
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels (prediction with a 2-layer network)
Solution 2: Superpixels + CRF (prediction with a 2-layer network)
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Detection Objects Faster R-CNN
45
Slide credit Amaia Salvador
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46
Detection Objects Reinforcement L
Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Visual Query: "A dog" / Image Database / Expected outcome
Visual Query: "This dog" / Image Database / Expected outcome
Instance Retrieval (instance: object, building, person, place…)
Local hand-crafted features (e.g. SIFT) → N-dimensional feature space, v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) → Bag of Visual Words (high-dimensional, highly sparse) → inverted file (word → image IDs)
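The inverted file mentioned above can be sketched as a mapping from each visual word to the images that contain it. This is a minimal illustration under assumptions: the toy database, the function names and the vote-based ranking are hypothetical, not the indexing scheme of any particular system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))


# Hypothetical toy database: image ID -> visual words found in that image.
database = {1: [3, 7, 7], 2: [7, 9], 3: [1, 3, 9]}
index = build_inverted_file(database)
ranking = candidates(index, [3, 9])
```

Because the bag-of-words vectors are highly sparse, only the images listed under the query's words need to be visited.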
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
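Sum or max pooling of convolutional features into one global descriptor can be sketched as follows. The feature map here is a hypothetical 2 x 2 grid with 3 channels stored as nested lists; a real network would produce a much larger tensor, and the cited methods add steps (weighting, normalization) that this sketch leaves out.

```python
def pool_conv_features(feature_map, mode="sum"):
    """Collapse an H x W x C conv feature map into one C-dimensional vector."""
    reduce_fn = sum if mode == "sum" else max
    locations = [feature for row in feature_map for feature in row]
    channels = len(locations[0])
    return [reduce_fn(f[c] for f in locations) for c in range(channels)]


# Hypothetical 2 x 2 feature map with 3 channels per location.
feature_map = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
    [[2.0, 1.0, 0.0], [1.0, 0.0, 4.0]],
]
sum_descriptor = pool_conv_features(feature_map, "sum")
max_descriptor = pool_conv_features(feature_map, "max")
```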
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
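Turning local conv features into a bag-of-words vector can be sketched as a nearest-centroid assignment followed by a histogram. This is only an illustration under assumptions: it uses 3 made-up centroids in 2-D instead of 25K centroids, and the function names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feature, centroid))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def bag_of_words(local_features, centroids):
    """Histogram of visual words: one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Hypothetical toy setup: 3 centroids and 4 local features in 2-D.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
local_features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [0.0, 1.9]]
bow = bag_of_words(local_features, centroids)
```

With a 42x32 feature map there would be one such assignment per spatial location, and the histogram would have one bin per centroid.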
Objects Recognition Retrieval Query Representation
Global Search (GS)
Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image.
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
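A pixel-wise soft-max classifier can be sketched as a soft-max applied independently at every pixel, followed by picking the most probable class. The score map below is a hypothetical 2 x 2 image with 3 class scores per pixel; in the real system these scores would come from the concatenated multi-scale features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Apply soft-max at every pixel and keep the most probable class index."""
    labels = []
    for row in score_map:
        probs = [softmax(pixel) for pixel in row]
        labels.append([p.index(max(p)) for p in probs])
    return labels


# Hypothetical 2 x 2 image with 3 class scores per pixel.
score_map = [
    [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
    [[0.1, 0.2, 3.0], [1.0, 1.1, 0.9]],
]
labels = label_pixels(score_map)
```

Each pixel is labelled on its own, which is why neighbouring labels are not guaranteed to agree.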
Objects Segmentation Farabet
Problem: no spatial consistency among labels.
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels. Prediction with a 2-layer network.
Solution 2: Superpixels + CRF. Prediction with a 2-layer network.
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
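The selection of the optimal component for each leaf can be sketched as a walk from the leaf to the root that keeps the node with the lowest cost. The tree, node names and cost values below are hypothetical; how the cost S is computed is not covered by this sketch.

```python
def optimal_component(leaf, parent, cost):
    """Walk from a leaf to the root and return the node with minimal cost."""
    best = leaf
    node = leaf
    while node in parent:  # the root has no parent entry
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical segmentation tree: leaves p1..p3, components C1..C3 (C3 is the root).
parent = {"p1": "C1", "p2": "C1", "p3": "C2", "C1": "C3", "C2": "C3"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "C1": 0.3, "C2": 0.6, "C3": 0.5}
cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ("p1", "p2", "p3")}
```

Here p1 and p2 are both covered by C1, while p3 is its own best observation level.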
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Interest in obtaining segments, not just bounding boxes.
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN, concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.
Vector C: training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).
The output of the penultimate fully connected layer is fed to an SVM.
Results on pixel IU (Jaccard index) to evaluate semantic segmentation: convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
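The pixel IU (Jaccard index) for one category can be sketched as the number of pixels labelled with that category in both maps divided by the number labelled with it in either map. The label maps below are hypothetical toy data, and returning 0.0 when the category is absent from both maps is a convention chosen for this sketch.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled with `category`."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    return intersection / union if union else 0.0


# Hypothetical 2 x 3 label maps (0 = background, 1 = object).
predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
iu_object = pixel_iu(predicted, ground_truth, 1)
```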
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Reinforcement Learning
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]
The object is localized based on visual features from AlexNet FC6.
Slide credit: Míriam Bellver
Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)
Set of states S: (o, h)
- o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
- h = history of taken actions (binary vector, 90 dimensions)
Reward function R: computed with respect to the ground-truth bounding box.
Reward function R for the trigger action: 3, minimum IoU 0.6.
The reward function considers the number of steps as a cost.
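The two rewards can be sketched as follows, assuming the formulation commonly described for this method: a transformation earns +1 when it improves the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 when the final IoU reaches 0.6 and -3 otherwise. The function names, the box format (x1, y1, x2, y2) and the handling of ties are assumptions of this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def move_reward(old_box, new_box, target):
    """+1 if the transformation improved the IoU with the target, else -1."""
    return 1 if iou(new_box, target) > iou(old_box, target) else -1


def trigger_reward(box, target, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the target by at least tau, else -eta."""
    return eta if iou(box, target) >= tau else -eta


# Hypothetical boxes: the agent shrinks its box towards the target.
target = (0, 0, 10, 10)
old_box, new_box = (0, 0, 20, 20), (0, 0, 12, 12)
step = move_reward(old_box, new_box, target)
final = trigger_reward(new_box, target)
```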
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning.
The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and uses a category-specific Q-network.
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
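The greedy policy and the Q-learning training target can be sketched as below. The list of Q-values stands in for the network outputs (one per action); the discount factor of 0.9 and the function names are assumptions made for illustration.

```python
def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Target for Q(s, a): r if the episode ended, else r + gamma * max Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical network outputs for one state (one value per action).
q_values = [0.1, 0.7, -0.2, 0.4]
action = greedy_action(q_values)
target = q_learning_target(reward=1.0, next_q_values=[0.5, 0.2])
```

In a replay-memory setup, stored (state, action, reward, next state) experiences would be sampled and the network regressed towards targets of this form.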
Detection Objects Reinforcement Learning
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Best performance with few region proposals.
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise.
Total samples: 200K positive and 20M negative.
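The labelling rule for sampled sub-windows can be sketched with a box IoU and a 50% threshold. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the face annotation and windows below are made-up examples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_positive(window, faces, threshold=0.5):
    """True if the window overlaps any annotated face by more than the threshold."""
    return any(iou(window, face) > threshold for face in faces)


# Hypothetical annotation and two sampled windows.
faces = [(10, 10, 50, 50)]
near = is_positive((12, 12, 52, 52), faces)  # mostly on the face
far = is_positive((40, 40, 80, 80), faces)   # barely touches it
```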
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes.
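Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. A small sketch, where the scale range from 0.5 to 2.0 is an assumption made for illustration:

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    octaves = math.log2(max_scale / min_scale)
    count = int(octaves * steps_per_octave + 1e-9) + 1
    return [min_scale * 2.0 ** (k / steps_per_octave) for k in range(count)]


# Two octaves (0.5 -> 1.0 -> 2.0) with 3 steps each.
scales = pyramid_scales(0.5, 2.0)
```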
Detection Faces DDFD Test
Sliding window of 227x227 over the test image.
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Fully-connected layers are converted to convolutional layers, which allows processing images of any size.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
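The idea of reusing fully-connected weights as a convolution filter can be sketched on a toy scale: the same weights that score one k x k window are slid over a larger image, giving one score per position, i.e. a heat-map. The 2 x 2 weights and the 3 x 4 image are hypothetical, and the operation is a plain cross-correlation without stride or padding.

```python
def fc_score(window, weights, bias):
    """Fully-connected unit: dot product between a k x k window and its weights."""
    return bias + sum(
        w * x
        for w_row, x_row in zip(weights, window)
        for w, x in zip(w_row, x_row)
    )


def heat_map(image, weights, bias):
    """Slide the same weights over the image, using them as a convolution filter."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            fc_score([r[j:j + k] for r in image[i:i + k]], weights, bias)
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Hypothetical 2 x 2 "fully-connected" weights applied to a 3 x 4 image.
weights = [[1.0, 0.0], [0.0, -1.0]]
image = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [5.0, 5.0, 5.0, 5.0],
]
scores = heat_map(image, weights, bias=0.5)
```

The top-left entry of the heat-map equals the score the fully-connected unit would give to the top-left window alone.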
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections.
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
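A generic greedy NMS can be sketched as: sort detections by score, keep the best one, and discard any remaining box whose IoU with a kept box exceeds a threshold. This is a minimal variant written for illustration, not the implementation from the cited source; the box format (x1, y1, x2, y2), the 0.5 threshold and the sample detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_maximum_suppression(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes on one face, one box elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```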
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
118
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47
Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward Function R for trigger action
The reward function considers the number of steps as a cost
Trigger reward magnitude: 3
Minimum IoU: 0.6
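A minimal sketch of a trigger reward of this kind, assuming the reward is +3 when the final box reaches the minimum IoU of 0.6 with the ground truth and -3 otherwise. The sign convention and the names `iou`, `trigger_reward`, `eta` and `tau` are assumptions for illustration, not taken from the slide.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def trigger_reward(box, ground_truth, eta=3.0, tau=0.6):
    """Assumed form: +eta if the box overlaps enough when triggering, else -eta."""
    return eta if iou(box, ground_truth) >= tau else -eta
```

For example, a box covering 80% of the ground truth with no excess area would be rewarded, while a box shifted by half its width would be penalized.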
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
Slide credit: Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
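The three ingredients named above (greedy policy over the network outputs, replay memory, Q-learning target) can be sketched without any deep learning library. The discount factor, the memory size and all function names are hypothetical; the Q-network itself is left out and only its output vector is assumed.

```python
import random
from collections import deque

GAMMA = 0.9  # hypothetical discount factor

# Replay memory: a bounded buffer of (state, action, reward, next_state, done).
replay_memory = deque(maxlen=1000)


def greedy_action(q_values):
    """Policy: index of the action with the largest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, done, gamma=GAMMA):
    """Q-learning regression target for one stored experience."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def sample_batch(memory, batch_size):
    """Draw a random mini-batch of experiences for one training step."""
    return random.sample(list(memory), min(batch_size, len(memory)))


# Two illustrative experiences; states are placeholders.
replay_memory.append(("s0", 1, 1.0, "s1", False))
replay_memory.append(("s1", 8, 3.0, None, True))
```

Sampling from the memory rather than training on consecutive steps is the usual motivation for a replay memory: it decorrelates the experiences used in each update.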
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
25k annotated faces on images downloaded from Flickr
380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks)
Positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative example otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
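Three rescalings per octave means that consecutive scales differ by a factor of 2^(1/3). A small sketch of such a scale schedule follows; the scale range and the function name `pyramid_scales` are assumptions for illustration.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scales between min_scale and max_scale, spaced by 2 ** (1 / steps_per_octave)."""
    first = math.ceil(math.log2(min_scale) * steps_per_octave)
    last = math.floor(math.log2(max_scale) * steps_per_octave)
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]


# One octave down and one octave up around the original size.
scales = pyramid_scales(0.5, 2.0)
print(len(scales))
```

Each image in the pyramid is then scanned with the same fixed-size detector, so small scales reveal large faces and large scales reveal small ones.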
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
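A sliding window scan can be sketched as a generator of window coordinates. The 227x227 window size is from the slide; the stride of 32 pixels and the function name are assumptions.

```python
def sliding_windows(width, height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


# An image only slightly wider than the window yields a handful of positions.
print(list(sliding_windows(291, 227)))
```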
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
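The idea can be illustrated on a toy scale: a fully-connected unit that scores a k x k patch is the same dot product as one position of a k x k convolution, so sliding it over a larger input gives a map of scores. This is a pure-Python sketch with hypothetical names and a single output unit, not the DDFD implementation.

```python
def fc_on_patch(patch, weights, bias):
    """Fully-connected unit: dot product between a k x k patch and its weights."""
    return sum(p * w
               for p_row, w_row in zip(patch, weights)
               for p, w in zip(p_row, w_row)) + bias


def conv_heatmap(image, weights, bias):
    """Same weights applied as a convolution kernel at every valid position."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[fc_on_patch([row[c:c + k] for row in image[r:r + k]], weights, bias)
             for c in range(cols)]
            for r in range(rows)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
weights = [[1, 0],
           [0, 1]]
print(conv_heatmap(image, weights, 0))
```

On an input exactly the size of the kernel the heat-map has a single value, equal to the output of the original fully-connected unit.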
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
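A generic greedy NMS can be sketched as follows. This is the common textbook variant, not necessarily the exact strategy used in DDFD, and the IoU threshold of 0.5 is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the best-scoring boxes, dropping any box that overlaps a kept one.

    detections: list of (score, (x1, y1, x2, y2)).
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (50, 50, 60, 60))]
print(nms(detections))
```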
Detection Faces DDFD Results
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
(Figure labels: person, bag, me, my bag)
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
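Learning from triplets is usually expressed with a triplet loss: the anchor should be closer to a face of the same identity (positive) than to a face of another identity (negative), by at least a margin. The sketch below assumes squared Euclidean distances and a margin of 0.2; both the margin value and the function names are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


# An easy triplet (loss 0) and a hard one (positive and negative equally far).
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

Easy triplets contribute nothing to the gradient, which is why the choice of triplets matters for training.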
Faces Recognition FaceNet
Faces Recognition FaceNet
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 1 (NN1): ZF
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano E., Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N. and Marqués F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (table listing, for each visual word, the IDs of the images that contain it)
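Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, per word, the images in which it occurs. A minimal sketch with hypothetical names and a toy database:

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_candidates(inverted, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy database: image ID -> visual words found in that image.
database = {1: [1, 2], 10: [3, 6], 12: [1, 3]}
inverted_file = build_inverted_file(database)
print(rank_candidates(inverted_file, [1, 3]))
```

Only images sharing at least one word with the query are ever touched, which is what makes the search scale to large databases.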
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
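Sum or max pooling collapses each channel of a convolutional feature map into one number, giving one descriptor per image. The sketch below uses nested lists for a C x H x W tensor; the L2 normalization step and all names are assumptions added for illustration.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps (each H x W) into a C-dimensional descriptor."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in channel for value in row)
            for channel in feature_maps]


def l2_normalize(vector):
    """Scale a descriptor to unit length so that dot products act as cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Two channels of size 2 x 2.
features = [[[1, 2], [3, 4]],
            [[0, 5], [0, 0]]]
print(pool_conv_features(features, "sum"), pool_conv_features(features, "max"))
```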
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Objects Recognition Retrieval
Input resolution: 336x256
conv5_1 from VGG16 [1]
Feature map resolution: 42x32
25K centroids, 25K-D vector
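A bag of local convolutional features can be sketched by treating every spatial position of the feature map as one local descriptor, assigning it to its nearest centroid (visual word), and counting words. The toy sizes below stand in for the 42x32 map and the 25K centroids; function names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])))


def assignment_map(feature_grid, centroids):
    """Replace each local feature in the H x W grid by its visual word."""
    return [[nearest_centroid(feature, centroids) for feature in row]
            for row in feature_grid]


def bow_histogram(assignments, num_centroids):
    """Count how often each visual word appears in the assignment map."""
    histogram = [0] * num_centroids
    for row in assignments:
        for word in row:
            histogram[word] += 1
    return histogram


centroids = [[0, 0], [10, 10]]
grid = [[[1, 0], [9, 9]],
        [[0, 1], [0, 0]]]
words = assignment_map(grid, centroids)
print(words, bow_histogram(words, len(centroids)))
```

With many centroids and few positions per image most histogram bins stay empty, so the resulting vector is sparse and fits an inverted file.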
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Recognition Retrieval
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
(Figure labels: person, bag, me, my bag)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
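A pixel-wise soft-max classifier turns the class scores at each pixel into a probability distribution and labels the pixel with the most probable class. A minimal sketch, assuming the scores are given as an H x W x classes nested list (names are hypothetical):

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Assign each pixel the class with the highest soft-max probability."""
    labels = []
    for row in score_map:
        labels.append([max(range(len(scores)), key=lambda c: softmax(scores)[c])
                       for scores in row])
    return labels


# One row with two pixels and two classes.
print(label_pixels([[[1.0, 3.0], [2.0, 0.0]]]))
```

Each pixel is classified independently here, which is exactly why the raw output lacks spatial consistency.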
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
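One way to read the superpixel solution is as a vote inside each superpixel: per-pixel class distributions are accumulated over the superpixel and every pixel in it receives the winning class. The sketch below assumes that reading; the flat pixel list and the function name are illustrative choices.

```python
def superpixel_labels(pixel_probs, superpixel_ids):
    """Give every pixel the class that dominates its superpixel.

    pixel_probs: one class distribution per pixel.
    superpixel_ids: the superpixel each pixel belongs to.
    """
    sums = {}
    for probs, sp in zip(pixel_probs, superpixel_ids):
        accumulator = sums.setdefault(sp, [0.0] * len(probs))
        for c, p in enumerate(probs):
            accumulator[c] += p
    # Summing and averaging give the same winner within a superpixel.
    winner = {sp: max(range(len(acc)), key=lambda c: acc[c])
              for sp, acc in sums.items()}
    return [winner[sp] for sp in superpixel_ids]


probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
print(superpixel_labels(probs, [0, 0, 1]))
```

In the example the second pixel on its own would pick class 1, but it is relabelled to class 0 because its superpixel as a whole favours class 0.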
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
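The selection rule can be sketched as a walk from a leaf up to the root of the segmentation tree, keeping the node with the lowest cost S. The tree and the cost values below are hypothetical, chosen so that C2 ends up with C5 as in the slide's example.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from the leaf to the root."""
    best = node = leaf
    while node in parent:  # the root has no parent entry
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: C1, C2 -> C5; C3 -> C6; C5, C6 -> C7 (root).
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.5, "C2": 0.6, "C3": 0.1, "C5": 0.2, "C6": 0.4, "C7": 0.9}
print(optimal_component("C2", parent, cost))
```

Different leaves may select components at different levels of the tree, which is how the observation level adapts per pixel.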
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale combinatorial grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: a single BBOX CNN produces feature vector 1 (bounding box) and feature vector 2 (region with the background masked out with the mean image), concatenated as [1 2]
Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: the BBOX CNN produces feature vector 1 and a REGION CNN produces feature vector 2, concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer, followed by an SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
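Pixel IU for one category is the Jaccard index between the set of pixels predicted as that category and the set of pixels annotated as that category. A minimal sketch on label maps stored as nested lists (the function name and the toy maps are illustrative):

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index between predicted and annotated pixels of one category."""
    intersection = union = 0
    for p_row, g_row in zip(predicted, ground_truth):
        for p, g in zip(p_row, g_row):
            in_prediction, in_truth = p == category, g == category
            intersection += in_prediction and in_truth
            union += in_prediction or in_truth
    return intersection / union if union else 0.0


predicted = [[1, 1],
             [0, 0]]
ground_truth = [[1, 0],
                [0, 0]]
print(pixel_iu(predicted, ground_truth, 1))
```

Averaging this value over all categories gives a single score for a semantic segmentation.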
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Transformation actions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of actions A
Terminates the sequence of the current search
Marks the region inhibition-of-return (IoR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
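A pixel-wise soft-max classifier turns the class scores at every pixel into probabilities and keeps the most likely class. The sketch below assumes the scores are already available as a nested list `score_map[y][x][class]`; the names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_pixels(score_map):
    """Return, for every pixel, the index of the class with highest probability."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels

# One row of two pixels with three class scores each (toy values).
toy_scores = [[[2.0, 1.0, 0.1], [0.0, 0.0, 5.0]]]
toy_labels = label_pixels(toy_scores)
```

Each pixel is labelled independently here, which is exactly why the labels lack spatial consistency.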
Objects Segmentation Farabet
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
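The selection rule for the optimal cover can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree, the node names and the cost values below are hypothetical; only the minimal-cost rule comes from the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root maps to None, which ends the walk
        best[leaf] = best_node
    return best

# Hypothetical tree: C6 is the root; C4 and C5 are its children.
toy_parent = {"C1": "C4", "C2": "C5", "C3": "C5", "C4": "C6", "C5": "C6", "C6": None}
toy_cost = {"C1": 0.5, "C2": 0.6, "C3": 0.7, "C4": 0.2, "C5": 0.1, "C6": 0.9}
cover = optimal_cover(toy_parent, toy_cost, ["C1", "C2", "C3"])
```

In this toy tree the leaf C2 inherits the class of C5, because C5 is the cheapest component on the path from C2 to the root.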
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Figure labels: BBOX CNN, feature vector 1; BBOX CNN (background masked out with the mean image), feature vector 2; [1 2] = vector A
Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Training: 2 networks trained in isolation
Testing: results are combined
Figure labels: BBOX CNN, feature vector 1; REGION CNN, feature vector 2; [1 2] = vector B
Objects Segmentation SDS
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
vector C
Objects Segmentation SDS
Penultimate fully connected layer, SVM
Objects Segmentation SDS
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
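The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. A minimal sketch with hypothetical label maps:

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index for one category over two equally sized label maps."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == category
            in_gt = g == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    # Convention chosen here: a category absent from both maps scores 0.
    return intersection / union if union else 0.0

toy_prediction = [[1, 1, 0],
                  [0, 1, 0]]
toy_truth = [[1, 0, 0],
             [0, 1, 1]]
iu_category_1 = pixel_iu(toy_prediction, toy_truth, category=1)
```

Averaging this value over all categories gives a mean IU; whether the original evaluation averages per image or per dataset is not stated on the slide.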
Objects Segmentation SDS
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Set of actions A:
- Terminates the sequence of the current search
- Marks the region with inhibition-of-return (IoR)
Detection Objects Reinforcement
Set of states S: (o, h)
o = feature vector from pre-trained CNN (fc6, 4096 dim)
h = history of taken actions (binary vector, dim 90)
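A sketch of how the state (o, h) could be assembled. It assumes the 90 binary entries come from one-hot encoding the last 10 actions over 9 possible actions; that split, and all names below, are assumptions rather than something stated on the slide.

```python
NUM_ACTIONS = 9       # assumption: 9 possible actions
HISTORY_LENGTH = 10   # assumption: 10 remembered actions, 10 * 9 = 90 entries

def encode_history(past_actions):
    """One-hot encode the most recent actions into a fixed-size binary vector."""
    history = [0] * (NUM_ACTIONS * HISTORY_LENGTH)
    for slot, action in enumerate(past_actions[-HISTORY_LENGTH:]):
        history[slot * NUM_ACTIONS + action] = 1
    return history

def build_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action history h."""
    return list(o) + encode_history(past_actions)

toy_o = [0.0] * 4096                      # placeholder for the fc6 features
state = build_state(toy_o, [2, 7, 8])     # three actions taken so far
```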
Detection Objects Reinforcement
Reward Function R (figure label: ground-truth bounding box)
Detection Objects Reinforcement
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
Figure labels: 3, minimum IoU 0.6
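One plausible reading of the two reward functions, sketched under stated assumptions: a movement is rewarded by the sign of the change in IoU with the ground-truth bounding box, and the trigger earns +3 when the IoU reaches the minimum of 0.6 and -3 otherwise. Interpreting the slide's "3" as the trigger reward magnitude is an assumption, and the step cost mentioned on the slide is not modelled here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def move_reward(old_box, new_box, ground_truth):
    """+1 if the move increased IoU with the ground truth, -1 if it decreased, else 0."""
    delta = iou(new_box, ground_truth) - iou(old_box, ground_truth)
    return (delta > 0) - (delta < 0)

def trigger_reward(box, ground_truth, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps enough with the ground truth, else -eta."""
    return eta if iou(box, ground_truth) >= min_iou else -eta

toy_truth_box = (0, 0, 10, 10)
```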
Detection Objects Reinforcement
Policy function: if the current state is S, which should be the next action A?
Reinforcement Learning using Q-learning
Detection Objects Reinforcement
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay-memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with maximum estimated value of the learnt action-value function
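The pieces named above can be sketched without a neural network: a replay memory that stores experiences, a greedy policy over the estimated action values, and the standard Q-learning target. The class and function names, the memory capacity and the discount factor `gamma` are illustrative assumptions; in the slides the action values come from a Q-network, which is replaced here by plain lists.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, terminal)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def greedy_action(q_values):
    """Policy: index of the action with maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Regression target for the value of the action that was taken."""
    return reward if terminal else reward + gamma * max(next_q_values)

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.store(step, 0, 1.0, step + 1, False)   # the oldest entry is evicted
```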
Detection Objects Reinforcement
Detection Objects Reinforcement
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended Regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement
Best performance with few region proposals
Detection Objects Reinforcement
Detection Objects Reinforcement
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015) [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative examples otherwise
Total samples: 200K positive and 20M negative
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
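Three rescalings per octave means consecutive scale factors differ by a ratio of 2^(1/3). A small sketch that lists the scales of such a pyramid; the scale range 0.5 to 2.0 is an arbitrary assumption.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced 2 ** (1 / steps_per_octave) apart within the range."""
    scales = []
    k = 0
    while True:
        scale = min_scale * 2 ** (k / steps_per_octave)
        if scale > max_scale * (1 + 1e-9):   # small tolerance for float rounding
            break
        scales.append(scale)
        k += 1
    return scales

scales = pyramid_scales(0.5, 2.0)   # two octaves, so 2 * 3 + 1 scale factors
```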
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
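A sliding window enumerates every 227x227 crop on a regular grid. In the sketch below the stride of 32 pixels and the image size are assumptions chosen for illustration.

```python
def sliding_windows(image_width, image_height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of size window x window that fit in the image."""
    for y in range(0, image_height - window + 1, stride):
        for x in range(0, image_width - window + 1, stride):
            yield (x, y, x + window, y + window)

# A hypothetical 291 x 259 image gives 3 horizontal and 2 vertical positions.
boxes = list(sliding_windows(291, 259))
```

Each box would be cropped and classified separately, which is the cost that the fully convolutional formulation avoids.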
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
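The conversion can be illustrated on a single unit: a fully connected unit that expects a k x k input is the same computation as a k x k convolution kernel, so sliding it over a larger image yields a heat-map. The weights and image below are toy values, and the functions are a sketch, not the DDFD implementation.

```python
def fc_score(window, weights, bias):
    """Fully connected unit: dot product over a flattened k x k window."""
    flat = [v for row in window for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias

def fc_as_convolution(image, weights, bias, k):
    """Slide the same weights, reshaped to k x k, over an image of any size."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    height, width = len(image), len(image[0])
    heat_map = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            row.append(sum(kernel[r][c] * image[y + r][x + c]
                           for r in range(k) for c in range(k)) + bias)
        heat_map.append(row)
    return heat_map

toy_weights = [1.0, 0.0, 0.0, -1.0]   # hypothetical weights of a 2 x 2 unit
toy_image = [[1.0, 2.0, 0.0],
             [0.0, 5.0, 1.0],
             [3.0, 0.0, 2.0]]
heat_map = fc_as_convolution(toy_image, toy_weights, bias=0.5, k=2)
top_left = fc_score([[1.0, 2.0], [0.0, 5.0]], toy_weights, bias=0.5)
```

The top-left heat-map entry equals the fully connected score on the top-left window, and the remaining entries come from the same weights applied at the other positions.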
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
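A minimal greedy NMS sketch: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. The IoU threshold of 0.3 and the toy detections are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(detections, iou_threshold=0.3):
    """detections: list of (score, box). Keep the best box of each overlapping group."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

toy_detections = [(0.8, (1, 1, 11, 11)),
                  (0.9, (0, 0, 10, 10)),
                  (0.7, (50, 50, 60, 60))]
kept_detections = non_maximum_suppression(toy_detections)
```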
Detection Faces DDFD Results
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
Faces, FaceNet, Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
... by means of well chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
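Learning an embedding from triplets is commonly written as a hinge on squared distances: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The sketch below uses a margin of 0.2 and toy 2-D embeddings, both assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

toy_anchor = [0.0, 0.0]
toy_positive = [0.1, 0.0]
easy_loss = triplet_loss(toy_anchor, toy_positive, [1.0, 0.0])   # already separated
hard_loss = triplet_loss(toy_anchor, toy_positive, [0.2, 0.0])   # violates the margin
```

Only triplets with a non-zero loss contribute a gradient, which is why the choice of triplets matters during training.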
Faces Recognition FaceNet
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6 [software]
Objects Recognition Retrieval
Mohedano E., Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N., and Marqués F. “Bags of Local Convolutional Features for Scalable Instance Search”, ICMR 2016
Objects Recognition Retrieval
Image Database
Visual Query: “A dog”
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: “This dog”
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
Bag of Visual Words: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
High-dimensional, highly sparse
INVERTED FILE: table listing, for each visual word, the IDs of the images that contain it
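Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, per visual word, the images in which it occurs. A minimal sketch with hypothetical image IDs and word IDs; ranking candidates by the number of shared words is a simplification chosen for illustration.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words maps image ID -> list of visual-word IDs found in that image."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted

def rank_candidates(inverted, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))

toy_database = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3]}   # hypothetical
inverted_file = build_inverted_file(toy_database)
ranking = rank_candidates(inverted_file, [1, 2])
```

Only the images that share at least one word with the query are ever touched, which keeps the search scalable.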
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Set of states S
(oh)
o = feature vector from pre-trained CNN fc6 4096 dim
h = history of taken actions binary vector dim 90
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function Rground-truthbounding box
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet
Faces -> FaceNet -> Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance Metric Learning for Large Margin Nearest Neighbor Classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum Learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
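A minimal sketch of a triplet loss on embeddings, assuming squared Euclidean distances and a margin of 0.2. The toy vectors are hypothetical and are not L2-normalized as real face embeddings would be. The loss is zero when the negative is already farther from the anchor than the positive by at least the margin, which is why the choice of triplets matters.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize triplets where the negative is not farther than the positive by the margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor, positive = [0.0, 1.0], [0.0, 0.9]   # same identity (toy values)
easy_negative = [1.0, 0.0]                  # different identity, far away
hard_negative = [0.0, 0.8]                  # different identity, close to the anchor

easy_loss = triplet_loss(anchor, positive, easy_negative)  # already satisfied
hard_loss = triplet_loss(anchor, positive, hard_negative)  # violates the margin
```

Only the hard triplet produces a non-zero loss, so it is the only one of the two that would drive a gradient update.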

Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper with Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software
Software implementation: OpenFace

Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep Face Recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval
Visual Query: "A dog"
Image Database
Expected outcome

Objects Recognition Retrieval
Visual Query: "This dog"
Image Database
Expected outcome

Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words
N-dimensional feature space
High-dimensional, highly sparse
INVERTED FILE (table: word -> image IDs)
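Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the images that contain it. The sketch below is a minimal illustration; the image IDs, the words, and the scoring by number of shared words are assumptions, not the values or the ranking function from the slide.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def search(inverted, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    # Highest vote count first; ties broken by image ID.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


image_words = {1: [1, 2], 10: [3, 4], 12: [1, 3], 30: [2]}  # hypothetical database
inverted = build_inverted_file(image_words)
ranking = search(inverted, [1, 3])
```

Only the posting lists of the query words are visited, so images that share no word with the query are never touched.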

Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks

Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional weighting for aggregated deep convolutional features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
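A minimal sketch of sum and max pooling over convolutional feature maps, assuming the conv output is a list of channels, each an H x W grid. The function name and the toy activations are hypothetical. Each channel collapses to one number, so the global descriptor has as many dimensions as the layer has channels, whatever the image size.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel's H x W activation map into a single value."""
    reducer = sum if mode == "sum" else max
    return [reducer(value for row in channel for value in row)
            for channel in feature_maps]


# Toy conv output: 2 channels, each 2 x 3.
feature_maps = [
    [[0.0, 1.0, 2.0], [3.0, 0.0, 1.0]],
    [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
]
sum_descriptor = pool_conv_features(feature_maps, "sum")
max_descriptor = pool_conv_features(feature_maps, "max")
```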

Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation

Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1] (42x32)
25K centroids, 25K-D vector
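The encoding on this slide can be sketched as follows, assuming each location of the conv feature map is a local feature that is assigned to its nearest centroid. The toy example uses 3 centroids and 4 two-dimensional features instead of 25K centroids and a 42x32 map; all names and values are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    distances = [sum((f - c) ** 2 for f, c in zip(feature, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def bag_of_words(local_features, centroids):
    """Histogram with one bin per centroid, counting the local features assigned to it."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
local_features = [[0.1, 0.2], [0.9, 1.1], [1.2, 0.8], [3.9, 0.1]]
histogram = bag_of_words(local_features, centroids)
```

The histogram has one dimension per centroid, which is how 25K centroids lead to a 25K-D vector.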

Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)

One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning Hierarchical Features for Scene Labeling." TPAMI 2013.

Objects Segmentation Farabet
Pyramid of three spatial scales

Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}
Non-linearity: tanh
Pooling: max

Objects Segmentation Farabet
Upsampling and concatenation

Objects Segmentation Farabet
Pixel-wise soft-max classifier
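A minimal sketch of a pixel-wise soft-max classifier, assuming every pixel already has one score per class (here 3 hypothetical classes on a 2 x 2 image). Each pixel is classified independently, which is what later leads to the lack of spatial consistency among labels.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over the class scores of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[row][col] holds one score per class; return the most probable class per pixel."""
    labels = []
    for row in score_map:
        row_labels = []
        for scores in row:
            probabilities = softmax(scores)
            row_labels.append(probabilities.index(max(probabilities)))
        labels.append(row_labels)
    return labels


score_map = [
    [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]],
    [[0.0, 0.0, 1.5], [1.0, 1.0, 1.0]],
]
labels = label_pixels(score_map)
```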

Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network

Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network

Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
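The selection rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree shape and the cost values below are hypothetical, chosen only so that leaf C2 ends up covered by C5 as in the slide's example; the actual cost S is defined in the paper.

```python
def optimal_component(leaf, parent, cost):
    """Walk from a leaf to the root and return the node with the lowest cost."""
    best = node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 and C4 into C6, C5 and C6 into the root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C4": 0.5,
        "C5": 0.4, "C6": 0.7, "C7": 0.9}

cover = {leaf: optimal_component(leaf, parent, cost)
         for leaf in ("C1", "C2", "C3", "C4")}
```

With these costs, C2 is better explained by its parent C5 than by itself, so it takes the class predicted for C5.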

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: BBOX CNN
- Feature vector 1 and feature vector 2 are concatenated: [1 2]
- Background masked out with the mean image
- Fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: BBOX CNN and REGION CNN
- Feature vector 1 and feature vector 2 are concatenated: [1 2]
- Training: 2 networks trained in isolation
- Testing: results are combined

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer -> SVM

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
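A minimal sketch of the pixel IU (Jaccard index) for one category, assuming prediction and ground truth are label maps of the same size. The labels and the tiny 2 x 3 maps are hypothetical.

```python
def pixel_iu(prediction, ground_truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    intersection = union = 0
    for predicted_row, truth_row in zip(prediction, ground_truth):
        for predicted, truth in zip(predicted_row, truth_row):
            in_prediction = predicted == category
            in_truth = truth == category
            intersection += in_prediction and in_truth
            union += in_prediction or in_truth
    return intersection / union if union else 0.0


prediction = [["bg", "person", "person"], ["bg", "person", "bag"]]
ground_truth = [["bg", "person", "person"], ["bg", "bag", "bag"]]

iu_person = pixel_iu(prediction, ground_truth, "person")
iu_bag = pixel_iu(prediction, ground_truth, "bag")
```

Averaging the per-category values gives a single score for the whole label map.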

One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu

Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward function R (figure labels: ground-truth, bounding box)

Detection Objects Reinforcement
Slide credit: Míriam Bellver
Reward function R for the trigger action
The reward function considers the number of steps as a cost
Figure labels: 3; minimum IoU 0.6

Detection Objects Reinforcement
Slide credit: Míriam Bellver
Policy function: if the current state is S, which should be the next action A?
Reinforcement learning using Q-learning

Detection Objects Reinforcement
Slide credit: Míriam Bellver
- The action-value function is estimated using a neural network that has as many output units as actions
- The algorithm incorporates a replay memory to collect experiences
- Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
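The two ingredients named on this slide can be sketched as follows: a greedy policy that picks the output unit with the highest estimated value, and a fixed-size replay memory. The class name, the capacity, the Q-values, and the tuple layout are assumptions for illustration; the network itself is not modelled.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        """Random mini-batch of stored experiences, used to update the Q-network."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Index of the output unit (action) with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7, -0.3, 0.4]   # hypothetical network outputs, one per action
action = greedy_action(q_values)

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.add(("state%d" % step, action, 0.0, "state%d" % (step + 1)))
```

During training an exploration strategy is usually mixed in; the greedy rule alone describes the learnt policy.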

Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Detection Objects Reinforcement
Slide credit: Míriam Bellver
Best performance with few region proposals

Detection Faces

Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train
Randomly sampled sub-windows (blocks)
Positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative example otherwise
Total samples: 200K positive and 20M negative
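A minimal sketch of the labelling rule, assuming boxes given as (x1, y1, x2, y2). The face annotation and the two sub-windows are hypothetical; only the 50% IoU threshold comes from the slide.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """Positive if the sub-window overlaps any annotated face with IoU above the threshold."""
    overlaps = any(iou(window, face) > threshold for face in faces)
    return "positive" if overlaps else "negative"


faces = [(40, 40, 100, 100)]                    # one annotated face
windows = [(45, 45, 105, 105), (0, 0, 60, 60)]  # two randomly sampled sub-windows
labels = [label_window(w, faces) for w in windows]
```

Most random sub-windows barely overlap a face, which is consistent with negatives far outnumbering positives.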

Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
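One way to read "3 times per octave" is three scale steps for every doubling of the image size, that is, scale factors of the form 2^(k/3). The sketch below follows that reading; the range of one octave down and one octave up, and the 640x480 image, are assumptions.

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave), from the smallest to the largest."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2 ** (k / steps_per_octave) for k in range(first, last + 1)]


scales = pyramid_scales(octaves_down=1, octaves_up=1)
# Sizes of a hypothetical 640 x 480 test image at every level of the pyramid.
sizes = [(round(640 * s), round(480 * s)) for s in scales]
```

Running a fixed-size detector on every level lets it respond to faces that are larger or smaller than its input window.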

Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
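A minimal sketch of the window positions visited by a 227x227 sliding window. The stride of 32 pixels and the image size are assumptions; the slide only fixes the window size.

```python
def sliding_windows(width, height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of a fixed-size window scanned over the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


# Hypothetical 291 x 259 test image: 3 horizontal and 2 vertical positions.
boxes = list(sliding_windows(width=291, height=259))
```

Each box would be cropped and classified separately, which is the cost that the fully convolutional formulation avoids.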
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Reward Function R for trigger action
The Reward function considers the number of steps as a cost
3
minimum IoU06
Detection Objects Reinforcement
Slide credit Míriam Bellver
Policy function
If the current state is S, which should be the next action A?
Reinforcement learning using Q-learning
Detection Objects Reinforcement
Slide credit Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
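The two ingredients named on this slide, a greedy policy over the Q-network outputs and a replay memory, can be sketched as follows. The class `ReplayMemory`, the function `greedy_action` and the toy Q-values are hypothetical; the Q-network itself is left out and its outputs are assumed to arrive as a plain list with one value per action.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random minibatch, capped at the number of stored experiences.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1)

best = greedy_action([0.1, 0.9, 0.3])  # one Q-value per action (toy numbers)
```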
Detection Objects Reinforcement
Slide credit Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement
Slide credit Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015.
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
25k annotated faces on images downloaded from Flickr
380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise
Total samples: 200K positive and 20M negative
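The labelling rule for training windows can be sketched as below, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The helper names `box_iou` and `is_positive` and the example boxes are hypothetical; only the 50% threshold comes from the slide.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_positive(window, faces, threshold=0.5):
    """A sampled window is a positive example if it overlaps any annotated face enough."""
    return any(box_iou(window, face) > threshold for face in faces)


faces = [(0, 0, 10, 10)]                         # one annotated face (toy coordinates)
positive = is_positive((0, 0, 10, 8), faces)     # IoU = 80 / 100
negative = is_positive((5, 5, 15, 15), faces)    # IoU = 25 / 175
```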
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
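Three rescalings per octave means that consecutive scales differ by a factor of 2^(1/3), so the image size doubles every three steps. A minimal sketch of such a scale schedule follows; the function name `pyramid_scales` and the 0.5 to 2.0 range are assumptions chosen for illustration.

```python
import math


def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors from min_scale to max_scale with per_octave steps per doubling."""
    octaves = math.log2(max_scale / min_scale)
    n_steps = int(math.floor(per_octave * octaves + 1e-9))
    return [min_scale * 2 ** (k / per_octave) for k in range(n_steps + 1)]


# Two octaves (0.5x to 2x) with 3 scales per octave (range is a hypothetical choice).
scales = pyramid_scales(0.5, 2.0)
```

Each scale would then be applied to the test image before running the detector, so that a fixed-size detector can respond to faces of different sizes.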
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
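A sliding window can be sketched as a generator of window coordinates. The 227x227 size comes from the slide; the stride of 32 pixels and the name `sliding_windows` are assumptions made only for this example.

```python
def sliding_windows(height, width, size=227, stride=32):
    """Yield (left, top, right, bottom) for every window that fits inside the image."""
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield (left, top, left + size, top + size)


# A 259x291 image (toy size) gives 2 vertical x 3 horizontal window positions.
windows = list(sliding_windows(height=259, width=291))
```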
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
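The idea can be illustrated with a one-dimensional toy: a fully-connected unit trained on inputs of length k is the same as a convolution filter of size k, so sliding it over a longer input yields one response per position, that is, a heat-map. The functions `fully_connected` and `as_convolution` and the toy weights are hypothetical and only meant to show the equivalence.

```python
def fully_connected(window, weights, bias):
    """Output of a single fully-connected unit on a fixed-size input."""
    return sum(x * w for x, w in zip(window, weights)) + bias


def as_convolution(signal, weights, bias):
    """Apply the same unit at every valid position of a longer input."""
    k = len(weights)
    return [fully_connected(signal[i:i + k], weights, bias)
            for i in range(len(signal) - k + 1)]


weights, bias = [1.0, 0.0, -1.0], 0.5

# Input of the training size: a single output, identical to the FC unit.
single = as_convolution([1.0, 2.0, 4.0], weights, bias)

# Larger input: one response per position, i.e. a 1-D heat-map.
heat_map = as_convolution([1.0, 2.0, 4.0, 4.0, 5.0], weights, bias)
```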
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
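A minimal greedy NMS sketch in plain Python: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. The overlap threshold of 0.3 and the example boxes are assumptions; this is neither the DDFD implementation nor the code from the cited tutorial.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the detections kept after greedy suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```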
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
By means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
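Learning an embedding from triplets is usually written as a hinge on squared distances: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The sketch below assumes that form; the function names, the margin default and the 2-D toy embeddings are illustrative choices, not values read from the slides.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is farther than the positive by the margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


anchor = [1.0, 0.0]
positive = [1.0, 0.0]       # same identity, embedded at the same point
easy_negative = [0.0, 1.0]  # different identity, already far away
hard_negative = [1.0, 0.0]  # different identity, embedded on top of the anchor

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Triplets like the first one contribute nothing to the loss, which is why the choice of triplets matters during training.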
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6.
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image database
Visual query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image database
Visual query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (table: visual word → IDs of the images that contain it)
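Because bag-of-visual-words vectors are highly sparse, they are typically stored in an inverted file that maps each visual word to the images containing it, so a query only touches images that share at least one word with it. A minimal sketch follows; the function names, the word IDs and the image IDs are made up for illustration and do not reproduce the table on the slide.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it appears."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def rank_candidates(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Toy database: image ID -> visual words detected in that image.
image_words = {1: [1, 2], 10: [1, 3], 12: [2, 4]}
index = build_inverted_file(image_words)
ranking = rank_candidates(index, query_words=[1, 2])
```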
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
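Sum or max pooling turns a stack of convolutional feature maps into one global descriptor with a single value per channel. The sketch below assumes the feature maps are given as nested lists (channels x height x width); the function name `pool_conv_features` and the toy activations are hypothetical, and the normalization or weighting steps of the cited papers are left out.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each H x W feature map to one number, giving a C-dimensional vector."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in fmap for value in row)
            for fmap in feature_maps]


# Two channels of 2x2 activations (toy numbers).
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 5], [1, 1]],
]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
```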
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids: 25K-D vector
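The pipeline on this slide can be read as: each location of the conv5_1 map gives one local feature, each local feature is assigned to its nearest centroid (visual word), and the image is described by the histogram of assignments, with one bin per centroid. A minimal sketch under that reading, using 3 toy centroids instead of 25K; all names and numbers are illustrative.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)),
               key=lambda k: squared_distance(feature, centroids[k]))


def bag_of_words(local_features, centroids):
    """Histogram with one bin per centroid, counting the assigned local features."""
    counts = Counter(nearest_centroid(f, centroids) for f in local_features)
    return [counts.get(k, 0) for k in range(len(centroids))]


centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
local_features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
histogram = bag_of_words(local_features, centroids)
```

With 25K centroids the histogram becomes a 25K-dimensional vector that is mostly zeros, which is what makes an inverted file a natural index for it.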
Objects Recognition Retrieval
Query representation: Global Search (GS) and Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets:
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
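The selection rule can be sketched on a small segmentation tree: for every leaf, walk up to the root and keep the node with the smallest cost. The tree, the cost values and the name `optimal_cover` are hypothetical; how the cost S is computed is outside the scope of this sketch.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node on its leaf-to-root path with minimal cost."""
    cover = {}
    for leaf in leaves:
        best = leaf
        node = leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent.get(node)  # None once the root has been visited
        cover[leaf] = best
    return cover


# Toy tree: pixels p1 and p2 belong to component C1, p3 to C2; both hang from the root.
parent = {"p1": "C1", "p2": "C1", "p3": "C2", "C1": "root", "C2": "root"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "C1": 0.3, "C2": 0.6, "root": 0.5}
cover = optimal_cover(parent, cost, leaves=["p1", "p2", "p3"])
```

Each pixel then takes the class predicted for its selected component, so pixels under a reliable parent component inherit that component's label.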
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit Eduard Fontdevila
BBOX CNN: feature vector 1
BBOX CNN (background masked out with the mean image): feature vector 2
Concatenation [1 2]: vector A
Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit Eduard Fontdevila
Training: 2 networks trained in isolation
Testing: results are combined
BBOX CNN: feature vector 1
REGION CNN: feature vector 2
Concatenation [1 2]: vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
118
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Policy function
If the current state is S which should be the next action A
Reinforcement Learning using a Q-learning
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
The action-value function is estimated using a neural network that
has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network
Policy of the agent selection action A with maximum estimated value of the learnt action-value function
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
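A rough sketch of how superpixels can enforce consistency, assuming NumPy. The aggregation rule used here (averaging the per-pixel class distributions inside each superpixel and taking the winner) is an assumption for illustration, and `label_superpixels` is a hypothetical name.

```python
import numpy as np

def label_superpixels(probs, superpixels):
    """Average the per-pixel class distributions inside each superpixel and
    give every pixel of that superpixel the winning class."""
    labels = np.empty(superpixels.shape, dtype=int)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        labels[mask] = probs[mask].mean(axis=0).argmax()
    return labels

# Toy 2x2 image with 2 classes; pixel (0, 1) disagrees with its superpixel.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.8, 0.2], [0.1, 0.9]]])
superpixels = np.array([[0, 0],
                        [0, 1]])
labels = label_superpixels(probs, superpixels)
```

In this toy case the isolated pixel at (0, 1) would be class 1 on its own, but it takes class 0 from the rest of its superpixel.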
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
In the example, C2 will be labelled with the class of C5
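The optimal-cover rule can be sketched in plain Python as a walk from each leaf up to the root. The tree, the node names and the cost values below are hypothetical and chosen only so that the result matches the example on the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]          # move one level up the tree
        best[leaf] = best_node
    return best

# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.3,
        "C5": 0.4, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

With these made-up costs, C2 is covered by C5 (its parent is cheaper than C2 itself), while the other leaves keep their own level.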
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: [1 2], the concatenation of two feature vectors, both extracted with the BBOX CNN
- Feature vector 1: from the bounding box
- Feature vector 2: from the region, with the background masked out with the mean image
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
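A small sketch of the masking step, assuming NumPy. The pixel values and the name `mask_background` are made up for illustration.

```python
import numpy as np

def mask_background(crop, region_mask, mean_image):
    """Keep the foreground pixels of the crop; replace the rest with the mean image."""
    return np.where(region_mask[..., None], crop, mean_image)

crop = np.full((2, 2, 3), 200.0)            # toy bounding-box crop
mean_image = np.full((2, 2, 3), 120.0)      # toy dataset mean image
region_mask = np.array([[True, False],
                        [True, True]])      # True = pixel belongs to the region
masked = mask_background(crop, region_mask, mean_image)

# After the usual mean subtraction, the masked background becomes exactly zero.
centered = masked - mean_image
```

Filling the background with the mean image means that, once the mean is subtracted, background pixels contribute zeros to the network input.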
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: [1 2], with feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
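For reference, pixel IU between a predicted and a ground-truth label map can be computed per class as below. This is a minimal sketch assuming NumPy, with a toy 2x2 example.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union (Jaccard index) between two label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A class absent from both maps has no defined IU.
        ious.append(inter / union if union else float("nan"))
    return ious

pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
ious = pixel_iu(pred, gt, num_classes=2)
```

Here class 0 has 1 pixel in the intersection and 2 in the union, and class 1 has 2 and 3 respectively.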
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Reinforcement
Slide credit: Míriam Bellver
The action-value function is estimated using a neural network that has as many output units as actions
The algorithm incorporates a replay memory to collect experiences
Category-specific Q-network
Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
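A minimal sketch of these ingredients, assuming NumPy and the standard library. The Q-network is replaced by a single linear layer, and the number of actions, the state size and the memory capacity are all assumed values for illustration.

```python
import random
from collections import deque

import numpy as np

NUM_ACTIONS = 9  # assumed size of the action set

def q_values(state, weights):
    """Stand-in Q-network: one linear layer with one output unit per action."""
    return state @ weights

def greedy_action(state, weights):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values(state, weights)))

# Replay memory of (state, action, reward, next_state) experiences.
replay_memory = deque(maxlen=1000)

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, NUM_ACTIONS))
state = rng.standard_normal(4)
action = greedy_action(state, weights)
next_state = rng.standard_normal(4)
replay_memory.append((state, action, 1.0, next_state))

# Experiences are later sampled at random to form a training minibatch.
batch = random.sample(list(replay_memory), k=1)
```

In a category-specific setting, one such set of weights would be kept per object category.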
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks): positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative example otherwise
Total samples: 200K positive and 20M negative
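The labelling rule can be sketched in plain Python as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples, and `is_positive` is a hypothetical helper name.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_positive(window, faces, threshold=0.5):
    """A sampled window is positive if it overlaps any annotated face enough."""
    return any(iou(window, face) > threshold for face in faces)

faces = [(0, 0, 100, 100)]
near = (10, 0, 110, 100)    # shifted by 10 pixels: IoU = 9000 / 11000
far = (60, 0, 160, 100)     # shifted by 60 pixels: IoU = 4000 / 16000
```

With these toy boxes, `near` would be kept as a positive sample and `far` as a negative one.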
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
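One way to read "3 times per octave" is that three rescaling steps double (or halve) the image size, so consecutive scales differ by a factor of 2^(1/3). A sketch under that assumption, with the minimum and maximum scale chosen arbitrarily:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced so that `steps_per_octave` steps double the size."""
    scales = []
    k = 0
    while min_scale * 2 ** (k / steps_per_octave) <= max_scale + 1e-9:
        scales.append(min_scale * 2 ** (k / steps_per_octave))
        k += 1
    return scales

# Assumed range: from half size to double size, i.e. two octaves.
scales = pyramid_scales(0.5, 2.0)
```

Two octaves at 3 steps per octave give 7 scales, including the original size.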
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
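The equivalence behind this conversion can be checked on a toy case, assuming NumPy. A fully connected layer that scores one fixed-size window is reshaped into a kernel and slid over a larger image, and every heat-map value then matches the score the dense layer would give on that window. The window size and the random weights are made up, and the explicit loops are for clarity only; real frameworks get their speed from sharing computation between overlapping windows.

```python
import numpy as np

def dense_score(window, w, b):
    """Fully connected layer applied to one flattened fixed-size window."""
    return float(window.ravel() @ w + b)

def conv_heatmap(image, w, b, size):
    """Same weights reshaped to a size x size kernel and slid over the image."""
    kernel = w.reshape(size, size)
    rows = image.shape[0] - size + 1
    cols = image.shape[1] - size + 1
    heat = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            heat[r, c] = (image[r:r + size, c:c + size] * kernel).sum() + b
    return heat

rng = np.random.default_rng(0)
size = 3                                   # toy window (227 in the slides)
w, b = rng.standard_normal(size * size), 0.1
image = rng.standard_normal((5, 6))        # any size at least as large as the window
heat = conv_heatmap(image, w, b, size)
```

For a 5x6 input and a 3x3 window the heat-map has one score per window position, 3x4 in total.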
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
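A compact sketch of greedy NMS, assuming NumPy. This is a generic version written for illustration, not the code from the cited post, and the 0.5 overlap threshold is an assumed value.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as x1, y1, x2, y2; returns the kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # best score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Overlap of the current best box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]    # drop boxes that overlap too much
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed; the third box does not overlap and is kept.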
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
...by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
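To make the triplet idea concrete, here is a sketch of a triplet loss on L2-normalised embeddings, assuming NumPy. The 2-D embeddings are toy values and the margin of 0.2 is an assumed setting.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: the positive must be closer to the anchor
    than the negative by at least `margin` (value assumed here)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

anchor = l2_normalize(np.array([1.0, 0.0]))
positive = l2_normalize(np.array([1.0, 0.1]))   # same identity, nearby
negative = l2_normalize(np.array([0.0, 1.0]))   # different identity, far away

easy = triplet_loss(anchor, positive, negative)   # already satisfied
hard = triplet_loss(anchor, negative, positive)   # roles swapped: violated
```

Triplets that already satisfy the margin give zero loss and no gradient, which is one reason the choice of triplets matters during training.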
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC) 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Figure: a visual query ("A dog") against an image database, with the expected outcome
Objects Recognition Retrieval
Figure: a visual query ("This dog") against an image database, with the expected outcome
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word | image IDs
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
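A toy sketch of the bag-of-visual-words pipeline with an inverted file, assuming NumPy. The centroids, the 2-D descriptors and the image IDs are made up, and `quantize` and `build_inverted_file` are hypothetical names.

```python
from collections import defaultdict

import numpy as np

def quantize(descriptors, centroids):
    """Assign each local descriptor to its nearest visual word (centroid)."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def build_inverted_file(words_per_image):
    """Map each visual word to the sorted list of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in words_per_image.items():
        for word in words:
            inverted[int(word)].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])   # 3 visual words
descriptors = {
    1: np.array([[0.1, 0.2], [9.0, 9.5]]),
    2: np.array([[0.3, 9.8]]),
    3: np.array([[10.2, 10.1], [0.0, 9.0]]),
}
words = {image_id: quantize(d, centroids) for image_id, d in descriptors.items()}
inverted_file = build_inverted_file(words)
```

At query time only the lists of the words present in the query need to be visited, which is what makes sparse representations fast to search.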
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
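Sum and max pooling of a conv feature map into one global descriptor can be sketched as below, assuming NumPy. The 2x2x2 feature map is a toy example, and the final L2 normalisation is an assumed (though common) post-processing step rather than something stated on the slide.

```python
import numpy as np

def pool_conv_features(feature_map, mode="sum"):
    """feature_map: (H, W, D) conv activations -> L2-normalised D-dim descriptor."""
    flat = feature_map.reshape(-1, feature_map.shape[-1])   # one row per location
    pooled = flat.sum(axis=0) if mode == "sum" else flat.max(axis=0)
    return pooled / np.linalg.norm(pooled)

feature_map = np.array([[[1.0, 0.0], [3.0, 4.0]],
                        [[0.0, 2.0], [0.0, 2.0]]])
sum_desc = pool_conv_features(feature_map, "sum")   # pools (4, 8) before normalising
max_desc = pool_conv_features(feature_map, "max")   # pools (3, 4) before normalising
```

Either way the descriptor length equals the number of channels D, regardless of the image size.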
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids: 25K-D vector
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Datasets for training and testing PASCAL VOC
Two modes of evaluation
1) All attended Regions (AAR)2) Terminal regions (TR)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks
sum/max-pooled conv features as global representation
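A minimal sketch of sum and max pooling of a convolutional feature map into a single global descriptor. The nested-list layout (rows, columns, channels) and the final L2 normalisation are assumptions made for this illustration; the cited papers add weighting and whitening steps that are not shown.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Aggregate an H x W x C feature map into one C-dimensional descriptor.

    feature_map: list of H rows, each a list of W local descriptors,
    each descriptor a list of C channel activations.
    Returns the pooled vector, L2-normalised (assumption of this sketch).
    """
    local_descriptors = [desc for row in feature_map for desc in row]
    channels = list(zip(*local_descriptors))  # one tuple of values per channel
    if mode == "sum":
        pooled = [sum(ch) for ch in channels]
    elif mode == "max":
        pooled = [max(ch) for ch in channels]
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Toy 2 x 2 map with 2 channels (hypothetical activations).
toy_map = [[[1.0, 0.0], [2.0, 0.0]],
           [[0.0, 0.0], [0.0, 4.0]]]
sum_descriptor = pool_conv_features(toy_map, "sum")
max_descriptor = pool_conv_features(toy_map, "max")
```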
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids → 25K-D vector
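A minimal sketch of the quantisation step suggested by these figures: each location of the conv feature map (42 x 32 = 1,344 local descriptors) is assigned to its nearest centroid, and the image becomes a sparse histogram over the codebook. The function names and the tiny 3-centroid codebook are hypothetical stand-ins for the 25K centroids.

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest (squared Euclidean distance) to descriptor."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda k: squared_distance(descriptor, centroids[k]))


def bag_of_words(descriptors, centroids):
    """Sparse histogram {visual word: count}; absent words are simply omitted."""
    histogram = {}
    for descriptor in descriptors:
        word = nearest_centroid(descriptor, centroids)
        histogram[word] = histogram.get(word, 0) + 1
    return histogram


# Toy codebook and local descriptors (hypothetical 2-D values).
codebook = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
local_descriptors = [[1.0, 1.0], [9.0, 9.0], [0.0, 1.0], [10.0, 8.0]]
bow = bag_of_words(local_descriptors, codebook)
```

Storing only the non-zero bins is what makes the 25K-D vector practical to index with an inverted file.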
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets:
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
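A minimal sketch of a pixel-wise soft-max: every pixel carries one score per category, the soft-max turns those scores into probabilities, and the label map keeps the most probable category. The nested-list layout and the function names are assumptions of this illustration.

```python
import math


def softmax(scores):
    """Numerically stable soft-max of a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_softmax(score_map):
    """score_map[y][x] is a list of class scores; returns per-pixel probabilities."""
    return [[softmax(pixel) for pixel in row] for row in score_map]


def label_map(prob_map):
    """Most probable class index at every pixel (first index wins on ties)."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row]
            for row in prob_map]


# Toy 1 x 2 image with 2 categories (hypothetical scores).
toy_scores = [[[2.0, 0.0], [0.0, 0.0]]]
toy_probs = pixelwise_softmax(toy_scores)
toy_labels = label_map(toy_probs)
```

Because each pixel is classified on its own, nothing in this step forces neighbouring pixels to agree.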
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
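A minimal sketch of the optimal-cover rule stated above: for each leaf, walk up to the root and keep the component with the lowest cost S. The tree shape, the node names beyond C2 and C5, and all cost values are hypothetical, chosen so that C2 ends up labelled with the class of C5 as on the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the component on its leaf-to-root path with minimal cost.

    parent: dict {node: parent node, or None for the root}.
    cost:   dict {node: cost S of that component}.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are the leaves.
tree_parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
               "C5": "C7", "C6": "C7", "C7": None}
tree_cost = {"C1": 0.2, "C2": 0.8, "C3": 0.3, "C4": 0.4,
             "C5": 0.3, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(tree_parent, tree_cost, ["C1", "C2", "C3", "C4"])
```

With these costs, C2 is covered by its parent C5, while C1, C3 and C4 remain their own best observation level.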
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. “Simultaneous Detection and Segmentation” (ECCV 2014)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2]
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; the two are concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
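A minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. The label maps are plain nested lists and the function name is an assumption of this illustration.

```python
def pixel_iu(prediction, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in two label maps."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred = pred == category
            in_gt = gt == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    # Undefined when the category appears in neither map.
    return intersection / union if union else float("nan")


# Toy 2 x 3 label maps (hypothetical): category 1 is the object, 0 is background.
predicted = [[1, 1, 0],
             [0, 1, 0]]
annotated = [[1, 0, 0],
             [0, 1, 1]]
iu_object = pixel_iu(predicted, annotated, 1)
```

Benchmark scores are usually the mean of this value over all categories.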
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Datasets for training and testing: PASCAL VOC
Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Best performance with few region proposals
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. “Multi-view Face Detection Using Deep Convolutional Neural Networks.” ICMR 2015 [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise
Total samples: 200K positive and 20M negative
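A minimal sketch of the labelling rule above: a sampled sub-window counts as positive when its IoU with some annotated face exceeds 0.5. The (x1, y1, x2, y2) box format, the function names and the toy coordinates are assumptions of this illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(window, faces, threshold=0.5):
    """True if the window overlaps any annotated face with IoU above the threshold."""
    return any(box_iou(window, face) > threshold for face in faces)


# Toy annotation and two sampled windows (hypothetical coordinates).
annotated_faces = [(0, 0, 10, 10)]
good_window = (0, 0, 10, 8)     # IoU = 80 / 100 = 0.8
poor_window = (5, 5, 15, 15)    # IoU = 25 / 175, about 0.14
labels = [is_positive(w, annotated_faces) for w in (good_window, poor_window)]
```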
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
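A minimal sketch of what 3 scales per octave means: consecutive scale factors differ by 2^(1/3), so three steps double (or halve) the image size. The scale range of 0.5 to 2.0 and the function name are assumptions of this illustration, not values from the slide.

```python
import math


def pyramid_scales(min_scale, max_scale, scales_per_octave=3):
    """Scale factors 2**(k / scales_per_octave) that fall within [min_scale, max_scale]."""
    k_min = math.ceil(scales_per_octave * math.log2(min_scale))
    k_max = math.floor(scales_per_octave * math.log2(max_scale))
    return [2.0 ** (k / scales_per_octave) for k in range(k_min, k_max + 1)]


# One octave down and one octave up around the original size.
scales = pyramid_scales(0.5, 2.0)
```

The fixed-size detector is then run on every rescaled copy, so small faces are found in the enlarged images and large faces in the reduced ones.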
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015
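A minimal sketch of why the conversion works: a fully-connected unit trained on a k x k input is a dot product, and sliding that same dot product over a larger image is a k x k convolution (cross-correlation, stride 1, no padding) that yields one score per window position. The single-channel toy image and the weights are hypothetical.

```python
def fc_unit(patch, weights, bias):
    """Fully-connected unit applied to a flattened k x k patch."""
    flat = [value for row in patch for value in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, k):
    """Slide the same FC weights over the image as a k x k convolution."""
    height, width = len(image), len(image[0])
    heat_map = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            patch = [image[y + dy][x:x + k] for dy in range(k)]
            row.append(fc_unit(patch, weights, bias))
        heat_map.append(row)
    return heat_map


# FC unit "trained" on 2 x 2 inputs (hypothetical weights), applied to a 3 x 3 image.
fc_weights = [1, 0, 0, 1]
toy_image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
toy_heat_map = fc_as_conv(toy_image, fc_weights, 0, 2)
```

On an input of exactly k x k the output is a single value, identical to the original fully-connected unit; on a larger input it is a map of scores.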
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapping detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
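A minimal sketch of a greedy, IoU-based NMS: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap an already kept one by more than a threshold. This is one common variant and not necessarily the one in the cited tutorial or in DDFD; the 0.5 threshold, box format and toy detections are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the detections kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two overlapping detections of one face plus a separate one (hypothetical).
detections = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
confidences = [0.9, 0.8, 0.7]
kept_indices = non_maximum_suppression(detections, confidences)
```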
Detection Faces DDFD Results
Precision vs Recall curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244
Faces Recognition FaceNet
The embedding is learned by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
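A minimal sketch of a triplet loss on embeddings: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The margin value of 0.2, the function names and the 2-D toy embeddings are assumptions of this illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Toy embeddings (hypothetical): one easy and one hard negative.
anchor_face = [0.0, 0.0]
same_person = [0.1, 0.0]
easy_negative = [1.0, 0.0]   # far away: the margin is already satisfied
hard_negative = [0.2, 0.0]   # almost as close as the positive
easy_loss = triplet_loss(anchor_face, same_person, easy_negative)
hard_loss = triplet_loss(anchor_face, same_person, hard_negative)
```

Easy triplets contribute zero loss, which is why the choice of triplets matters for training.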
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Best performance with few region proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference (BMVC) 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query
“A dog”
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query
“This dog”
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space:
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: word → Image IDs
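A small sketch of the inverted file idea, assuming each image has already been described by a set of visual word IDs. The image IDs, word IDs and the simple vote-counting score are made up for illustration; real systems typically weight the votes (for example with tf-idf).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict mapping image ID -> iterable of visual word IDs.
    Returns a dict mapping visual word ID -> set of image IDs containing it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def query(inverted, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Highest vote count first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    database = {1: [1, 2, 23], 12: [1, 6], 30: [2, 10]}  # hypothetical data
    index = build_inverted_file(database)
    print(query(index, [1, 2]))
```

Because the bag-of-words vectors are highly sparse, only the images listed under the query's words are ever touched.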
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
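As a rough illustration of sum and max pooling of convolutional features, the sketch below collapses each channel of a conv layer output into a single number, giving one global descriptor per image. The nested-list layout (channels, then rows, then columns) and the function name are assumptions made for this example; the cited papers add further steps such as normalization or spatial weighting.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of lists.
    Returns a C-dimensional global descriptor."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    aggregate = sum if mode == "sum" else max
    return [aggregate(value for row in channel for value in row)
            for channel in feature_maps]


if __name__ == "__main__":
    # Two channels of a hypothetical 2 x 2 conv output.
    maps = [[[1, 2], [3, 4]],
            [[0, 5], [1, 0]]]
    print(pool_conv_features(maps, "sum"))
    print(pool_conv_features(maps, "max"))
```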
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Resolution: (336x256)
conv5_1 from VGG16 [1]: (42x32)
25K centroids: 25K-D vector
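The numbers above suggest that each of the 42x32 = 1344 spatial positions of the conv layer yields one local descriptor, which is assigned to the nearest of the 25K centroids. A toy sketch of that assignment step follows, with three made-up 2-D centroids instead of 25K; all names and data are hypothetical.

```python
from collections import Counter


def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest (squared Euclidean) to the descriptor."""
    best_index, best_distance = 0, float("inf")
    for index, centroid in enumerate(centroids):
        distance = sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def assignment_map(feature_grid, centroids):
    """feature_grid: H x W grid of local descriptors -> H x W grid of word IDs."""
    return [[nearest_centroid(descriptor, centroids) for descriptor in row]
            for row in feature_grid]


def bag_of_words(assignments):
    """Sparse histogram: visual word ID -> number of positions assigned to it."""
    return Counter(word for row in assignments for word in row)


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    grid = [[[0.1, 0.0], [0.9, 1.2]],
            [[4.0, 6.0], [1.0, 0.8]]]
    words = assignment_map(grid, centroids)
    print(words, dict(bag_of_words(words)))
```

Keeping the assignment map, and not only the histogram, preserves where each visual word occurs in the image.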
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
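A pixel-wise soft-max classifier can be sketched as follows, assuming every pixel already has one raw score per class (here a plain nested list). The function names and the toy scores are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def label_pixels(score_map):
    """score_map: H x W grid of per-class score lists.
    Returns an H x W grid with the index of the most probable class."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probabilities = softmax(scores)
            label_row.append(probabilities.index(max(probabilities)))
        labels.append(label_row)
    return labels


if __name__ == "__main__":
    scores = [[[2.0, 1.0, 0.1], [0.0, 3.0, 1.0]]]  # 1 x 2 image, 3 classes
    print(label_pixels(scores))
```

Each pixel is labelled independently of its neighbours, so nothing in this step enforces that adjacent pixels agree.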
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
Solution 3: Multi-level parsing
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example: C2 will be labelled with the class of C5
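The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost S. The tree shape and the cost values below are invented for illustration; they are chosen so that C2 ends up covered by C5, as in the example on the slide.

```python
def optimal_cover(parent, cost, leaves):
    """parent: node -> parent node (the root maps to None).
    cost: node -> cost S of that component.
    Returns leaf -> minimal-cost component on the path from leaf to root."""
    cover = {}
    for leaf in leaves:
        best = leaf
        node = parent[leaf]
        while node is not None:
            # Strict comparison: on ties the finer (lower) component is kept.
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        cover[leaf] = best
    return cover


if __name__ == "__main__":
    # Hypothetical tree: C7 is the root, C5 and C6 its children.
    parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
              "C5": "C7", "C6": "C7", "C7": None}
    cost = {"C1": 0.2, "C2": 0.5, "C3": 0.4, "C4": 0.45,
            "C5": 0.3, "C6": 0.1, "C7": 0.6}
    print(optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"]))
```

Each leaf then takes the class predicted for its selected component.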
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
vector A
BBOX CNN on the bounding box: feature vector 1
BBOX CNN on the region foreground (background masked out with the mean image): feature vector 2
Concatenation: [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
vector B
BBOX CNN: feature vector 1
REGION CNN: feature vector 2
Concatenation: [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
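For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on label grids, with made-up data and a hypothetical function name:

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category`.
    Both inputs are H x W grids of integer category labels."""
    intersection = 0
    union = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, g in zip(predicted_row, truth_row):
            in_prediction = p == category
            in_truth = g == category
            intersection += in_prediction and in_truth
            union += in_prediction or in_truth
    # Undefined when the category appears in neither labeling.
    return intersection / union if union else float("nan")


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    truth = [[1, 0, 0],
             [0, 1, 1]]
    print(pixel_iu(predicted, truth, category=1))
```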
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Objects Reinforcement
Slide credit: Míriam Bellver
Detection Faces
Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
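The labelling rule above can be sketched as follows, assuming axis-aligned boxes given as (x1, y1, x2, y2) corners. The box format, function names and example coordinates are assumptions made for this illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2), x2 > x1, y2 > y1."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return intersection / (area_a + area_b - intersection)


def is_positive(window, annotated_faces, threshold=0.5):
    """A sampled window is a positive example if its IoU with any annotated
    face exceeds the threshold (50% on the slide), negative otherwise."""
    return any(box_iou(window, face) > threshold for face in annotated_faces)


if __name__ == "__main__":
    faces = [(1, 0, 11, 10)]
    print(is_positive((0, 0, 10, 10), faces))   # large overlap
    print(is_positive((6, 0, 16, 10), faces))   # small overlap
```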
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
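Three scales per octave means consecutive pyramid levels differ by a factor of 2^(1/3), so the image size doubles every three levels. A short sketch that lists the scale factors; the range limits `min_scale` and `max_scale` are hypothetical parameters, not values from the slide.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    if min_scale <= 0 or steps_per_octave <= 0:
        raise ValueError("min_scale and steps_per_octave must be positive")
    scales = []
    level = 0
    while True:
        scale = min_scale * 2.0 ** (level / steps_per_octave)
        # Small tolerance so that max_scale itself is included despite rounding.
        if scale > max_scale * (1 + 1e-9):
            break
        scales.append(scale)
        level += 1
    return scales


if __name__ == "__main__":
    for scale in pyramid_scales(0.5, 2.0):
        print(round(scale, 3))
```

With the limits 0.5 and 2.0 the pyramid spans two octaves, so it has seven levels.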
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
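A sliding window simply enumerates every crop of the window size at a fixed step. In the sketch below the 227x227 window comes from the slide, while the stride of 32 pixels is an assumed value chosen only for the example.

```python
def sliding_windows(image_width, image_height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) for every window that fits inside the image."""
    for y in range(0, image_height - window + 1, stride):
        for x in range(0, image_width - window + 1, stride):
            yield (x, y, x + window, y + window)


if __name__ == "__main__":
    # A hypothetical 291 x 227 image: three horizontal positions, one vertical.
    for box in sliding_windows(291, 227):
        print(box)
```

An image smaller than the window yields no crops at all, which is one reason the test image is also rescaled.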
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
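The conversion works because a fully-connected unit over a fixed-size patch is the same computation as a convolution filter of that size evaluated at one position. The toy sketch below, with a single 2x2 "FC" unit and made-up weights, shows that sliding the same weights over a larger image returns one score per position.

```python
def fc_score(patch, weights, bias):
    """Fully-connected unit applied to a patch of the same size as `weights`."""
    return bias + sum(w * p
                      for weight_row, patch_row in zip(weights, patch)
                      for w, p in zip(weight_row, patch_row))


def conv_heatmap(image, weights, bias):
    """The same unit applied as a 'valid' convolution (no padding, stride 1)
    over an image of any size at least as large as the weights."""
    k_h, k_w = len(weights), len(weights[0])
    rows = len(image) - k_h + 1
    cols = len(image[0]) - k_w + 1
    return [[fc_score([row[x:x + k_w] for row in image[y:y + k_h]], weights, bias)
             for x in range(cols)]
            for y in range(rows)]


if __name__ == "__main__":
    weights, bias = [[1, 0], [0, 1]], 0.5       # hypothetical trained unit
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # larger than the 2 x 2 input
    print(conv_heatmap(image, weights, bias))
```

On an input of exactly the original size the output has a single value, identical to the fully-connected result.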
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
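One common way to implement NMS is the greedy variant sketched below: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. This is a generic illustration, not necessarily the code of the cited tutorial; the IoU threshold of 0.3 and the (score, box) format are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return intersection / (area_a + area_b - intersection)


def non_maximum_suppression(detections, iou_threshold=0.3):
    """detections: list of (score, box). Returns the kept detections,
    highest score first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(box_iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


if __name__ == "__main__":
    detections = [(0.8, (1, 1, 11, 11)),
                  (0.9, (0, 0, 10, 10)),
                  (0.7, (20, 20, 30, 30))]
    print(non_maximum_suppression(detections))
```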
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition FaceNet
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59
Detection Objects Reinforcement Slide credit Miacuteriam Bellver
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces
60
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software
Software implementation: OpenFace

Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval
Mohedano E., Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N., and Marqués F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.

Image Database, Visual Query “A dog”, Expected outcome
Image Database, Visual Query “This dog”, Expected outcome
Instance Retrieval (Instance: Object, Building, Person, Place…)

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: table from each visual word to the IDs of the images that contain it
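
Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images in which it occurs. The sketch below illustrates that data structure; the function names, the toy database and the vote-counting score are assumptions for the example, not the scoring used in any cited paper.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict mapping image ID -> list of visual word IDs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def query(inverted, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


database = {1: [1, 2, 2], 10: [3, 6], 12: [1, 3]}  # hypothetical toy data
inverted = build_inverted_file(database)
print(query(inverted, [1, 3]))
```

Only the images that share at least one word with the query are ever visited, which is what makes the search scale to large databases.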

Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky A., Sutskever I., & Hinton G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation
Babenko A., Slesarev A., Chigorin A., & Lempitsky V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian A., Azizpour H., Sullivan J., & Carlsson S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko A., & Lempitsky V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias G., Sicre R., & Jégou H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis Y., Mellina C., & Osindero S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
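
Sum or max pooling of convolutional features collapses an H x W grid of local descriptors into a single vector with one value per channel. The sketch below assumes a feature map stored as nested lists and adds L2 normalization, which is a common choice but an assumption here; the cited papers add further steps (weighting, whitening, regions) that are not shown.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """feature_map: H rows, each a list of W local descriptors (C floats)."""
    descriptors = [d for row in feature_map for d in row]
    channels = len(descriptors[0])
    if mode == "sum":
        pooled = [sum(d[c] for d in descriptors) for c in range(channels)]
    elif mode == "max":
        pooled = [max(d[c] for d in descriptors) for c in range(channels)]
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    # L2-normalize so that images can be compared with a dot product.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


# Hypothetical 2x2 feature map with 2 channels.
feature_map = [[[1.0, 0.0], [2.0, 0.0]],
               [[0.0, 0.0], [0.0, 4.0]]]
print(pool_conv_features(feature_map, "sum"))
print(pool_conv_features(feature_map, "max"))
```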

Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng J., Yang F., & Davis L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids, 25K-D vector
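
A bag of local convolutional features can be sketched as follows: every location of the feature map is assigned to its nearest centroid (visual word), and the image is described by the sparse histogram of word counts. The helper names and the tiny three-centroid vocabulary are hypothetical; the slide's setting uses 25K centroids.

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest to the descriptor (squared L2)."""
    def distance(k):
        return sum((a - b) ** 2 for a, b in zip(descriptor, centroids[k]))
    return min(range(len(centroids)), key=distance)


def assignment_map(feature_map, centroids):
    """Replace each local descriptor by the ID of its visual word."""
    return [[nearest_centroid(d, centroids) for d in row] for row in feature_map]


def bow_histogram(assignments):
    """Sparse histogram: only the words that occur are stored."""
    histogram = {}
    for row in assignments:
        for word in row:
            histogram[word] = histogram.get(word, 0) + 1
    return histogram


centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]       # toy vocabulary
feature_map = [[[0.1, 0.0], [0.9, 1.2]],
               [[4.0, 5.0], [1.0, 1.0]]]               # toy 2x2 map
words = assignment_map(feature_map, centroids)
print(words)
print(bow_histogram(words))
```

Keeping the assignment map, and not only the histogram, preserves where each word occurs, so a histogram can also be computed over a sub-region of the image.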

Objects Recognition Retrieval
Query Representation
- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts
Local analysis for Proposals, Detection, Recognition and Segmentation

Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
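
The upsampling and concatenation step can be illustrated with a small sketch: feature maps computed at coarser scales are upsampled to the finest resolution, and the descriptors at each pixel are concatenated into one multiscale descriptor. Nearest-neighbour upsampling and the helper names are assumptions chosen to keep the example short.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every descriptor `factor` times along both axes."""
    upsampled = []
    for row in feature_map:
        wide_row = [d for d in row for _ in range(factor)]
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def concatenate_scales(feature_maps, factors):
    """Bring all maps to the same size, then concatenate per pixel."""
    maps = [upsample_nearest(m, f) for m, f in zip(feature_maps, factors)]
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum((m[y][x] for m in maps), []) for x in range(width)]
            for y in range(height)]


fine = [[[1.0], [2.0]],
        [[3.0], [4.0]]]        # 2x2 map, 1 channel
coarse = [[[9.0, 8.0]]]         # 1x1 map, 2 channels
print(concatenate_scales([fine, coarse], [1, 2]))
```

Each output pixel then carries fine detail from the first map together with the wider context of the coarser one.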

Objects Segmentation Farabet
Pixel-wise soft-max classifier
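
A pixel-wise soft-max classifier turns the class scores at each pixel into a probability distribution and labels the pixel with the most probable class. The sketch below assumes the scores are already available as a nested list (one list of class scores per pixel); the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Return per-pixel class probabilities and the arg-max label map."""
    probabilities = [[softmax(scores) for scores in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probabilities]
    return probabilities, labels


score_map = [[[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]]]  # 1x2 image, 3 classes
probabilities, labels = label_pixels(score_map)
print(labels)
```

Because every pixel is classified independently, neighbouring pixels can receive different labels, which is the lack of spatial consistency addressed next in the lecture.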

Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels
Prediction with a 2-layer network
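
One simple way to enforce consistency with superpixels is to pool the per-pixel class probabilities inside each superpixel and give all of its pixels the winning class. The sketch below assumes that aggregation rule and hypothetical inputs (a probability map and a superpixel ID map of the same size); it does not reproduce the 2-layer network that produces the predictions.

```python
def label_superpixels(probability_map, superpixel_map):
    """Assign to every pixel the dominant class of its superpixel."""
    sums = {}
    for prob_row, sp_row in zip(probability_map, superpixel_map):
        for probs, sp in zip(prob_row, sp_row):
            accumulated = sums.setdefault(sp, [0.0] * len(probs))
            for c, p in enumerate(probs):
                accumulated[c] += p
    # The arg-max of the summed probabilities equals the arg-max of their mean.
    sp_label = {sp: max(range(len(acc)), key=acc.__getitem__)
                for sp, acc in sums.items()}
    return [[sp_label[sp] for sp in row] for row in superpixel_map]


probability_map = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]  # 1x3 image, 2 classes
superpixel_map = [[0, 0, 1]]                                # two superpixels
print(label_superpixels(probability_map, superpixel_map))
```

In this toy case the middle pixel would be labelled 1 on its own, but it takes label 0 from its superpixel.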

Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network

Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
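
The optimal-cover rule can be sketched on a small segmentation tree: for every leaf, walk up to the root and keep the component with the lowest cost S. The tree, the component names and the cost values below are made up for illustration, and the cost function itself is taken as given.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost node on its path to the root.

    parent: dict node -> parent node (the root has no entry).
    cost:   dict node -> cost S of that component.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C7 is the root, C5 = {C1, C2}, C6 = {C3}.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.4, "C5": 0.3, "C6": 0.5, "C7": 0.9}
print(optimal_cover(parent, cost, ["C1", "C2", "C3"]))
```

With these made-up costs, C2 is covered by its parent C5 and therefore takes the class of C5, while C1 and C3 keep their own level.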

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A
- Feature vector 1: BBOX CNN
- Feature vector 2: BBOX CNN, with the background masked out with the mean image
- Concatenation: [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
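
Masking the background with the mean image can be sketched as a per-pixel selection: pixels inside the region keep their value, and pixels outside it are replaced by the corresponding mean-image value, so that they become zero after mean subtraction. The function name and the toy single-channel data are assumptions.

```python
def mask_background(crop, region_mask, mean_image):
    """Keep foreground pixels; replace background pixels by the mean image."""
    return [[pixel if is_foreground else mean
             for pixel, is_foreground, mean in zip(crop_row, mask_row, mean_row)]
            for crop_row, mask_row, mean_row in zip(crop, region_mask, mean_image)]


crop = [[10, 20],
        [30, 40]]
region_mask = [[True, False],
               [False, True]]
mean_image = [[1, 2],
              [3, 4]]
print(mask_background(crop, region_mask, mean_image))
```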

Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B
- Training: 2 networks trained in isolation
- Testing: results are combined
- Feature vector 1: BBOX CNN
- Feature vector 2: REGION CNN
- Concatenation: [1 2]

Vector C
- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)
- Classifier: SVM on the penultimate fully connected layer

Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
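
The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch, assuming label maps stored as nested lists of integer category IDs:

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category`."""
    intersection = union = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for p, t in zip(pred_row, true_row):
            in_pred, in_true = p == category, t == category
            intersection += in_pred and in_true
            union += in_pred or in_true
    # Undefined when the category appears in neither map.
    return intersection / union if union else float("nan")


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(pixel_iu(predicted, ground_truth, category=1))
```

A benchmark score is usually the mean of this value over all categories.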

One lecture organized in four parts
Local analysis for Proposals, Detection, Recognition and Segmentation

Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto

Detection Faces

Detection Faces DDFD
Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. “Multi-view Face Detection Using Deep Convolutional Neural Networks.” ICMR 2015. [software]

Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
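
The labelling rule for training windows can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names are illustrative; only the 50% threshold comes from the slide.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(window, annotated_faces, threshold=0.5):
    """A sampled window is positive if it overlaps any face by more than 50%."""
    return any(box_iou(window, face) > threshold for face in annotated_faces)


faces = [(0, 0, 10, 10)]
print(is_positive((1, 0, 11, 10), faces))   # large overlap
print(is_positive((5, 0, 15, 10), faces))   # IoU of one third
```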

Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
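
Three scales per octave means that consecutive pyramid levels differ by a factor of 2^(1/3), so the image size doubles (or halves) every three levels. The sketch below lists such scale factors; the number of octaves covered in each direction is an assumption, since the slide only states the density of the pyramid.

```python
def pyramid_scales(octaves_down=2, octaves_up=1, per_octave=3):
    """Scale factors of the image pyramid, from smallest to largest."""
    first = -octaves_down * per_octave
    last = octaves_up * per_octave
    return [2.0 ** (k / per_octave) for k in range(first, last + 1)]


def scaled_size(width, height, scale):
    """Size of the rescaled image, rounded to whole pixels."""
    return max(1, round(width * scale)), max(1, round(height * scale))


for scale in pyramid_scales():
    print(round(scale, 3), scaled_size(640, 480, scale))
```

A fixed-size detector run on every level then finds faces of different sizes in the original image.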

Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
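
The equivalence behind this conversion can be shown on a toy example: a fully-connected layer that scores a k x k window is the same operation as a k x k convolution filter, so sliding it over a larger image yields one score per window position, which is the heat-map. The single-channel image, the 2x2 weights and the function names are assumptions for illustration; a real network shares the intermediate computation instead of re-evaluating each window.

```python
def fc_score(window, weights, bias):
    """Fully-connected unit: weighted sum over one k x k window."""
    return sum(w * x
               for w_row, x_row in zip(weights, window)
               for w, x in zip(w_row, x_row)) + bias


def heat_map(image, weights, bias, stride=1):
    """Apply the same weights at every window position (a convolution)."""
    k = len(weights)  # square k x k kernel assumed
    height, width = len(image), len(image[0])
    return [[fc_score([row[x:x + k] for row in image[y:y + k]], weights, bias)
             for x in range(0, width - k + 1, stride)]
            for y in range(0, height - k + 1, stride)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
weights = [[1, 0],
           [0, 1]]
print(heat_map(image, weights, bias=0))
```

The output grows with the input image, which is why no fixed input size is required once the fully-connected layers are expressed as convolutions.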

Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
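
A common greedy form of NMS keeps the highest-scoring detection and discards every remaining detection that overlaps a kept one by more than a threshold. The sketch below is one such variant, not necessarily the one used by the detector in the lecture; the box format (x1, y1, x2, y2) and the 0.3 threshold are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.3):
    """Return the indices of the detections that survive, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_maximum_suppression(boxes, scores))
```

Here the second box overlaps the first too much and is suppressed, while the third box is kept because it does not overlap either of the others.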
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection FacesDDFD
61
Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
62
Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
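A triplet combines an anchor face, a positive (same identity) and a negative (different identity). The sketch below shows a margin-based triplet loss on embeddings, assuming squared Euclidean distances and an illustrative margin of 0.2. The toy 2-D embeddings are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy_negative = [3.0, 0.0]   # far from the anchor: contributes no loss
hard_negative = [0.5, 0.0]   # closer than the positive: contributes loss

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Only triplets like the second one produce a gradient, which is why the choice of triplets matters during training.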
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query
"A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query
"This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features (e.g. SIFT)
Bag of Visual Words: N-dimensional feature space, high-dimensional and highly sparse
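Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the IDs of the images that contain it. The following is a minimal sketch of that data structure. The image IDs, word IDs and the vote-counting ranking are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map every visual word to the set of image IDs in which it occurs.

    image_words: dict of image ID -> iterable of visual word IDs.
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def rank_candidates(index, query_words):
    """Rank database images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 12: [1], 10: [2, 3], 23: [3]}
index = build_inverted_file(database)
ranking = rank_candidates(index, [2, 3])
```

Only the images listed under the query's words are visited, so images that share no word with the query are never touched.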
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
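Sum or max pooling collapses each feature map of a convolutional layer into one number, so a layer with C channels gives a C-dimensional global descriptor. A minimal sketch, assuming the activations are stored as one 2-D list per channel (the toy values are hypothetical, and the cited methods add further steps such as normalization or weighting that are not shown):

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each 2-D feature map into a single value.

    feature_maps: list of channels, each a list of rows of activations.
    mode: "sum" or "max".
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce = sum if mode == "sum" else max
    return [reduce(value for row in channel for value in row)
            for channel in feature_maps]


# Two hypothetical 2x2 feature maps.
feature_maps = [[[1, 2], [3, 4]],
                [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(feature_maps, "sum")
max_descriptor = pool_conv_features(feature_maps, "max")
```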
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids: 25K-D vector
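Each spatial position of the conv5_1 map gives one local feature, which is assigned to its nearest centroid (visual word). Counting the assignments gives a histogram with one bin per centroid. A minimal sketch of that encoding, with two hypothetical 2-D centroids standing in for the 25K learned ones:

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bag_of_words(local_features, centroids):
    """Histogram with one bin per centroid, counting the assigned local features."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


centroids = [[0.0, 0.0], [10.0, 10.0]]                    # hypothetical codebook
local_features = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]     # hypothetical conv features
histogram = bag_of_words(local_features, centroids)
```

With a 42x32 map there are at most 1344 assignments per image, so a 25K-bin histogram is mostly zeros, which is what makes an inverted file practical.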
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are shared by the three convnets:
theta_i = theta_0 (filter weights H_l and biases b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
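A pixel-wise soft-max classifier turns the class scores at each pixel into probabilities and labels the pixel with the most likely class. A minimal sketch, assuming the scores are stored as a nested list `score_map[y][x]` with one value per class (a layout chosen here for illustration):

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Return, for every pixel, the index of its most probable class."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


# One row of two pixels, three hypothetical classes.
score_map = [[[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]]
labels = label_pixels(score_map)
```

Each pixel is labelled independently of its neighbours.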
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
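The selection rule can be sketched as a walk from a leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree shape and the cost values below are hypothetical, chosen so that C2 ends up covered by C5 as in the slide's example.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from `leaf` to the root.

    parent: dict of node -> parent node (the root has no entry).
    cost:   dict of node -> cost S of that component.
    """
    best = node = leaf
    while node in parent:
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7"}
# Hypothetical costs.
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.4, "C4": 0.5,
        "C5": 0.3, "C6": 0.7, "C7": 0.9}

cover = {leaf: optimal_component(leaf, parent, cost)
         for leaf in ("C1", "C2", "C3", "C4")}
```

With these costs, C1 keeps its own component while C2 takes the class of C5, so different pixels can be explained at different levels of the tree.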
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: BBOX CNN feature vectors 1 and 2, concatenated as [1 2]; background masked out with the mean image
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: BBOX CNN and REGION CNN feature vectors 1 and 2, concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
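For one category, pixel IU is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch on hypothetical label maps (returning 0.0 for an absent category is a convention chosen here, not taken from the source):

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    return intersection / union if union else 0.0


# Hypothetical 2x3 label maps: 1 = object, 0 = background.
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = pixel_iu(predicted, ground_truth, category=1)
```

Two pixels are shared and four are labelled as object in at least one map, giving an IU of 0.5.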
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Faces DDFD Train
Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks
Detection Faces DDFD Train
Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise
Total samples: 200K positive and 20M negative
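The labelling rule can be sketched with a small IoU function. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the example coordinates are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def is_positive(window, annotated_faces, threshold=0.5):
    """A sub-window is a positive example if it overlaps some face by more than 50%."""
    return any(iou(window, face) > threshold for face in annotated_faces)


faces = [(0, 0, 10, 10)]                          # one hypothetical annotation
good_window = is_positive((0, 0, 10, 8), faces)   # IoU = 80 / 100
poor_window = is_positive((5, 5, 15, 15), faces)  # IoU = 25 / 175
```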
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
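Three steps per octave means consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch of the resulting scale factors; the range of one octave down to one octave up is an assumption made here for illustration.

```python
def pyramid_scales(min_octave=-1, max_octave=1, steps_per_octave=3):
    """Scale factors from 2**min_octave to 2**max_octave, 3 steps per octave."""
    first = min_octave * steps_per_octave
    last = max_octave * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]


scales = pyramid_scales()
for s in scales:
    print(round(s, 3))
```

The detector keeps a fixed input size, so each rescaled copy of the image exposes faces of a different original size to it.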
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
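A sliding window enumerates every 227x227 crop of the image on a regular grid and scores each crop with the classifier. A minimal sketch of the enumeration; the stride of 32 pixels is an assumption chosen for illustration.

```python
def sliding_windows(width, height, size=227, stride=32):
    """Yield (x1, y1, x2, y2) for every size x size window that fits in the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)


windows = list(sliding_windows(291, 227))
too_small = list(sliding_windows(100, 100))  # no window fits
```

Neighbouring windows share most of their pixels, so scoring each crop separately repeats a lot of computation.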
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Train
63
Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated
face is larger than 50 and negative sample otherwise
Total samples 200K positive and 20M negative
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
64
Test images are rescaled updown 3 times per octave to find different sizes
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró at the ReadCV seminar)
Faces Recognition FaceNet
Faces → FaceNet → Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
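A small sketch of the triplet loss that drives this kind of embedding learning: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The toy 2-D embeddings, the function names and the margin value of 0.2 are assumptions for illustration only, and the selection of "well-chosen" triplets is not covered here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Made-up embeddings: the positive is close to the anchor, the negative is far.
anchor = l2_normalize([1.0, 0.0])
positive = l2_normalize([0.9, 0.1])
negative = l2_normalize([0.0, 1.0])
easy = triplet_loss(anchor, positive, negative)   # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)   # roles swapped: constraint violated
```

Triplets with zero loss contribute no gradient, which is why the choice of triplets matters during training.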
Faces Recognition FaceNet
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Architecture 1 (NN1): ZF
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
INVERTED FILE (columns: word, Image ID)
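Because bag-of-visual-words vectors are highly sparse, an inverted file can index them efficiently: each visual word points to the images that contain it. The sketch below is a toy version under that assumption; the database contents, the function names and the vote-counting score are hypothetical and only meant to show the data structure.

```python
from collections import defaultdict


def build_inverted_file(bow_by_image):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in bow_by_image.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def query(inverted, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy database: image ID -> visual words present in the image.
database = {1: [1, 2], 10: [3, 6], 12: [1, 3]}
index = build_inverted_file(database)
ranking = query(index, [1, 3])
```

Only the images listed under the query's words are ever visited, which is what keeps the search scalable.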
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks
sum/max pooled conv features as global representation
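To illustrate sum and max pooling of convolutional features, the sketch below collapses each feature map (one per channel) into a single number and L2-normalizes the resulting vector. The nested-list layout, the function name and the final normalization are assumptions of this example; the cited papers add further steps (such as weighting or region pooling) that are not reproduced here.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps of size H x W into one C-dimensional vector."""
    reduce_fn = sum if mode == "sum" else max
    pooled = [reduce_fn(v for row in fmap for v in row) for fmap in feature_maps]
    # L2-normalize so that descriptors of different images are comparable.
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Toy activations: 2 channels over a 2 x 2 spatial grid (made-up values).
conv = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[0.0, 3.0], [0.0, 0.0]],
]
sum_vec = pool_conv_features(conv, "sum")
max_vec = pool_conv_features(conv, "max")
```

The output has one dimension per channel regardless of the image size, which is what makes it usable as a global representation.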
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
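A compact sketch of VLAD encoding: every local feature is assigned to its nearest centroid, the residuals (feature minus centroid) are summed per centroid, and the sums are concatenated and normalized. The 2-D features, the two-centroid codebook and the plain L2 normalization are assumptions chosen to keep the example small; they do not reflect the settings of the cited work.

```python
import math


def vlad(descriptors, centroids):
    """Concatenate, per centroid, the summed residuals of its descriptors."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for x in descriptors:
        # Index of the nearest centroid (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
        for d in range(dim):
            residuals[k][d] += x[d] - centroids[k][d]
    flat = [v for r in residuals for v in r]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Made-up 2-D local features and a 2-word codebook.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
code = vlad(descriptors, centroids)
```

The resulting vector has (number of centroids) x (feature dimension) entries, here 2 x 2 = 4.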
Objects Recognition Retrieval
Image resolution (336x256)
conv5_1 from VGG16 [1] (42x32)
25K centroids, 25K-D vector
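A toy sketch of the bag-of-words encoding of convolutional features: each spatial location of the feature map is treated as a local feature, assigned to its nearest centroid (visual word), and the image is described by the sparse histogram of words. With a 42x32 map there would be 1,344 local features per image; the example uses a 2x2 grid, 2-D features and 3 centroids instead of the real sizes, and all names and values are assumptions.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(feature, centroids[k])))


def assignment_map(feature_grid, centroids):
    """Replace every local feature of an H x W grid by its visual word ID."""
    return [[nearest_centroid(f, centroids) for f in row] for row in feature_grid]


def bow_histogram(assignments):
    """Sparse bag of words: visual word ID -> number of occurrences."""
    return Counter(word for row in assignments for word in row)


# Number of local features for the feature map size quoted above.
local_features_per_image = 42 * 32

# Tiny stand-in for the real data: a 2 x 2 grid of 2-D features, 3 centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
grid = [[[0.1, 0.0], [0.9, 1.2]],
        [[1.1, 0.8], [4.0, 6.0]]]
words = assignment_map(grid, centroids)
histogram = bow_histogram(words)
```

Keeping the assignment map (and not only the histogram) preserves where each visual word occurs in the image.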
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are shared by the three convnets
theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
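The last two steps (upsampling and concatenation, then a pixel-wise soft-max) can be sketched on toy data as follows. The nearest-neighbour upsampling, the single linear layer, the weights and all function names are assumptions made for this illustration; they are not the exact operators used by Farabet et al.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an H x W grid of feature vectors."""
    return [[fmap[y // factor][x // factor]
             for x in range(len(fmap[0]) * factor)]
            for y in range(len(fmap) * factor)]


def concat_scales(maps):
    """Concatenate, pixel by pixel, the feature vectors of same-size maps."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum((m[y][x] for m in maps), []) for x in range(w)] for y in range(h)]


fine = [[[1.0], [0.0]], [[0.0], [1.0]]]   # 2 x 2 map, 1 feature per pixel
coarse = [[[2.0]]]                         # 1 x 1 map from a coarser scale
features = concat_scales([fine, upsample_nearest(coarse, 2)])

# Hypothetical linear classifier with 2 classes (one weight row per class).
weights = [[1.0, 0.0], [0.0, 0.25]]
probs = [[softmax([sum(w * f for w, f in zip(row, feat)) for row in weights])
          for feat in line] for line in features]
labels = [[p.index(max(p)) for p in line] for line in probs]
```

Each pixel is classified independently here, which is exactly why the labels may lack spatial consistency.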
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
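One simple way to enforce consistency with superpixels is to average the per-pixel class distributions inside each superpixel and give every pixel of the superpixel the winning class. The sketch below assumes that aggregation rule; the function name, the toy probabilities and the superpixel map are made up for the example.

```python
from collections import defaultdict


def label_superpixels(pixel_probs, superpixel_ids):
    """Average class distributions inside each superpixel, then take the argmax."""
    sums = {}
    counts = defaultdict(int)
    for probs_row, ids_row in zip(pixel_probs, superpixel_ids):
        for probs, sp in zip(probs_row, ids_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for c, p in enumerate(probs):
                acc[c] += p
            counts[sp] += 1
    winner = {sp: max(range(len(acc)), key=lambda c: acc[c] / counts[sp])
              for sp, acc in sums.items()}
    return [[winner[sp] for sp in row] for row in superpixel_ids]


# 2 x 2 image, 2 classes; the top-right pixel alone would be labelled class 1.
pixel_probs = [[[0.9, 0.1], [0.4, 0.6]],
               [[0.2, 0.8], [0.1, 0.9]]]
superpixel_ids = [[0, 0],
                  [1, 1]]
labels = label_superpixels(pixel_probs, superpixel_ids)
```

The isolated class-1 prediction in the top row is overruled by its superpixel, which removes the noisy label.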
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
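The selection rule for the optimal cover can be sketched as a walk from a leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the cost values and the function name below are hypothetical; how the cost S is actually computed is not covered by this example.

```python
def optimal_cover(leaf, parent, cost):
    """Return the node with minimal cost on the path from the leaf to the root."""
    best = node = leaf
    while node in parent:          # the root has no parent entry
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: two leaves (pixels) under component C2, which lies under C5.
parent = {"p1": "C2", "p2": "C2", "C2": "C5"}
cost = {"p1": 0.9, "p2": 0.8, "C2": 0.6, "C5": 0.3}
chosen = {leaf: optimal_cover(leaf, parent, cost) for leaf in ("p1", "p2")}
```

With these made-up costs both pixels select C5, so the region covered by C2 ends up with the class of C5.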
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Vector A = [1 2]: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground
Region foreground: background masked out with the mean image
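Masking the background with the mean image can be sketched as below: pixels outside the region mask are replaced by the mean value, so that they become neutral once the mean is subtracted before the CNN. The per-pixel mean, the toy crop and the function name are assumptions of this example, not values from SDS.

```python
def mask_background(crop, mask, mean_pixel):
    """Keep the foreground pixels of the crop; set background pixels to the mean."""
    return [[pix if fg else mean_pixel for pix, fg in zip(pix_row, fg_row)]
            for pix_row, fg_row in zip(crop, mask)]


# Toy 2 x 2 RGB crop with a made-up mean pixel (not a real training mean).
crop = [[(255, 0, 0), (10, 10, 10)],
        [(250, 5, 5), (20, 20, 20)]]
mask = [[1, 0],
        [1, 0]]
mean_pixel = (124, 117, 104)
masked = mask_background(crop, mask, mean_pixel)
```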
Objects Segmentation SDS
Vector B = [1 2]: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
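The pixel IU (Jaccard index) of a category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch on made-up label maps, with a hypothetical function name:

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += (p == category) and (t == category)
            union += (p == category) or (t == category)
    return inter / union if union else float("nan")


pred = [["person", "person", "bg"],
        ["bg", "person", "bg"]]
truth = [["person", "person", "person"],
         ["bg", "bg", "bg"]]
score = pixel_iu(pred, truth, "person")
```

Here 2 pixels are "person" in both maps and 4 pixels are "person" in at least one, giving an IU of 0.5.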
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Faces DDFD Test
Test images are rescaled up/down 3 times per octave to find faces of different sizes
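A possible way to enumerate the rescaling factors, assuming that "3 times per octave" means consecutive scales differ by a factor of 2^(1/3), so that three steps halve or double the image size. The 0.5 to 2.0 range and the function name are arbitrary choices for the example.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scales spaced by a factor of 2 ** (1 / per_octave), from max down to min."""
    step = 2.0 ** (1.0 / per_octave)
    scales, s = [], max_scale
    while s >= min_scale - 1e-9:   # small tolerance for floating-point drift
        scales.append(round(s, 4))
        s /= step
    return scales


scales = pyramid_scales(min_scale=0.5, max_scale=2.0)
```

Two octaves (from 2.0 down to 0.5) at three scales per octave give seven pyramid levels, counting both ends.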
Detection Faces DDFD Test
Sliding window of 227x227 over the test image
Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
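A sketch of how the 227x227 window positions could be enumerated over an image. The stride of 32 pixels and the image size are assumptions picked for the example; only the window size comes from the slide.

```python
def sliding_windows(width, height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of the window that fit inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


# Hypothetical 291 x 259 test image: 3 horizontal x 2 vertical positions.
boxes = list(sliding_windows(width=291, height=259))
```

Each box would be cropped and classified as face or non-face, which is costly when done one crop at a time.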
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
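The equivalence can be illustrated in one dimension: a fully-connected unit trained on inputs of a fixed length gives one output, while the same weights slid over a longer input as a convolution give one output per position, that is, a heat-map. The 1-D setting, the weights and the function names are simplifications assumed for this sketch.

```python
def fully_connected(window, weights, bias):
    """Output of one fully-connected unit on a fixed-size input."""
    return sum(w * x for w, x in zip(weights, window)) + bias


def as_convolution(signal, weights, bias):
    """Slide the same weights over a longer input, one output per position."""
    n = len(weights)
    return [fully_connected(signal[i:i + n], weights, bias)
            for i in range(len(signal) - n + 1)]


weights, bias = [0.5, -1.0, 0.5], 0.1
fixed_input = [1.0, 2.0, 4.0]               # the size the unit was trained on
longer_input = [1.0, 2.0, 4.0, 8.0, 16.0]   # a larger test input
single = fully_connected(fixed_input, weights, bias)
heat_map = as_convolution(longer_input, weights, bias)
```

The first heat-map value equals the fully-connected output on the original window, and the remaining values come at little extra cost because the weights are shared.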
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
65
Sliding window of 227x227 over the test image
Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
66
Fully-connected layers are converted to convolutional layers which allows processing images from any size
Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query
"A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query
"This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (table: visual word → image IDs)
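A minimal sketch of why sparse bags of visual words pair well with an inverted file. The image IDs, word IDs and function names below are made up for illustration, and ranking by the number of shared words is a simplifying assumption (practical systems typically weight words, e.g. with tf-idf). Each visual word points to the images that contain it, so a query only touches the lists of its own words.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def search(inverted, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Highest vote first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [3, 7, 7, 12], 2: [7, 40], 3: [5, 12, 40]}
inverted_file = build_inverted_file(database)
ranking = search(inverted_file, [7, 12])
```

Images that share no word with the query never appear in the ranking, which is what keeps the search scalable.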
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
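A minimal sketch of sum and max pooling of convolutional features into a global descriptor, assuming the feature maps are given as nested lists indexed [channel][row][column]. The function names and toy activations are illustrative assumptions; the cited methods add further steps (weighting, whitening, region pooling) that are not shown here.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel's spatial map into one value (sum or max)."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in channel for value in row)
            for channel in feature_maps]


def l2_normalize(vector):
    """Scale a descriptor to unit length so dot products act as cosine similarity."""
    norm = sum(value * value for value in vector) ** 0.5
    return [value / norm for value in vector] if norm > 0 else list(vector)


# Toy activations (hypothetical): 2 channels over a 2x2 spatial grid.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 5.0]],
]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
unit_descriptor = l2_normalize(max_descriptor)
```

Either way the descriptor has one dimension per channel, regardless of the spatial size of the input image.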
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition Retrieval
Resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
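A minimal sketch of the encoding step suggested by this slide, assuming each spatial location of the conv5_1 map yields one local feature that is assigned to its nearest centroid. The tiny 2-D features, the three centroids and the function names are illustrative assumptions standing in for the real 42x32 grid and 25K centroids. The result is a sparse histogram of visual words.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )


def bag_of_words(local_features, centroids):
    """Sparse histogram: visual word index -> number of local features assigned."""
    return Counter(nearest_centroid(f, centroids) for f in local_features)


# Hypothetical codebook and local features (one per spatial location).
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
local_features = [[1.0, 1.0], [9.0, 9.0], [0.0, 1.0], [1.0, 9.0]]
assignments = [nearest_centroid(f, centroids) for f in local_features]
histogram = bag_of_words(local_features, centroids)
```

Keeping the per-location assignments, and not only the histogram, is what would allow a query to be described either globally or from a local region.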
Objects Recognition Retrieval
Query Representation
- Global Search (GS)
- Local Search (LS)
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = (filter weights (H_l) and biases (b_l))
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
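A minimal sketch of the last two steps, assuming nearest-neighbour upsampling (the slides do not state the interpolation used) and feature maps stored as nested lists indexed [row][column][feature]. All names and toy values are illustrative assumptions. Coarse-scale features are brought to the finest resolution, concatenated per pixel, and each pixel's class scores are turned into probabilities with a soft-max.

```python
import math


def upsample_nearest(feature_map, factor):
    """Repeat every location `factor` times along both spatial axes."""
    height, width = len(feature_map), len(feature_map[0])
    return [[feature_map[y // factor][x // factor] for x in range(width * factor)]
            for y in range(height * factor)]


def concatenate_scales(maps):
    """Concatenate per-pixel feature lists from maps of equal spatial size."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum((m[y][x] for m in maps), []) for x in range(width)]
            for y in range(height)]


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


fine = [[[1.0], [2.0]], [[3.0], [4.0]]]   # 2x2 map, 1 feature per pixel
coarse = [[[9.0]]]                        # 1x1 map from a coarser scale
stacked = concatenate_scales([fine, upsample_nearest(coarse, 2)])

probabilities = softmax([2.0, 0.0, 0.0])  # hypothetical class scores for one pixel
label = probabilities.index(max(probabilities))
```

Because every pixel is classified on its own, nothing in this step forces neighbouring pixels to agree.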
Objects Segmentation Farabet
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
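A minimal sketch of the optimal-cover selection rule, assuming the segmentation tree is given as a child-to-parent dictionary and that a cost S is already available for every node. The tree, the node names and the cost values are made up for illustration and do not reproduce the figure on the slide. For each leaf, the path to the root is walked and the node with minimal cost is kept.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from `leaf` to the root."""
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)  # the root maps to None, which ends the walk
    return best


# Hypothetical tree: pixels p1..p3 are leaves, R is the root.
parent = {"p1": "A", "p2": "A", "p3": "B", "A": "R", "B": "R", "R": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "A": 0.2, "B": 0.6, "R": 0.4}

cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ("p1", "p2", "p3")}
```

Here p1 and p2 take the label of component A, while p3 is better explained by the root, so different pixels end up observed at different levels of the tree.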
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; both are concatenated as [1 2]
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
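A minimal sketch of the masking and concatenation described for vector A, assuming single-channel images stored as nested lists and a binary region mask. The function names, the toy crop and the two placeholder feature vectors are illustrative assumptions; no real CNN is run here. Pixels outside the region are replaced by the mean image before the crop would be fed to the network.

```python
def mask_background(image, mask, mean_image):
    """Keep pixels inside the region; replace the rest with the mean image."""
    return [
        [image[y][x] if mask[y][x] else mean_image[y][x]
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def concatenate_features(box_features, region_features):
    """Build the joint descriptor [1 2] from the two feature vectors."""
    return list(box_features) + list(region_features)


# Toy 2x2 crop (hypothetical values).
image = [[5, 6], [7, 8]]
mask = [[1, 0], [0, 1]]
mean_image = [[100, 100], [100, 100]]
masked = mask_background(image, mask, mean_image)

# Placeholder feature vectors standing in for CNN outputs.
joint = concatenate_features([0.1, 0.2], [0.3, 0.4])
```

After mean subtraction the masked pixels become zero, so the network sees only the region foreground.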
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer → SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
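A minimal sketch of the pixel IU (Jaccard index) for one class, assuming predicted and ground-truth label maps of equal size stored as nested lists. The function name and the toy label maps are illustrative assumptions. The score is the number of pixels labelled with the class in both maps divided by the number labelled with it in at least one.

```python
def pixel_iu(prediction, ground_truth, category):
    """Intersection over union of the pixels labelled `category`."""
    intersection = 0
    union = 0
    for predicted_row, truth_row in zip(prediction, ground_truth):
        for predicted, truth in zip(predicted_row, truth_row):
            in_prediction = predicted == category
            in_truth = truth == category
            intersection += in_prediction and in_truth
            union += in_prediction or in_truth
    # Undefined when the class is absent from both maps.
    return intersection / union if union else float("nan")


# Toy label maps (hypothetical): 0 = background, 1 = object.
prediction = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
iu_object = pixel_iu(prediction, ground_truth, 1)
iu_background = pixel_iu(prediction, ground_truth, 0)
```

Averaging this value over all categories gives a single number per system.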
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Faces DDFD Test
Fully-connected layers are converted to convolutional layers, which allows processing images of any size
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
Detection Faces DDFD Test
This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
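A minimal 1-D sketch of why a fully-connected layer can be read as a convolution. The weights, bias and input signals are made-up values and the function names are illustrative assumptions. A layer trained on inputs of a fixed length k gives one score; sliding the same weights over a longer input gives one score per position, which is the heat-map.

```python
def fully_connected(window, weights, bias):
    """Score of a fixed-size input: dot product plus bias."""
    return sum(w * x for w, x in zip(weights, window)) + bias


def as_convolution(signal, weights, bias):
    """Apply the same weights at every valid position of a longer input."""
    size = len(weights)
    return [fully_connected(signal[start:start + size], weights, bias)
            for start in range(len(signal) - size + 1)]


weights = [1.0, -1.0, 2.0]   # hypothetical trained weights for inputs of length 3
bias = 0.5

single_score = fully_connected([1.0, 2.0, 3.0], weights, bias)
heat_map = as_convolution([1.0, 2.0, 3.0, 0.0, 1.0], weights, bias)
```

On an input of the training size the convolution returns exactly the fully-connected score; on a larger input it returns a map of scores without re-running the network window by window.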
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
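A minimal sketch of greedy non-maximum suppression, assuming boxes given as (x1, y1, x2, y2) with positive area, one score per box, and an IoU threshold of 0.5. The function names, the threshold and the toy detections are illustrative assumptions, and the overlap measure may differ from the one used in the cited source. Detections are visited from highest to lowest score and kept only if they do not overlap too much with one already kept.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


def non_maximum_suppression(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept, in decreasing score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for index in order:
        if all(iou(boxes[index], boxes[k]) <= threshold for k in kept):
            kept.append(index)
    return kept


# Toy detections (hypothetical): two overlapping faces and one separate face.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.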
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
67
This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Test
68
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks: sum/max-pooled conv features as global representation
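A minimal sketch of sum and max pooling over a convolutional feature map, assuming a (channels, height, width) NumPy array. The final L2 normalisation and the function name `pool_conv_features` are assumptions made for the example; the cited papers add their own weighting and post-processing, which is not reproduced here.

```python
import numpy as np


def pool_conv_features(feature_map, mode="sum"):
    """Collapse a (C, H, W) conv feature map into one L2-normalised C-D vector."""
    if mode == "sum":
        pooled = feature_map.sum(axis=(1, 2))
    elif mode == "max":
        pooled = feature_map.max(axis=(1, 2))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


rng = np.random.default_rng(0)
fmap = rng.random((512, 32, 42))   # random stand-in for real conv activations
sum_desc = pool_conv_features(fmap, "sum")
max_desc = pool_conv_features(fmap, "max")
```

Either way the image is described by a single vector with one value per channel, so two images can be compared with a dot product.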
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks: conv features encoded with VLAD as global representation
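As a rough sketch of VLAD encoding: each local feature is assigned to its nearest centroid, the residuals to that centroid are summed, and the per-centroid sums are concatenated. The toy features and centroids below are hypothetical, and a single global L2 normalisation is assumed; this is not the exact pipeline of the cited paper.

```python
import numpy as np


def vlad_encode(local_features, centroids):
    """local_features: (N, D); centroids: (K, D). Returns a (K*D,) vector."""
    # Squared distances between every feature and every centroid: (N, K).
    d2 = ((local_features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    k, d = centroids.shape
    vlad = np.zeros((k, d))
    for idx in range(k):
        members = local_features[nearest == idx]
        if len(members):
            # Sum of residuals between the assigned features and the centroid.
            vlad[idx] = (members - centroids[idx]).sum(axis=0)
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
centroids = np.array([[2.0, 0.0], [0.0, 4.0]])
code = vlad_encode(features, centroids)
```

In the toy data the two features around the first centroid cancel out, so only the residual of the third feature survives in the code.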
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids: 25K-D vector
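The step from a conv feature map to a bag-of-words vector can be sketched as follows. Assumptions for the example: random numbers stand in for real activations and for a learned codebook, only 100 centroids are used instead of the 25K on the slide, 512 is the channel count of the VGG16 conv5 layers, and the function names are hypothetical.

```python
import numpy as np


def assignment_map(feature_map, centroids):
    """feature_map: (C, H, W); centroids: (K, C). Returns an (H, W) map of word IDs."""
    c, h, w = feature_map.shape
    local = feature_map.reshape(c, h * w).T   # (H*W, C): one descriptor per location
    # ||x - m||^2 = ||x||^2 - 2 x.m + ||m||^2, avoiding an (N, K, C) temporary.
    d2 = ((local ** 2).sum(axis=1)[:, None]
          - 2.0 * local @ centroids.T
          + (centroids ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1).reshape(h, w)


def bag_of_words(word_map, vocabulary_size):
    """Histogram of visual words: one bin per centroid."""
    return np.bincount(word_map.ravel(), minlength=vocabulary_size)


rng = np.random.default_rng(0)
fmap = rng.random((512, 32, 42))      # stand-in for a 42x32 conv5_1 map
centroids = rng.random((100, 512))    # 100 centroids here; the slide uses 25K
words = assignment_map(fmap, centroids)
bow = bag_of_words(words, len(centroids))
```

With 25K centroids and only 42x32 locations per image, most bins of the histogram stay at zero, which is what makes the representation sparse enough for an inverted file.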
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Pyramid of three spatial scales
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
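The upsampling, concatenation and pixel-wise soft-max steps can be illustrated with a small NumPy sketch. Everything concrete here is an assumption for the example: random arrays stand in for the convnet outputs at the three scales, nearest-neighbour upsampling is used for simplicity, and a single random linear layer replaces the trained classifier.

```python
import numpy as np


def upsample_nearest(feature_map, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return feature_map.repeat(factor, axis=1).repeat(factor, axis=2)


def pixelwise_softmax(scores):
    """scores: (num_classes, H, W) -> per-pixel class probabilities."""
    shifted = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
# Stand-ins for the feature maps at full, 1/2 and 1/4 resolution (16 channels each).
f1 = rng.standard_normal((16, 8, 8))
f2 = rng.standard_normal((16, 4, 4))
f3 = rng.standard_normal((16, 2, 2))
features = np.concatenate(
    [f1, upsample_nearest(f2, 2), upsample_nearest(f3, 4)], axis=0
)

num_classes = 5
# Hypothetical linear classifier applied independently at every pixel.
weights = rng.standard_normal((num_classes, features.shape[0]))
scores = np.einsum("kc,chw->khw", weights, features)
probs = pixelwise_softmax(scores)
labels = probs.argmax(axis=0)
```

Because every pixel is classified on its own, nothing in this step forces neighbouring pixels to agree on a label.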
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Prediction with a 2-layer network
Solution 1: Superpixels
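One way to picture the superpixel solution is to average the per-pixel class distributions inside each superpixel and give every pixel of the superpixel the winning class. This is a sketch under that assumption, with a hypothetical 2x2 image and two classes; the function name is made up for the example.

```python
import numpy as np


def label_superpixels(probs, superpixels):
    """probs: (K, H, W) class probabilities; superpixels: (H, W) integer IDs."""
    labels = np.empty(superpixels.shape, dtype=int)
    for sp_id in np.unique(superpixels):
        mask = superpixels == sp_id
        mean_dist = probs[:, mask].mean(axis=1)   # average distribution in the superpixel
        labels[mask] = mean_dist.argmax()
    return labels


class0 = np.array([[0.9, 0.4], [0.2, 0.6]])
probs = np.stack([class0, 1.0 - class0])
superpixels = np.array([[0, 0], [1, 1]])

per_pixel = probs.argmax(axis=0)                  # noisy, pixel-by-pixel decision
smoothed = label_superpixels(probs, superpixels)
```

In the toy case the per-pixel decision flips inside each row, while the superpixel vote assigns one label per row.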
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
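The optimal cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost. The tree, the component names and the cost values below are hypothetical; they are chosen so that C2 ends up with C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on its path to the root.

    parent: dict node -> parent node (the root maps to None)
    cost:   dict node -> cost S of that component
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best


# Hypothetical tree: C5 = {C2, C3}, C6 = {C1, C5}, C7 = {C6, C4} is the root.
parent = {"C1": "C6", "C2": "C5", "C3": "C5", "C4": "C7",
          "C5": "C6", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.4, "C4": 0.3,
        "C5": 0.1, "C6": 0.6, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so different pixels can be labelled at different levels of the tree.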
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; both are concatenated as [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
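Masking the background with the mean image amounts to a per-pixel selection between the crop and the mean image. The sketch below assumes NumPy arrays in (height, width, channel) layout and a boolean foreground mask; the constant toy values and the function name are hypothetical.

```python
import numpy as np


def mask_background(crop, region_mask, mean_image):
    """Keep foreground pixels of `crop`; replace the rest with the mean image.

    crop, mean_image: (H, W, 3) arrays; region_mask: (H, W) boolean, True = foreground.
    """
    return np.where(region_mask[..., None], crop, mean_image)


crop = np.full((2, 2, 3), 200.0)
mean_image = np.full((2, 2, 3), 100.0)
region_mask = np.array([[True, False], [False, False]])

masked = mask_background(crop, region_mask, mean_image)
# After the usual mean subtraction the background becomes exactly zero.
centered = masked - mean_image
```

This is why the mean image is a natural fill value: once the network input is mean-centred, the masked background contributes nothing.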
Objects Segmentation SDS
Vector B: the BBOX CNN provides feature vector 1 and a separate REGION CNN provides feature vector 2; both are concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined
Objects Segmentation SDS
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer used as input to an SVM
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
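The pixel IU (Jaccard index) of a class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. A small sketch on hypothetical 2x2 label maps, with a made-up helper name:

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union between two (H, W) label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        # A class absent from both maps has no defined IU.
        ious.append(inter / union if union > 0 else float("nan"))
    return ious


pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
ious = pixel_iu(pred, gt, 2)
```

Averaging the per-class values gives a single score that penalises both missed pixels and false alarms.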
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Detection Faces DDFD Test
Non-Maximum Suppression (NMS) to avoid overlapped detections
Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
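A minimal greedy NMS sketch, assuming boxes in (x1, y1, x2, y2) format and an intersection-over-union threshold of 0.5. It is a generic IoU-based variant written for illustration, not the code of the cited post or of DDFD, and the boxes and scores are toy values.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)
```

The second box overlaps the first almost entirely and is suppressed, while the distant third box survives.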
Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training
Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)
FaceNet maps faces to a Euclidean space where distances correspond to face similarity
Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
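Learning an embedding from triplets can be illustrated with a triplet loss: the anchor should be closer to a face of the same identity (positive) than to a face of another identity (negative) by some margin. The sketch below assumes squared Euclidean distances and a margin of 0.2 as an example value; the 2-D embeddings are toy data, and the selection of well-chosen triplets is not shown.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean of max(0, ||a - p||^2 - ||a - n||^2 + margin) over a batch of embeddings."""
    d_pos = ((anchor - positive) ** 2).sum(axis=-1)
    d_neg = ((anchor - negative) ** 2).sum(axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()


anchor = np.array([[1.0, 0.0]])
same_identity = np.array([[1.0, 0.0]])
other_identity = np.array([[0.0, 1.0]])

easy = triplet_loss(anchor, same_identity, other_identity)       # constraint satisfied
violated = triplet_loss(anchor, other_identity, same_identity)   # roles swapped
```

A triplet that already satisfies the margin contributes zero loss, which is why the choice of triplets matters for training.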
Faces Recognition FaceNet
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)
Architecture 1 (NN1): ZF
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
69
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Detection Faces DDFD Results
70
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during
training
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
71
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72
Faces Recognition FaceNet
Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015
(Extended summary slides by Xavier Giro on the ReadCV seminar)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73
Faces Recognition FaceNet
FacesEuclidean space where distances correspond to face similarity
FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network

Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network

Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
In the example, C2 will be labelled with the class of C5
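The optimal cover rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree shape and the cost values below are hypothetical, chosen only so that C2 ends up covered by C5 as on the slide; for brevity the leaves are components rather than individual pixels.

```python
def optimal_component(leaf, parent, cost):
    """Return the lowest-cost component on the path from a leaf to the root."""
    best = leaf
    node = parent[leaf]
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent[node]
    return best

# Hypothetical tree: C5 = C1 + C2, C6 = C3 + C4, C7 = C5 + C6 (root).
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
# Hypothetical costs S (lower means a better observation level).
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.2, "C4": 0.1,
        "C5": 0.3, "C6": 0.4, "C7": 0.8}

cover = {leaf: optimal_component(leaf, parent, cost)
         for leaf in ("C1", "C2", "C3", "C4")}
```

Here C1, C3 and C4 keep their own level, while C2 is best observed at the level of C5 and therefore takes the class predicted for C5.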
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; both are concatenated as [1 2]
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
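Masking the background with the mean image can be sketched as below: pixels inside the region keep their values, and every other pixel of the crop is replaced by the corresponding mean-image pixel, so that it becomes zero after mean subtraction. The array shapes, values and function name are assumptions for illustration.

```python
import numpy as np

def mask_background(crop, region_mask, mean_image):
    """Keep region pixels and replace the background with the mean image."""
    return np.where(region_mask[..., None], crop, mean_image)

# Toy 2x2 RGB crop, its region mask and a constant mean image (hypothetical).
crop = np.full((2, 2, 3), 200.0)
region_mask = np.array([[True, False], [False, True]])
mean_image = np.full((2, 2, 3), 120.0)
masked = mask_background(crop, region_mask, mean_image)
```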
Objects Segmentation SDS
Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2]
Training: 2 networks trained in isolation
Testing: results are combined

Objects Segmentation SDS
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS
Penultimate fully connected layer, followed by an SVM

Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
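Pixel IU (intersection over union, or Jaccard index) for one category can be computed from a predicted and a ground-truth label map as sketched below. The label maps are toy values, and returning NaN for a category that appears in neither map is a design choice of this sketch.

```python
import numpy as np

def pixel_iu(pred, gt, category):
    """Intersection over union of the pixels labelled with one category."""
    pred_c = pred == category
    gt_c = gt == category
    union = np.logical_or(pred_c, gt_c).sum()
    if union == 0:
        return float("nan")  # category absent from both maps
    return np.logical_and(pred_c, gt_c).sum() / union

# Toy label maps with categories 0 and 1.
gt = np.array([[1, 1, 0],
               [1, 0, 0]])
pred = np.array([[1, 0, 0],
                 [1, 1, 0]])
iu_cat1 = pixel_iu(pred, gt, category=1)  # 2 shared pixels out of 4 in the union
```

A benchmark score is usually the mean of this value over all categories.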
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto

Detection Faces DDFD Results
Precision vs Recall Curves
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015
(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet
FaceNet maps faces to a Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet
End-to-end learning of an embedding (distance metric learning)
Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244

Faces Recognition FaceNet
by means of well chosen triplets using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
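Learning from triplets can be illustrated with the usual triplet loss: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. This is a minimal sketch on toy 2-D embeddings; the margin value and the embeddings are illustrative assumptions, and the selection of well chosen triplets is not covered here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: want d(a, p) + margin <= d(a, n)."""
    pos_dist = ((anchor - positive) ** 2).sum(axis=1)
    neg_dist = ((anchor - negative) ** 2).sum(axis=1)
    return np.maximum(pos_dist - neg_dist + margin, 0.0)

# Two toy triplets of unit-norm embeddings (hypothetical values).
anchor = np.array([[1.0, 0.0], [1.0, 0.0]])
positive = np.array([[1.0, 0.0], [0.0, 1.0]])
negative = np.array([[0.0, 1.0], [1.0, 0.0]])
losses = triplet_loss(anchor, positive, negative)
```

The first triplet already satisfies the margin and contributes zero loss; the second violates it and is the kind of triplet that still drives learning.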
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software
Software implementation: OpenFace

Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]

Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016

Objects Recognition Retrieval
Image Database
Visual Query: “A dog”
Expected outcome

Objects Recognition Retrieval
Image Database
Visual Query: “This dog”
Expected outcome

Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words
N-dimensional feature space: high-dimensional, highly sparse
INVERTED FILE
word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
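An inverted file maps each visual word to the IDs of the images that contain it, so a query only touches images sharing at least one word with it. The sketch below uses a hypothetical toy database and ranks candidates by the number of shared words; real systems typically add weighting such as tf-idf, which is left out here.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def rank_candidates(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 4], 12: [1, 3], 30: [2]}
index = build_inverted_file(image_words)
ranking = rank_candidates(index, query_words=[1, 3])
```

Image 30 shares no word with the query, so it is never visited, which is the benefit of a sparse representation.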
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)

Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014

Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
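Sum or max pooling collapses the spatial dimensions of a conv feature map into a single descriptor with one value per channel. The sketch below shows only the plain pooling step followed by L2 normalisation; the weighting and post-processing used in the cited papers are not reproduced, and the toy feature map is hypothetical.

```python
import numpy as np

def global_descriptor(feature_map, mode="sum"):
    """Pool an (H, W, D) conv feature map into one L2-normalised D-dim vector."""
    h, w, d = feature_map.shape
    local = feature_map.reshape(h * w, d)
    pooled = local.sum(axis=0) if mode == "sum" else local.max(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Toy 2x2 map with 2 channels.
fmap = np.array([[[1.0, 0.0], [2.0, 0.0]],
                 [[0.0, 3.0], [0.0, 1.0]]])
sum_desc = global_descriptor(fmap, mode="sum")  # pooled (3, 4) before normalisation
max_desc = global_descriptor(fmap, mode="max")  # pooled (2, 3) before normalisation
```

Two images can then be compared with a dot product between their descriptors.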
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
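VLAD encoding can be sketched as follows: each local conv feature is assigned to its nearest centroid, the residuals (feature minus centroid) are summed per centroid, and the concatenated sums are L2-normalised. This is a bare-bones sketch on toy 2-D features; the extra normalisation steps and dimensionality reduction commonly used with VLAD are left out, and all values are hypothetical.

```python
import numpy as np

def vlad(local, centroids):
    """Encode (N, D) local features as a K*D vector of summed residuals."""
    k, d = centroids.shape
    dists = ((local[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    desc = np.zeros((k, d))
    for i in range(k):
        members = local[nearest == i]
        if len(members):
            desc[i] = (members - centroids[i]).sum(axis=0)
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

# Toy example: 3 local features, 2 centroids, 2 dimensions.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
local = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]])
descriptor = vlad(local, centroids)
```

Unlike a bag of words, the result keeps where features fall relative to each centroid, at the price of a dense K*D vector.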
Faces Recognition: FaceNet
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.
(Extended summary slides by Xavier Giró on the ReadCV seminar)
Faces Recognition: FaceNet
Faces -> FaceNet -> Euclidean space where distances correspond to face similarity
Faces Recognition: FaceNet
End-to-end learning of an embedding (distance metric learning)...
Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." Journal of Machine Learning Research 10 (2009): 207-244.
Faces Recognition: FaceNet
...by means of well-chosen triplets, using curriculum learning.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
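The slides name triplets but do not spell out the objective. A minimal sketch of the hinge-style triplet loss usually associated with FaceNet follows, assuming squared Euclidean distances between embeddings. The function names and the margin value are illustrative assumptions, not taken from the slides.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is at least `margin`
    farther from the anchor than the positive (margin is an assumed value)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 1.0]
positive = [0.1, 0.9]   # same identity, close to the anchor
negative = [1.0, 0.0]   # different identity, far from the anchor

easy = triplet_loss(anchor, positive, negative)   # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)   # roles swapped: violated
```

Triplets whose loss is already zero contribute no gradient, which is one reason the choice of triplets matters during training.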
Faces Recognition: FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition: FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition: FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition: FaceNet Software
Software implementation: OpenFace
Faces Recognition: VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition: Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.
Objects Recognition: Retrieval
Visual Query: "A dog"
Image Database -> Expected outcome
Objects Recognition: Retrieval
Visual Query: "This dog"
Image Database -> Expected outcome
Objects Recognition: Retrieval
Instance Retrieval (instance: object, building, person, place...)
Objects Recognition: Retrieval
Local hand-crafted features (e.g. SIFT): v1 = (v11, ..., v1n), ..., vk = (vk1, ..., vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE: word -> image IDs
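Because the Bag of Visual Words vector is highly sparse, it is usually stored as an inverted file. The sketch below illustrates the idea with hypothetical image IDs and visual words. The function names and the simple vote-count ranking are assumptions made for illustration, not the scoring used in the cited work.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_candidates(inverted, query_words):
    """Rank images by the number of visual words they share with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 6], 12: [1, 3], 30: [2]}
inverted = build_inverted_file(image_words)
ranking = rank_candidates(inverted, query_words=[1, 3])
```

Only images that share at least one word with the query are visited, which is what makes the search scale to large databases.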
Objects Recognition: Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Objects Recognition: Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Objects Recognition: Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
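As a rough illustration of sum and max pooling of convolutional features, the sketch below collapses each channel of a feature map into one number and L2-normalizes the result. The function name, the toy activations and the normalization step are assumptions. The cited papers add their own weighting and post-processing.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel (an H x W grid) into one value, then L2-normalize.

    feature_maps: list of channels, each a list of rows of activations.
    mode: "sum" or "max" pooling over the spatial positions.
    """
    aggregate = sum if mode == "sum" else max
    pooled = [aggregate(v for row in channel for v in row) for channel in feature_maps]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Two hypothetical 2x2 channels of non-negative activations.
maps = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[0.0, 3.0], [0.0, 0.0]],
]
sum_descriptor = pool_conv_features(maps, "sum")
max_descriptor = pool_conv_features(maps, "max")
```

The descriptor has one dimension per channel, so its length does not depend on the image resolution.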
Objects Recognition: Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Objects Recognition: Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32 feature map
25K centroids -> 25K-D vector
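The step from a grid of local convolutional features to a bag-of-words vector can be sketched as follows, assuming each spatial position is assigned to its nearest centroid. The three 2-D centroids and the 2x2 grid are toy values standing in for the 25K centroids and the 42x32 map. All names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to `feature` (squared Euclidean distance)."""
    def distance(k):
        return sum((f - c) ** 2 for f, c in zip(feature, centroids[k]))
    return min(range(len(centroids)), key=distance)


def assignment_map(feature_grid, centroids):
    """Replace every local feature in the H x W grid by its visual word."""
    return [[nearest_centroid(f, centroids) for f in row] for row in feature_grid]


def bow_histogram(assignments, num_words):
    """Count how often each visual word occurs in the assignment map."""
    hist = [0] * num_words
    for row in assignments:
        for word in row:
            hist[word] += 1
    return hist


centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
grid = [
    [[0.1, 0.0], [0.9, 1.0]],
    [[0.0, 0.8], [1.1, 0.9]],
]
amap = assignment_map(grid, centroids)
hist = bow_histogram(amap, len(centroids))
```

Keeping the assignment map, and not only the histogram, makes it possible to build a descriptor for any sub-window of the image without recomputing features.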
Objects Recognition: Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation: Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation: Farabet
Pyramid of three spatial scales
Objects Segmentation: Farabet
The same parameters in the three convnets: theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}
Non-linearity: tanh
Pooling: max
Objects Segmentation: Farabet
Upsampling and concatenation
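A minimal sketch of the upsampling and concatenation step, assuming each coarser scale has half the resolution of the previous one and that nearest-neighbour upsampling is used. Both are assumptions for illustration. The function names and the toy feature values are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Repeat every cell of an H x W grid `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [vec for vec in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def concat_scales(maps_by_scale):
    """Bring all scales to the finest resolution and concatenate per pixel.

    maps_by_scale[s] is assumed to have resolution (H / 2**s) x (W / 2**s).
    """
    upsampled = [upsample_nearest(m, 2 ** s) for s, m in enumerate(maps_by_scale)]
    height, width = len(upsampled[0]), len(upsampled[0][0])
    return [
        [sum((u[y][x] for u in upsampled), []) for x in range(width)]
        for y in range(height)
    ]


# Toy pyramid: 4x4, 2x2 and 1x1 grids with one feature per position.
fine = [[[y * 4 + x] for x in range(4)] for y in range(4)]
mid = [[[10], [11]], [[12], [13]]]
coarse = [[[99]]]
dense = concat_scales([fine, mid, coarse])
```

Every pixel ends up with one feature vector that mixes fine detail with wider context from the coarser scales.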
Objects Segmentation: Farabet
Pixel-wise soft-max classifier
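A pixel-wise soft-max can be sketched as below: the class scores at each pixel are turned into a probability distribution independently of the neighbouring pixels. The function names and the toy scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_softmax(score_map):
    """Apply the soft-max independently at every pixel of an H x W score map."""
    return [[softmax(pixel) for pixel in row] for row in score_map]


def label_map(prob_map):
    """Pick the most probable class at every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in prob_map]


# One row of two pixels, three classes each (hypothetical scores).
scores = [[[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]]
probs = pixelwise_softmax(scores)
labels = label_map(probs)
```

Since each pixel is classified on its own, nothing in this step enforces agreement between neighbouring labels.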
Objects Segmentation: Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation: Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
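One common way to use superpixels for spatial consistency is to pool the per-pixel class probabilities inside each superpixel and give all of its pixels the winning class. The sketch below assumes that rule. The function name and the toy data are hypothetical.

```python
def label_superpixels(prob_map, superpixel_ids):
    """Assign to every pixel the class with the largest summed probability
    inside its superpixel (equivalent to averaging, then taking the arg max)."""
    sums = {}
    for prob_row, id_row in zip(prob_map, superpixel_ids):
        for probs, sp in zip(prob_row, id_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for c, p in enumerate(probs):
                acc[c] += p
    winner = {sp: max(range(len(acc)), key=acc.__getitem__) for sp, acc in sums.items()}
    return [[winner[sp] for sp in row] for row in superpixel_ids]


# One row of four pixels, two classes. The second pixel alone would vote
# for class 1, but its superpixel (ID 0) is dominated by class 0.
prob_map = [[[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7]]]
superpixel_ids = [[0, 0, 0, 1]]
smoothed = label_superpixels(prob_map, superpixel_ids)
```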
Objects Segmentation: Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation: Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
Example: C2 will be labelled with the class of C5
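The optimal cover rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the costs and the class names below are hypothetical and only chosen so that C2 inherits the class of C5, as in the slide's example.

```python
def optimal_cover(leaves, parent, cost, label):
    """For each leaf, return the minimal-cost node on its path to the root
    together with that node's class."""
    result = {}
    for leaf in leaves:
        best = node = leaf
        while node in parent:          # the root has no parent entry
            node = parent[node]
            if cost[node] < cost[best]:
                best = node
        result[leaf] = (best, label[best])
    return result


# Hypothetical tree: C1 and C2 merge into C5; C5 and C3 merge into the root C6.
parent = {"C1": "C5", "C2": "C5", "C5": "C6", "C3": "C6"}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.3, "C5": 0.1, "C6": 0.6}
label = {"C1": "sky", "C2": "tree", "C3": "road", "C5": "sky", "C6": "road"}

cover = optimal_cover(["C1", "C2", "C3"], parent, cost, label)
```

Each pixel thus chooses its own observation level, so no single segmentation scale has to be fixed in advance.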
Objects Segmentation: SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation: SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation: SDS
BBOX CNN on the bounding box -> feature vector 1
BBOX CNN on the region foreground (background masked out with the mean image) -> feature vector 2
[1 2] -> vector A
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
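Masking the background with the mean image can be sketched as follows, assuming a grayscale crop, a binary region mask and a mean image of the same size. The function name and the pixel values are hypothetical.

```python
def mask_background(crop, mask, mean_image):
    """Keep the pixels inside the region and replace the rest by the mean image."""
    return [
        [pixel if inside else mean for pixel, inside, mean in zip(c_row, m_row, mean_row)]
        for c_row, m_row, mean_row in zip(crop, mask, mean_image)
    ]


crop = [[10, 20], [30, 40]]       # toy 2x2 bounding-box crop
mask = [[1, 0], [0, 1]]           # 1 = region foreground, 0 = background
mean_image = [[5, 5], [5, 5]]     # toy dataset mean
region_input = mask_background(crop, mask, mean_image)
```

After mean subtraction the masked pixels become zero, so the network input carries only the region foreground.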
Objects Segmentation: SDS
Training: 2 networks trained in isolation
Testing: results are combined
BBOX CNN -> feature vector 1
REGION CNN -> feature vector 2
[1 2] -> vector B
Objects Segmentation: SDS
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer) -> vector C
Objects Segmentation: SDS
Penultimate fully connected layer -> SVM
Objects Segmentation: SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
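The pixel IU (intersection over union, or Jaccard index) for one category can be computed as in the sketch below. The label maps are toy values and the function name is an assumption.

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labelled `category`
    in the predicted and ground-truth label maps."""
    intersection = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_pred, in_truth = p == category, t == category
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    return intersection / union if union else float("nan")


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
iu = pixel_iu(pred, truth, category=1)   # 2 shared pixels out of 4 in the union
```

Benchmarks typically report this value per category and then average it over categories.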
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74
Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)
Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75
Faces Recognition FaceNetby means of well chosen triplets using curriculum learning
Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79
Faces Recognition FaceNet
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80
Faces Recognition FaceNet Test
LBW 9963 (new record)YouTubeFaces DB 9512
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81
Faces Recognition FaceNet SoftwareSoftware implementation OpenFace
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Objects Recognition Retrieval
(336x256) Resolution
conv5_1 from VGG16 [1]
(42x32)
25K centroids, 25K-D vector
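The pipeline on this slide (a grid of local conv features, each assigned to one of the centroids, giving a sparse vector with one dimension per centroid) can be sketched as follows. This is a minimal sketch with assumed names and a toy 2x2 grid with 3 centroids instead of 42x32 and 25K; the brute-force nearest-centroid search is used only for clarity.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(feature, centroids[k]))


def assignment_map(local_features, centroids):
    """Replace each local feature of an H x W grid by its visual word index."""
    return [[nearest_centroid(f, centroids) for f in row] for row in local_features]


def bow_histogram(assignments):
    """Sparse bag of words: visual word index -> number of occurrences."""
    hist = {}
    for row in assignments:
        for word in row:
            hist[word] = hist.get(word, 0) + 1
    return hist


# Toy data (hypothetical): 3 centroids and a 2x2 grid of 2-D local features.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
grid = [
    [[0.1, 0.0], [0.9, 1.2]],
    [[4.8, 5.1], [1.1, 0.8]],
]
words = assignment_map(grid, centroids)
hist = bow_histogram(words)
```

Keeping the assignment map, and not only the histogram, preserves where each visual word occurs in the image, which is what a local search over a region would need.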
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters are shared by the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh. Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
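The upsampling and concatenation step can be sketched as below: the coarser feature maps are brought back to the resolution of the finest one and stacked per pixel. This is only an illustration under assumptions: nearest-neighbour upsampling, factors of 2 and 4, and one channel per scale are choices made for the example, not details confirmed by the slides.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an H x W x C map by an integer factor."""
    out = []
    for row in fmap:
        wide = [list(px) for px in row for _ in range(factor)]
        for _ in range(factor):
            out.append([list(px) for px in wide])
    return out


def concat_channels(maps):
    """Concatenate same-sized H x W x C_k maps along the channel axis."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum((m[y][x] for m in maps), []) for x in range(width)]
            for y in range(height)]


# Toy outputs of the three scales (hypothetical values, 1 channel each).
fine = [[[float(4 * y + x)] for x in range(4)] for y in range(4)]  # 4x4
mid = [[[10.0], [11.0]], [[12.0], [13.0]]]                         # 2x2
coarse = [[[20.0]]]                                                # 1x1

dense = concat_channels(
    [fine, upsample_nearest(mid, 2), upsample_nearest(coarse, 4)]
)
```

After this step every pixel carries a feature vector that mixes fine detail with wider context, which is the input to the per-pixel classifier.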
Objects Segmentation Farabet
Pixel-wise softmax classifier
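A pixel-wise softmax classifier turns the class scores of every pixel into a probability distribution and labels the pixel with the most probable class. The sketch below assumes the scores are already available as an H x W x K nested list; the function names and the toy scores are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the class scores of one pixel."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(score_map):
    """Return per-pixel class probabilities and the arg-max label map."""
    probs = [[softmax(px) for px in row] for row in score_map]
    labels = [[p.index(max(p)) for p in row] for row in probs]
    return probs, labels


# Toy 1x2 image with 3 classes (hypothetical scores).
score_map = [[[2.0, 1.0, 0.1], [0.0, 0.0, 3.0]]]
probs, labels = predict_labels(score_map)
```

Because each pixel is decided independently, nothing forces neighbouring pixels to agree on a label.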
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
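The selection rule above (for each leaf, pick the component with minimal cost S on the path to the root) can be sketched on a small tree. The tree, the node names and the cost values below are hypothetical and do not reproduce the figure on the slide; how S is computed is left outside the sketch.

```python
def optimal_cover(parent, cost, leaves):
    """Map each leaf to the minimal-cost node on its leaf-to-root path."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: 4 leaves, 2 intermediate regions, 1 root.
parent = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "A": "root", "B": "root"}
cost = {"p1": 0.5, "p2": 0.4, "p3": 0.3, "p4": 0.8,
        "A": 0.2, "B": 0.7, "root": 0.9}

cover = optimal_cover(parent, cost, ["p1", "p2", "p3", "p4"])
```

In this toy tree, p1 and p2 are both explained by region A, p3 keeps its own leaf, and p4 is explained by region B, so different pixels end up observed at different levels.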
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2]
The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector B: a BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2; both are concatenated as [1 2]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
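The pixel IU (Jaccard index) of a category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either one. A minimal sketch, assuming both label maps are same-sized nested lists of integer labels (the toy maps are made up):

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of one category between two label maps."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")


# Toy 2x3 label maps (hypothetical): 1 = object, 0 = background.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

iu_object = pixel_iu(pred, truth, 1)
iu_background = pixel_iu(pred, truth, 0)
```

Averaging the per-category values gives a single score for a whole labeling.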
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Faces Recognition FaceNet
by means of well-chosen triplets, using curriculum learning
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
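To make the role of the triplets concrete, the sketch below computes a triplet loss on embeddings: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The 2-D embeddings, the function names and the margin value of 0.2 are assumptions for illustration; the selection of triplets and the curriculum are not modelled.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Toy 2-D embeddings (hypothetical).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
easy_negative = [1.0, 0.0]   # already far away: contributes no loss
hard_negative = [0.2, 0.0]   # almost as close as the positive: contributes loss

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Triplets with zero loss give no training signal, which is one reason the choice of triplets matters.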
Faces Recognition FaceNet
Architecture 1 (NN1): ZF
Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)
Faces Recognition FaceNet
Architecture 2 (NN2): GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]
Objects Recognition Retrieval
Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
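Because bag-of-visual-words vectors are highly sparse, an inverted file can store, for each visual word, only the IDs of the images that contain it. The sketch below is an assumed minimal version: the toy database, the function names and the ranking by number of shared words are illustrative choices, not the exact scoring used in any cited system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 10: [3, 6], 12: [1, 3]}
index = build_inverted_file(database)
ranking = query(index, [1, 3])
```

Only the lists of the words present in the query are visited, so images that share no word with the query are never touched.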
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77
Faces Recognition FaceNet
Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer
VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)
Architecture 1 (NN1) ZF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78
Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet
Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent
Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)
Faces Recognition FaceNet Test
LFW: 99.63% (new record)
YouTube Faces DB: 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC) 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués, "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space:
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE (table: visual word → IDs of the images that contain it)
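The inverted file idea on this slide can be sketched in a few lines. This is only an illustration, assuming each image has already been reduced to a list of visual-word IDs; the names `build_inverted_file`, `query` and the toy database are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs containing it.

    image_words: dict {image_id: iterable of visual-word IDs}. This input
    format is an assumption (e.g. SIFT descriptors already quantized).
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def query(index, query_words):
    """Rank images by how many distinct query words they share."""
    scores = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            scores[image_id] += 1
    # Highest score first; ties broken by image ID so the order is stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy database (made-up IDs): image -> visual words found in it.
toy_db = {1: [1, 2, 2], 10: [3, 6], 12: [1, 3]}
toy_index = build_inverted_file(toy_db)
ranking = query(toy_index, [1, 3])
```

Only the posting lists of the words present in the query are visited, which is why a sparse bag-of-words vector pairs well with this structure.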
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
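Once every image is described by one global vector (for instance the activations of a fully connected layer), retrieval reduces to nearest-neighbour ranking. A minimal NumPy sketch, assuming cosine similarity as the ranking measure (a common choice, not stated on the slide); `rank_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np


def rank_by_cosine(query_vec, db_vecs):
    """Return database indices sorted by decreasing cosine similarity.

    query_vec: (D,) global descriptor of the query image.
    db_vecs:   (N, D) global descriptors of the database images.
    """
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity per database image
    order = np.argsort(-sims)          # best match first
    return order, sims[order]


# Toy 2-D descriptors standing in for FC-layer activations.
toy_db_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_order, toy_sims = rank_by_cosine(np.array([2.0, 0.1]), toy_db_vecs)
```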
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks: sum/max pooled conv features as global representation
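Sum and max pooling collapse the spatial dimensions of a convolutional feature map into one value per channel. The sketch below assumes a channels-first `(C, H, W)` array and adds L2 normalization, which is an assumption of this example; `pool_conv_features` is a hypothetical name.

```python
import numpy as np


def pool_conv_features(fmap, mode="sum"):
    """Pool a (C, H, W) feature map into an L2-normalized C-dim descriptor."""
    if mode == "sum":
        vec = fmap.sum(axis=(1, 2))    # one sum per channel
    elif mode == "max":
        vec = fmap.max(axis=(1, 2))    # one maximum per channel
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Toy map with 2 channels of 2x2 activations.
toy_fmap = np.arange(8, dtype=float).reshape(2, 2, 2)
sum_desc = pool_conv_features(toy_fmap, "sum")
max_desc = pool_conv_features(toy_fmap, "max")
```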
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks: conv features encoded with VLAD as global representation
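VLAD aggregates, for each codebook centroid, the residuals of the local features assigned to it. A minimal sketch, assuming hard assignment to the nearest centroid and a single global L2 normalization (published variants add further normalization steps); `vlad_encode` and the toy data are hypothetical.

```python
import numpy as np


def vlad_encode(local_feats, centroids):
    """Encode (N, D) local features against (K, D) centroids as a K*D vector."""
    # Squared distance of every feature to every centroid, shape (N, K).
    d2 = ((local_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    vlad = np.zeros(centroids.shape)
    for k in range(len(centroids)):
        members = local_feats[nearest == k]
        if len(members):
            # Accumulate residuals with respect to the assigned centroid.
            vlad[k] = (members - centroids[k]).sum(axis=0)
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_feats = np.array([[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]])
toy_vlad = vlad_encode(toy_feats, toy_centroids)
```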
Objects Recognition Retrieval
Image resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids: 25K-D vector
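A 42x32 map yields 42 * 32 = 1344 local features, so at most 1344 of the 25K bins can be non-zero, which is why the resulting vector is sparse. The sketch below shows the assignment step on toy sizes; `bow_from_feature_map` is a hypothetical name, and the brute-force distance computation is only practical for small codebooks.

```python
import numpy as np


def bow_from_feature_map(fmap, centroids):
    """Quantize an (H, W, D) feature map against (K, D) centroids.

    Returns the (H, W) map of visual-word IDs and the K-bin histogram.
    """
    h, w, d = fmap.shape
    feats = fmap.reshape(-1, d)
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                      # nearest centroid per location
    hist = np.bincount(words, minlength=len(centroids))
    return words.reshape(h, w), hist


toy_map = np.array([[[0.0, 0.0], [0.1, 0.0]],
                    [[5.0, 5.0], [5.0, 5.2]]])
toy_codebook = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]])
assignment_map, bow_hist = bow_from_feature_map(toy_map, toy_codebook)
```

Keeping the assignment map, and not only the histogram, is what allows a descriptor to be computed for any sub-window of the image without re-running the network.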
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets:
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
Objects Segmentation Farabet
Pixel-wise soft-max classifier
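The two steps above (upsample the coarser scales, concatenate, then classify every pixel independently) can be sketched as follows. This is a toy illustration with random features: nearest-neighbour upsampling, the linear classifier and all sizes are assumptions of the example and are not taken from the paper.

```python
import numpy as np


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an (H, W, D) map by an integer factor."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)


def pixelwise_softmax(features, weights, bias):
    """Apply the same linear soft-max classifier at every pixel.

    features: (H, W, D), weights: (D, C), bias: (C,) -> probabilities (H, W, C)
    """
    logits = features @ weights + bias
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
factors = (1, 2, 4)                                        # three spatial scales
scale_maps = [rng.normal(size=(8 // f, 8 // f, 4)) for f in factors]
stacked = np.concatenate(
    [upsample_nearest(m, f) for m, f in zip(scale_maps, factors)], axis=-1
)
probs = pixelwise_softmax(stacked, rng.normal(size=(12, 3)), np.zeros(3))
pixel_labels = probs.argmax(axis=-1)                       # one class per pixel
```

Because each pixel is classified on its own, neighbouring pixels can receive different labels.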
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
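The superpixel step can be illustrated with a small sketch: per-pixel class probabilities are averaged inside each superpixel and the winning class is assigned to all of its pixels. The function name and toy values are hypothetical, and the CRF of Solution 2 is not modelled here.

```python
import numpy as np


def label_by_superpixel(probs, superpixels):
    """Assign one class per superpixel from per-pixel class probabilities.

    probs:       (H, W, C) per-pixel class distributions.
    superpixels: (H, W) integer superpixel IDs.
    """
    labels = np.empty(superpixels.shape, dtype=int)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        # Average the distributions inside the superpixel, keep the best class.
        labels[mask] = probs[mask].mean(axis=0).argmax()
    return labels


toy_probs = np.array([[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]])
toy_superpixels = np.array([[0, 0, 1, 1]])
smoothed = label_by_superpixel(toy_probs, toy_superpixels)
raw = toy_probs.argmax(axis=-1)
```

In the toy row, the second pixel would be labelled 1 on its own but is pulled to class 0 by its superpixel.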
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
C2 will be labelled with the class of C5
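The optimal-cover rule can be written as a walk from each leaf to the root, keeping the node with the lowest cost. The tree, node names and cost values below are made up for illustration, chosen so that leaf C2 selects C5 as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the ancestor (or the leaf itself) with minimal cost.

    parent: dict node -> parent node (the root maps to None).
    cost:   dict node -> cost S of that component.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)    # move one level up towards the root
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are leaves.
toy_parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
              "C5": "C7", "C6": "C7", "C7": None}
toy_cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C4": 0.4,
            "C5": 0.3, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(toy_parent, toy_cost, ["C1", "C2", "C3", "C4"])
```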
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
BBOX CNN: feature vector 1
BBOX CNN: feature vector 2
Vector A = [1 2]
Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
Background masked out with the mean image
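Masking the background with the mean image means that, after the usual mean subtraction, background pixels contribute zeros to the network input. A minimal NumPy sketch; `mask_background`, the crop values and the per-channel mean are placeholder assumptions.

```python
import numpy as np


def mask_background(crop, fg_mask, mean_image):
    """Replace background pixels of a bbox crop with the mean image.

    crop:       (H, W, 3) image crop.
    fg_mask:    (H, W) boolean mask, True on the region foreground.
    mean_image: (H, W, 3) mean image or (3,) per-channel mean.
    """
    return np.where(fg_mask[..., None], crop, mean_image)


toy_crop = np.full((1, 2, 3), 200.0)
toy_mask = np.array([[True, False]])
toy_mean = np.array([104.0, 117.0, 123.0])     # placeholder per-channel mean
masked = mask_background(toy_crop, toy_mask, toy_mean)
centered = masked - toy_mean                   # background becomes exactly zero
```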
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Training: 2 networks trained in isolation
Testing: results are combined
BBOX CNN: feature vector 1
REGION CNN: feature vector 2
Vector B = [1 2]
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Vector C
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
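Pixel IU for one class is the number of pixels where prediction and ground truth agree on that class, divided by the number of pixels where either of them has it. A minimal sketch; `pixel_iu` and the toy label maps are hypothetical, and returning NaN for absent classes is a choice of this example.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union (Jaccard index) of two label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            ious.append(float("nan"))              # class absent in both maps
        else:
            ious.append(np.logical_and(p, g).sum() / union)
    return ious


toy_pred = np.array([[0, 0], [1, 1]])
toy_gt = np.array([[0, 1], [1, 1]])
toy_ious = pixel_iu(toy_pred, toy_gt, num_classes=2)
```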
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Faces Recognition FaceNet Test
LFW 99.63% (new record)
YouTube Faces DB 95.12%
Faces Recognition FaceNet Software
Software implementation: OpenFace
Faces Recognition VGG Face
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]
Objects Recognition Retrieval
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (instance: object, building, person, place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
v_1 = (v_11, …, v_1n)
v_k = (v_k1, …, v_kn)
N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
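The inverted file on this slide can be illustrated with a minimal sketch. It assumes each image has already been reduced to a list of visual word ids; the function names, the toy database and the vote-counting score are illustrative assumptions, not the system described in the lecture.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """Map each visual word id to the sorted list of image ids containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def query(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return votes.most_common()


# Toy database (made-up values): image id -> visual words found in it.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
index = build_inverted_file(image_words)
ranking = query(index, [1, 3])  # image 12 shares both query words
```

Because the Bag of Visual Words vector is highly sparse, only the images listed under the query's words are ever touched, which is what makes the lookup scale.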
Objects Recognition Retrieval
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Objects Recognition Retrieval
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Objects Recognition Retrieval
Convolutional Neural Networks: sum/max pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
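A minimal sketch of sum and max pooling over a convolutional feature map, in pure Python. The nested-list layout (channels, then rows, then columns) and the final L2 normalisation are assumptions made for illustration; the cited papers add their own weighting and post-processing steps that are not reproduced here.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Collapse a C x H x W feature map into one C-dimensional descriptor."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    # One value per channel: sum (or max) over all spatial positions.
    descriptor = [reduce_fn(v for row in channel for v in row)
                  for channel in feature_map]
    # L2 normalisation (assumed) so descriptors can be compared by dot product.
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor


# Toy feature map with 2 channels of size 2 x 2 (made-up activations).
feature_map = [
    [[3.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
sum_descriptor = pool_conv_features(feature_map, "sum")
max_descriptor = pool_conv_features(feature_map, "max")
```

Whatever the spatial size of the map, the descriptor length equals the number of channels, which is why pooled conv features work as a compact global representation.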
Objects Recognition Retrieval
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
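VLAD encoding can be sketched as follows: each local feature is assigned to its nearest centroid, the residuals are accumulated per centroid, and the concatenation is L2-normalised. This is a generic textbook version written under those assumptions, with made-up centroids; it is not the exact pipeline of the cited paper.

```python
import math


def vlad(local_features, centroids):
    """Encode a set of D-dim local features as a K*D-dim VLAD vector."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for feature in local_features:
        # Hard assignment to the nearest centroid (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(feature, centroids[j])))
        for d in range(dim):
            residuals[k][d] += feature[d] - centroids[k][d]
    flat = [v for residual in residuals for v in residual]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: 2 centroids in 2-D and 3 local features (made-up values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
local_features = [[1.0, 0.0], [0.0, 1.0], [10.0, 13.0]]
encoding = vlad(local_features, centroids)
```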
Objects Recognition Retrieval
Resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids: 25K-D vector
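The step from local conv features to a bag of words can be sketched as a nearest-centroid assignment followed by a histogram. With a 42x32 map there are 42 x 32 = 1344 local features per image, so at most 1344 of the 25K histogram bins can be non-zero, which is why the vector is sparse. The function names and toy values below are assumptions for illustration only.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bag_of_words(local_features, centroids):
    """Histogram of visual word assignments, one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Toy codebook with 3 centroids instead of 25K (made-up values).
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
local_features = [[1.0, 1.0], [9.0, 9.0], [0.0, 1.0], [8.0, 11.0]]
histogram = bag_of_words(local_features, centroids)
```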
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
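A minimal sketch of the upsampling and concatenation step: the coarser feature maps are brought to the resolution of the finest one and all channels are stacked. Nearest-neighbour upsampling and the nested-list layout are assumptions chosen to keep the example short; the slides do not state which interpolation is used.

```python
def upsample_nearest(channel, factor):
    """Repeat every value of an H x W channel `factor` times in both axes."""
    return [[v for v in row for _ in range(factor)]
            for row in channel for _ in range(factor)]


def upsample_and_concat(pyramid_maps, factors):
    """Stack the channels of all scales after upsampling to a common size."""
    stacked = []
    for channels, factor in zip(pyramid_maps, factors):
        stacked.extend(upsample_nearest(c, factor) for c in channels)
    return stacked


# Toy pyramid: one 2 x 2 channel at the fine scale, one 1 x 1 at the coarse one.
fine = [[[1, 2], [3, 4]]]
coarse = [[[9]]]
features = upsample_and_concat([fine, coarse], factors=[1, 2])
```

After this step every pixel carries a feature vector that mixes fine detail with wider context from the coarser scales.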
Objects Segmentation Farabet
Pixel-wise soft-max classifier
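A pixel-wise soft-max can be sketched as a soft-max applied independently to the class scores of each pixel, followed by an argmax to obtain the label map. The H x W x K nested-list layout and the toy scores are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_softmax(score_map):
    """Apply the soft-max independently at every pixel of an H x W x K map."""
    return [[softmax(pixel) for pixel in row] for row in score_map]


def label_map(prob_map):
    """Most likely class index at every pixel."""
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in prob_map]


# Toy 1 x 2 image with 2 classes (made-up scores).
score_map = [[[0.0, 0.0], [0.0, 2.0]]]
prob_map = pixelwise_softmax(score_map)
labels = label_map(prob_map)
```

Each pixel is classified on its own here, which is exactly why neighbouring labels need not agree with one another.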
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
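One simple way to use superpixels for spatial consistency is to pool the per-pixel class distributions inside each superpixel and give every pixel of the superpixel the winning class. The sketch below assumes that rule and a precomputed superpixel id map; it is an illustration and not necessarily the exact procedure of the paper.

```python
def smooth_with_superpixels(prob_map, superpixels):
    """Assign to each pixel the class with the largest summed probability
    inside its superpixel (equivalent to the largest mean probability)."""
    sums = {}
    for prob_row, sp_row in zip(prob_map, superpixels):
        for probs, sp in zip(prob_row, sp_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for k, p in enumerate(probs):
                acc[k] += p
    best = {sp: max(range(len(acc)), key=acc.__getitem__)
            for sp, acc in sums.items()}
    return [[best[sp] for sp in sp_row] for sp_row in superpixels]


# Toy 1 x 3 image, 2 classes; the first two pixels share superpixel 0.
prob_map = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
superpixels = [[0, 0, 1]]
labels = smooth_with_superpixels(prob_map, superpixels)
```

In the toy example the middle pixel would be labelled 1 on its own, but it takes label 0 from its superpixel.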
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
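The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the node with the smallest cost S. The tree, the node names and the cost values below are hypothetical, chosen only so that leaf C2 ends up covered by C5 as in the slide's example; how S is computed is not covered here.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node with minimal cost on its path to the root."""
    cover = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # None once the root has been visited
        cover[leaf] = choice
    return cover


# Hypothetical segmentation tree: C6 is the root, C5 groups leaves C1 and C2.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.1, "C5": 0.3, "C6": 0.8}
cover = optimal_cover(parent, cost, leaves=["C1", "C2", "C3"])
```

Each leaf then takes the class predicted for its covering component, so different pixels can be labelled at different observation levels.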
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
[Diagram, system A: the bounding box and the region foreground (background masked out with the mean image) each go through the BBOX CNN, giving feature vectors 1 and 2; their concatenation [1 2] is vector A]
The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
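Masking the background with the mean image can be sketched as below: pixels outside the region mask are replaced by the corresponding mean-image value, so that they become zero once the mean is subtracted before the CNN. The single-channel nested-list images and the helper names are assumptions for illustration.

```python
def mask_background(crop, mask, mean_image):
    """Keep foreground pixels; replace background pixels by the mean image."""
    return [[px if fg else mu
             for px, fg, mu in zip(crop_row, mask_row, mean_row)]
            for crop_row, mask_row, mean_row in zip(crop, mask, mean_image)]


def subtract_mean(image, mean_image):
    """Mean subtraction, as typically applied before feeding a CNN."""
    return [[px - mu for px, mu in zip(row, mean_row)]
            for row, mean_row in zip(image, mean_image)]


# Toy 2 x 2 grayscale crop (made-up values); mask is 1 on the region foreground.
crop = [[200, 50], [60, 180]]
mask = [[1, 0], [0, 1]]
mean_image = [[120, 110], [115, 125]]
masked = mask_background(crop, mask, mean_image)
centered = subtract_mean(masked, mean_image)
```

The background positions end up exactly zero in `centered`, so only the region foreground contributes to the extracted feature vector.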
Objects Segmentation SDS
[Diagram, system B: the BBOX CNN gives feature vector 1 and a separate REGION CNN gives feature vector 2; their concatenation [1 2] is vector B]
Training: the 2 networks are trained in isolation
Testing: results are combined
Objects Segmentation SDS
[Diagram, system C: vector C]
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
penultimate fully connected layer → SVM
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al)
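The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch, assuming both label maps are nested lists of the same size:

```python
def pixel_iu(predicted, truth, category):
    """Intersection over union of one category between two label maps."""
    intersection = union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += (p == category and t == category)
            union += (p == category or t == category)
    return intersection / union if union else float("nan")


# Toy 2 x 2 label maps (made-up values); category 1 is the object.
predicted = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
iu_object = pixel_iu(predicted, truth, category=1)
iu_background = pixel_iu(predicted, truth, category=0)
```

Averaging this value over all categories gives a mean IU score for the whole labeling.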
Thank you
Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82
Faces Recognition VGG Face
Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016
83
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84
Objects Recognition Retrieval
Image Database
Visual Query
ldquoA dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85
Objects Recognition Retrieval
Image Database
Visual Query
ldquoThis dogrdquo
Expected outcome
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
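For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal pure-Python sketch follows; the function name `pixel_iu` and the toy label maps are made up for the example.

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labelled `category` in two label maps."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == category
            in_truth = t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # The index is undefined when the category appears in neither map.
    return inter / union if union else float("nan")

# Hypothetical 2x3 label maps (0 = background, 1 = object category).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iu = pixel_iu(pred, truth, category=1)  # 2 shared pixels out of 4 in the union
```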
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor and F. Marqués, "Bags of Local Convolutional Features for Scalable Instance Search", ICMR 2016
Objects Recognition Retrieval
Objects Recognition Retrieval
Image Database
Visual Query: "A dog"
Expected outcome
Objects Recognition Retrieval
Image Database
Visual Query: "This dog"
Expected outcome
Objects Recognition Retrieval
Instance Retrieval (Instance: Object, Building, Person, Place…)
Objects Recognition Retrieval
Local hand-crafted features (e.g. SIFT)
N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
Bag of Visual Words: high-dimensional, highly sparse
INVERTED FILE
word | Image ID
1 1 12 2 1 30 1023 10 124 23 6 10
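Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the IDs of the images that contain it. The sketch below shows the idea with a toy database; the function names, the word IDs and the simple vote-counting score are assumptions for illustration, not the scoring used in the lecture.

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

def search(inverted, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    return votes.most_common()

# Hypothetical toy database: image ID -> visual words of its local features.
image_words = {1: [1, 2, 2], 10: [3, 4], 12: [1, 3]}
inverted = build_inverted_file(image_words)
ranking = search(inverted, [1, 3])  # image 12 contains both query words
```

Only the posting lists of the query's words are visited, so search cost depends on the query rather than on the vocabulary size.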
Objects Recognition Retrieval
Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)
Convolutional Neural Networks
Objects Recognition Retrieval
Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014
Convolutional Neural Networks: FC layers as global feature representation
Objects Recognition Retrieval
Babenko, A. & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015
Tolias, G., Sicre, R. & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015
Kalantidis, Y., Mellina, C. & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
Convolutional Neural Networks
sum/max pooled conv features as global representation
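A minimal NumPy sketch of sum and max pooling a convolutional feature map into one global descriptor is given below. The function name `global_descriptor`, the random activations and the final L2 normalisation are assumptions for the example; the cited papers add further steps (weighting, whitening, regional pooling) that are not shown.

```python
import numpy as np

def global_descriptor(feature_map, mode="sum"):
    """Pool a conv feature map of shape (H, W, C) into one L2-normalised C-D vector."""
    if mode == "sum":
        pooled = feature_map.sum(axis=(0, 1))
    elif mode == "max":
        pooled = feature_map.max(axis=(0, 1))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

rng = np.random.default_rng(0)
feature_map = rng.random((32, 42, 512))  # hypothetical H x W x C activations
sum_vec = global_descriptor(feature_map, "sum")
max_vec = global_descriptor(feature_map, "max")
```

Either way the descriptor has one value per channel, so its size does not depend on the image resolution.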
Objects Recognition Retrieval
Ng, J., Yang, F. & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
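VLAD assigns each local feature to its nearest centroid and accumulates the residuals (feature minus centroid) per centroid. The sketch below is a plain NumPy version under simple assumptions: the name `vlad`, the tiny 2-D features, and the single global L2 normalisation are illustrative choices, and the centroids would normally come from k-means.

```python
import numpy as np

def vlad(local_features, centroids):
    """Encode (N, D) local features against (K, D) centroids into a K*D vector."""
    # Squared Euclidean distance from every feature to every centroid: (N, K).
    dists = ((local_features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    encoding = np.zeros_like(centroids, dtype=float)
    for k in range(centroids.shape[0]):
        assigned = local_features[nearest == k]
        if len(assigned):
            # Accumulate residuals of the features assigned to centroid k.
            encoding[k] = (assigned - centroids[k]).sum(axis=0)
    encoding = encoding.ravel()
    norm = np.linalg.norm(encoding)
    return encoding / norm if norm > 0 else encoding

# Hypothetical toy data: 2 centroids and 3 local features in 2-D.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
local_features = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
code = vlad(local_features, centroids)  # residual sums (1, 1) and (-1, 0), then normalised
```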
Objects Recognition Retrieval
Resolution (336x256)
conv5_1 from VGG16 [1] (42x32)
25K centroids, 25K-D vector
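The pipeline on this slide can be sketched as follows: each location of the conv feature map is treated as a local feature, assigned to its nearest centroid (visual word), and the image becomes a histogram with one bin per centroid. Everything concrete below is an assumption for illustration: random activations stand in for conv5_1 features, 42x32 is read as width x height, and 100 random centroids replace the 25K learned ones to keep the toy small.

```python
import numpy as np

def bag_of_words(feature_map, centroids):
    """Turn an (H, W, C) feature map into an assignment map and a K-bin histogram."""
    h, w, c = feature_map.shape
    local = feature_map.reshape(h * w, c)  # one C-D local feature per location
    # ||x - m||^2 = ||x||^2 - 2 x.m + ||m||^2, avoiding an (N, K, C) intermediate.
    dists = (
        (local ** 2).sum(axis=1)[:, None]
        - 2.0 * local @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    words = dists.argmin(axis=1)  # visual word per location
    histogram = np.bincount(words, minlength=centroids.shape[0])
    return words.reshape(h, w), histogram

rng = np.random.default_rng(0)
feature_map = rng.random((32, 42, 512))  # stand-in for a 42x32 map with 512 channels
centroids = rng.random((100, 512))       # stand-in for centroids learned with k-means
assignment_map, histogram = bag_of_words(feature_map, centroids)
```

Keeping the assignment map (one word per location) is what allows a histogram to be built later for any sub-window of the image, not just the whole frame.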
Objects Recognition Retrieval: Query Representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clément, Camille Couprie, Laurent Najman and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Segmentation Farabet
Upsampling and concatenation
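The upsampling and concatenation step can be illustrated with NumPy: the feature maps produced at the coarser scales are brought back to the finest resolution and stacked along the channel axis, so every pixel gets one multiscale feature vector. Nearest-neighbour upsampling, the map sizes and the function names are assumptions chosen to keep the sketch short.

```python
import numpy as np

def upsample_nearest(feature_map, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map by an integer factor."""
    return feature_map.repeat(factor, axis=0).repeat(factor, axis=1)

def multiscale_features(maps, factors):
    """Bring per-scale maps to a common resolution and concatenate their channels."""
    upsampled = [upsample_nearest(m, f) for m, f in zip(maps, factors)]
    return np.concatenate(upsampled, axis=2)

# Hypothetical three-scale pyramid: full, half and quarter resolution, 4 channels each.
maps = [np.ones((8, 8, 4)), 2 * np.ones((4, 4, 4)), 3 * np.ones((2, 2, 4))]
features = multiscale_features(maps, factors=[1, 2, 4])
```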
Objects Segmentation Farabet
Pixel-wise soft-max classifier
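A pixel-wise soft-max turns the class scores at each pixel into a probability distribution over categories, independently of the neighbouring pixels. A minimal NumPy sketch, with made-up scores and a hypothetical function name:

```python
import numpy as np

def pixelwise_softmax(scores):
    """Softmax over the last axis of an (H, W, num_classes) score map."""
    shifted = scores - scores.max(axis=2, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=2, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6, 3))  # hypothetical class scores for 3 categories
probs = pixelwise_softmax(scores)
labels = probs.argmax(axis=2)        # per-pixel category label
```

Since every pixel is classified on its own, nothing in this step enforces agreement between neighbouring labels.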
Objects Segmentation Farabet
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Solution 1: Superpixels
Prediction with a 2-layer network
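One simple way to use superpixels for spatial consistency is to average the per-pixel class distributions inside each superpixel and give all its pixels the winning class. The slide does not spell out the rule, so the averaging below is an assumption, as are the function name and the toy values.

```python
import numpy as np

def label_superpixels(probs, superpixels):
    """Average per-pixel class distributions in each superpixel, then take the argmax."""
    labels = np.empty(superpixels.shape, dtype=int)
    for sp in np.unique(superpixels):
        inside = superpixels == sp
        mean_dist = probs[inside].mean(axis=0)  # shape (num_classes,)
        labels[inside] = mean_dist.argmax()
    return labels

# Toy example: one row of 4 pixels, 2 classes, 2 superpixels.
probs = np.array([[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]])
superpixels = np.array([[0, 0, 1, 1]])
labels = label_superpixels(probs, superpixels)
```

In the toy row, the second pixel would be class 1 on its own but is relabelled to class 0 because its superpixel as a whole favours class 0.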
Objects Segmentation Farabet
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: observation level
BPT [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
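The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the node names and the cost values below are hypothetical and chosen so that leaf C2 ends up covered by C5, as in the slide's example; how the cost S is computed is not shown.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node = leaf
        best_node = leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]  # move one level up; the root has parent None
        best[leaf] = best_node
    return best

# Hypothetical tree: root C7 -> (C5, C6); C5 -> (C1, C2); C6 -> (C3, C4).
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.3,
        "C5": 0.25, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so different pixels can be labelled at different observation levels.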
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86
Instance Retrieval(Instance Object Building Person Placehellip)
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks
sum/max pooled conv features as global representation
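A minimal sketch of sum and max pooling of convolutional features into one global descriptor. The random array stands in for the activations of a conv layer, the shape (512 channels on a 16x21 grid) is an assumption, and `pooled_descriptor` is a hypothetical helper; the cited papers add further steps (weighting, whitening, regional pooling) that are not shown.

```python
import numpy as np

def pooled_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) activation volume into one L2-normalized C-D vector."""
    if mode == "sum":
        pooled = conv_features.sum(axis=(1, 2))
    elif mode == "max":
        pooled = conv_features.max(axis=(1, 2))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

rng = np.random.default_rng(0)
# Stand-in for non-negative conv activations (assumed shape).
features = rng.random((512, 16, 21))
sum_desc = pooled_descriptor(features, "sum")
max_desc = pooled_descriptor(features, "max")
# Dot product of unit vectors = cosine similarity, usable to rank images.
similarity = float(sum_desc @ max_desc)
```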
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
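A sketch of VLAD encoding, treating each spatial location of a conv layer as one local feature. The centroids would normally come from k-means on training features; here they are random, and the sizes are assumptions chosen to keep the example small. Only a global L2 normalization is applied.

```python
import numpy as np

def vlad(local_features, centroids):
    """Encode (N, D) local features against (K, D) centroids into a K*D vector."""
    # Squared distances between every feature and every centroid: (N, K).
    diffs = local_features[:, None, :] - centroids[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
    encoding = np.zeros_like(centroids, dtype=float)
    for k in range(centroids.shape[0]):
        assigned = local_features[nearest == k]
        if len(assigned):
            # Accumulate the residuals of the features assigned to centroid k.
            encoding[k] = (assigned - centroids[k]).sum(axis=0)
    encoding = encoding.ravel()
    norm = np.linalg.norm(encoding)
    return encoding / norm if norm > 0 else encoding

rng = np.random.default_rng(0)
centroids = rng.standard_normal((4, 8))         # K=4 visual words, D=8 (assumed)
local_features = rng.standard_normal((100, 8))  # e.g. 100 conv locations
descriptor = vlad(local_features, centroids)    # K*D = 32 dimensions
```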
Resolution (336x256)
conv5_1 from VGG16 [1] (42x32)
25K centroids, 25K-D vector
Query Representation
Global Search (GS)
Local Search (LS)
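The slide only names the two query modes. The sketch below assumes one plausible reading: in Global Search the query bag of words is built from every location of the query's assignment map (the map of visual word IDs, one per conv location), while in Local Search only the locations inside the query bounding box are counted. The map, the vocabulary size and `bow_from_assignments` are hypothetical.

```python
import numpy as np

def bow_from_assignments(assignment_map, n_words, box=None):
    """Histogram of visual words over the whole map, or inside box=(top, left, bottom, right)."""
    if box is not None:
        top, left, bottom, right = box
        assignment_map = assignment_map[top:bottom, left:right]
    return np.bincount(assignment_map.ravel(), minlength=n_words)

# Hypothetical 4x6 assignment map over a vocabulary of 5 visual words.
assignments = np.array([
    [0, 0, 1, 1, 2, 2],
    [0, 3, 3, 1, 2, 2],
    [4, 3, 3, 1, 0, 0],
    [4, 4, 1, 1, 0, 0],
])
global_query = bow_from_assignments(assignments, n_words=5)
local_query = bow_from_assignments(assignments, n_words=5, box=(1, 1, 3, 3))
```

With a 25K-word vocabulary the same histogram would be the 25K-D sparse vector mentioned above.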
One lecture organized in four parts (local analysis for Proposals, Detection, Recognition, Segmentation)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to all pixels in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Pyramid of three spatial scales
The same parameters in the three convnets:
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
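A rough sketch of the multiscale idea: one filter bank shared by three scales, followed by upsampling and concatenation so that every pixel gets a feature vector. Several simplifications are assumptions of this sketch and not the paper's setup: a single 3x3 conv layer per scale, an average-pooled pyramid, no max pooling, nearest-neighbour upsampling, and random weights. All function names are hypothetical.

```python
import numpy as np

def downsample(image, factor):
    """Average-pool a (H, W) image by an integer factor."""
    h, w = image.shape[0] // factor, image.shape[1] // factor
    blocks = image[:h * factor, :w * factor].reshape(h, factor, w, factor)
    return blocks.mean(axis=(1, 3))

def shared_features(image, filters, bias):
    """Apply the same 3x3 filter bank, bias and tanh at any scale (edge padding)."""
    padded = np.pad(image, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))  # (H, W, 3, 3)
    responses = np.einsum("hwij,kij->khw", windows, filters)
    return np.tanh(responses + bias[:, None, None])

def upsample(features, factor):
    """Nearest-neighbour upsampling of (K, h, w) feature maps."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
image = rng.random((32, 32))
filters = rng.standard_normal((4, 3, 3))  # theta_0, shared by all scales
bias = rng.standard_normal(4)

per_scale = []
for factor in (1, 2, 4):  # pyramid of three spatial scales
    scaled = image if factor == 1 else downsample(image, factor)
    per_scale.append(upsample(shared_features(scaled, filters, bias), factor))
pixel_features = np.concatenate(per_scale, axis=0)  # 3 scales x 4 filters per pixel
```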
Pixel-wise soft-max classifier
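A minimal sketch of a pixel-wise soft-max: class scores are normalized independently at each pixel and the most probable class is taken as the label. The 2-class, 2x2 score volume is made up for illustration.

```python
import numpy as np

def pixelwise_softmax(scores):
    """Softmax over the class axis of a (C, H, W) score volume."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

# Hypothetical scores: 2 classes on a 2x2 image.
scores = np.array([[[2.0, 0.0], [0.0, 0.0]],
                   [[0.0, 1.0], [0.0, 3.0]]])
probs = pixelwise_softmax(scores)
labels = probs.argmax(axis=0)  # ties resolve to the lowest class index
```

Each pixel is labelled on its own, which is why neighbouring labels can disagree.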
Problem: no spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels
Prediction with a 2-layer network
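One way to enforce consistency with superpixels, sketched under the assumption that the per-pixel class distributions are averaged inside each superpixel and every pixel of the superpixel receives the winning class. The probabilities and the superpixel map are toy values, and `smooth_with_superpixels` is a hypothetical name.

```python
import numpy as np

def smooth_with_superpixels(probs, superpixels):
    """Give every pixel the argmax of the mean class distribution of its superpixel."""
    labels = np.empty(superpixels.shape, dtype=int)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        labels[mask] = probs[:, mask].mean(axis=1).argmax()
    return labels

# Hypothetical 2-class probabilities on a 2x3 image.
class0 = np.array([[0.9, 0.4, 0.2],
                   [0.8, 0.6, 0.1]])
probs = np.stack([class0, 1.0 - class0])
superpixels = np.array([[0, 0, 1],
                        [0, 1, 1]])
pixel_labels = probs.argmax(axis=0)                  # noisy per-pixel decision
smooth_labels = smooth_with_superpixels(probs, superpixels)
```

In this toy case two pixels that disagreed with their neighbours are overruled by their superpixel.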
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
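To give an idea of what a CRF over superpixels optimizes, here is a generic Potts-style energy: a unary cost for each superpixel's label plus a penalty for each pair of adjacent superpixels with different labels. This is an illustrative stand-in, not the exact potentials of the paper; the graph, probabilities and `gamma` are assumed values.

```python
import math

def crf_energy(labels, probs, edges, gamma=1.0):
    """Unary cost (-log prob of the chosen label) plus a Potts penalty per disagreeing edge."""
    unary = sum(-math.log(probs[node][label]) for node, label in labels.items())
    pairwise = sum(1 for a, b in edges if labels[a] != labels[b])
    return unary + gamma * pairwise

# Hypothetical chain of three superpixels with 2-class probabilities.
probs = {0: [0.9, 0.1], 1: [0.45, 0.55], 2: [0.8, 0.2]}
edges = [(0, 1), (1, 2)]
noisy = {0: 0, 1: 1, 2: 0}    # independent argmax per superpixel
smooth = {0: 0, 1: 0, 2: 0}   # spatially consistent labeling
energy_noisy = crf_energy(noisy, probs, edges)
energy_smooth = crf_energy(smooth, probs, edges)
```

Inference would search for the labeling with the lowest energy; here the consistent labeling wins once the pairwise term is switched on.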
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: observation level
Binary Partition Tree (BPT) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
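The selection rule can be sketched on a small tree. The tree, the node names and the costs below are hypothetical (the figure from the slide is not available); the code only implements the stated rule: walk from each leaf to the root and keep the component with minimal cost.

```python
def optimal_cover(parent, cost):
    """For every leaf, pick the node with minimal cost on its path to the root."""
    internal = {p for p in parent.values() if p is not None}
    best = {}
    for leaf in (node for node in parent if node not in internal):
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent[node]
        best[leaf] = choice
    return best

# Hypothetical segmentation tree: leaves p1..p4, components C1 and C2, root C3.
parent = {"p1": "C1", "p2": "C1", "p3": "C2", "p4": "C2",
          "C1": "C3", "C2": "C3", "C3": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.9,
        "C1": 0.2, "C2": 0.6, "C3": 0.4}
cover = optimal_cover(parent, cost)
```

In this toy tree the pixels under C2 are explained better by the root C3, so they take its class, while the pixels under C1 keep C1.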
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Vector A: [1 2]
BBOX CNN on the bounding box: feature vector 1
BBOX CNN on the region, background masked out with the mean image: feature vector 2
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
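A sketch of how the two network inputs could be prepared for one candidate: the plain bounding-box crop, and the same crop with every background pixel replaced by the mean image. Image size, box, mask and the constant pixel values are assumptions chosen to make the effect easy to check; `box_and_region_inputs` is a hypothetical helper.

```python
import numpy as np

def box_and_region_inputs(image, mask, box, mean_image):
    """Return the box crop and the crop whose background is replaced by the mean image."""
    top, left, bottom, right = box
    box_crop = image[top:bottom, left:right]
    region_mask = mask[top:bottom, left:right, None]  # broadcast over channels
    mean_crop = mean_image[top:bottom, left:right]
    region_crop = np.where(region_mask, box_crop, mean_crop)
    return box_crop, region_crop

image = np.full((6, 6, 3), 200.0)       # stand-in RGB image
mean_image = np.full((6, 6, 3), 120.0)  # stand-in dataset mean image
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 2:4] = True                   # foreground pixels of the region candidate
box_crop, region_crop = box_and_region_inputs(image, mask, (1, 1, 5, 5), mean_image)
```

After mean subtraction the masked background becomes zero, so only the foreground contributes to feature vector 2.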
Vector B: [1 2]
BBOX CNN: feature vector 1
REGION CNN: feature vector 2
Training: 2 networks trained in isolation
Testing: results are combined
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Penultimate fully connected layer
SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
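A minimal sketch of the pixel IU (Jaccard index) for one image: for each class, the number of pixels where prediction and ground truth agree on that class, divided by the number of pixels where either assigns it. Benchmarks typically accumulate these counts over a whole dataset; the two label maps here are toy values.

```python
import numpy as np

def pixel_iu(pred, target, n_classes):
    """Per-class intersection over union between two label maps (NaN if the class is absent)."""
    ious = np.full(n_classes, np.nan)
    for c in range(n_classes):
        union = np.logical_or(pred == c, target == c).sum()
        if union:
            ious[c] = np.logical_and(pred == c, target == c).sum() / union
    return ious

pred = np.array([[0, 0, 1],
                 [1, 1, 1]])
target = np.array([[0, 1, 1],
                   [1, 1, 0]])
ious = pixel_iu(pred, target, n_classes=2)
mean_iu = float(np.nanmean(ious))
```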
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87
Objects Recognition Retrieval
v1 = (v11 hellip v1n)
vk = (vk1 hellip vkn)
INVERTED FILE
word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
Local hand-crafted features(eg SIFT)
Bag of Visual WordsN-Dimensional
feature space High-dimensionalHighly sparse
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
118
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88
Objects Recognition Retrieval
Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)
Convolutional Neural Networks
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
118
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119
Thank you
httpsimatgeupceduwebpeoplexavier-giro
httpstwittercomDocXavi
httpswwwfacebookcomProfessorXavi
xaviergiroupcedu
Xavier Giroacute-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89
Objects Recognition Retrieval
Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014
Convolutional Neural Networks FC layers as global feature representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90
Objects Recognition Retrieval
Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065
Convolutional Neural Networks
summax pooled conv features as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91
Objects Recognition Retrieval
Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015
Convolutional Neural Networks
conv features encoded with VLAD as global representation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93
Objects Recognition Retrieval
(336x256)Resolution
conv5_1 from VGG16[1]
(42x32)
25K centroids 25K-D vector
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94
Objects Recognition RetrievalQuery Representation
Global Search(GS)
Local Search(LS)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95
Objects Recognition Retrieval
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
One lecture organized in four parts
96
Detection Recognition
Local analysis for
Segmentation
person
bag
me
my bagperson
bag
Proposals
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation
97
Slide credit Eduard Fontdevila
Semantic segmentation assign a category label to all pixels in an image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
[Figure: vector A. A single BBOX CNN produces feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2].]
Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
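Masking the background with the mean image can be sketched as follows: pixels inside the region keep their value and all other pixels are replaced by the corresponding mean-image pixel. The grayscale toy data and the function name are assumptions made for illustration.

```python
def mask_background(image, region_mask, mean_image):
    """Keep pixels inside the region; replace the rest with the mean image."""
    return [
        [pix if inside else mean
         for pix, inside, mean in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, region_mask, mean_image)
    ]


# Hypothetical 2x3 grayscale crop, region mask (1 = foreground), mean image.
image = [[10, 20, 30], [40, 50, 60]]
region_mask = [[0, 1, 1], [0, 0, 1]]
mean_image = [[5, 5, 5], [5, 5, 5]]

masked = mask_background(image, region_mask, mean_image)
print(masked)
```

If the network subtracts the mean image from its input, the masked background becomes zero after that subtraction, so it carries no signal.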
Objects Segmentation SDS
Training: 2 networks trained in isolation
Testing: results are combined
[Figure: vector B. The BBOX CNN produces feature vector 1 and the REGION CNN produces feature vector 2; the two are concatenated as [1 2].]
Objects Segmentation SDS
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
vector C
Objects Segmentation SDS
penultimate fully connected layer
SVM
Objects Segmentation SDS
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
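For reference, the pixel IU (Jaccard index) of one class is the number of pixels labelled with that class in both prediction and ground truth, divided by the number labelled with it in either. A minimal sketch with made-up label maps and a hypothetical function name:

```python
def pixel_iu(pred, truth, cls):
    """Intersection over union of class `cls` between two label maps."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == cls
            in_truth = t == cls
            inter += int(in_pred and in_truth)
            union += int(in_pred or in_truth)
    # The class is absent from both maps: IU is undefined.
    return inter / union if union else float("nan")


# Hypothetical 2x3 label maps with classes 0 and 1.
pred = [[0, 0, 1],
        [0, 1, 1]]
truth = [[0, 1, 1],
         [0, 0, 1]]

print(pixel_iu(pred, truth, 1))  # 2 shared pixels out of 4 in the union
```

Averaging this value over all classes gives a single mean-IU score for a segmentation.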
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Objects Recognition Retrieval
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
Convolutional Neural Networks
sum/max pooled conv features as global representation
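A minimal sketch of the pooling idea: a convolutional feature map of size H x W x C is reduced to one C-dimensional vector by summing (or taking the maximum of) each channel over all spatial positions. The toy feature map and the function name are assumptions; the cited papers add further steps, such as weighting and normalisation, that are not shown here.

```python
def pool_features(fmap, mode="sum"):
    """Collapse an H x W x C feature map into a single C-dimensional vector."""
    channels = len(fmap[0][0])
    # Gather, for every channel, its activations at all spatial positions.
    columns = [[pix[c] for row in fmap for pix in row] for c in range(channels)]
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(col) for col in columns]


# Hypothetical 2x2 feature map with 2 channels.
fmap = [[[1, 0], [2, 3]],
        [[0, 5], [4, 1]]]

sum_vec = pool_features(fmap, "sum")
max_vec = pool_features(fmap, "max")
print(sum_vec, max_vec)
```

The resulting vector has a fixed length regardless of the image size, which is what makes it usable as a global descriptor for retrieval.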
Objects Recognition Retrieval
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision, CVPRW 2015.
Convolutional Neural Networks
conv features encoded with VLAD as global representation
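VLAD encoding can be sketched as follows: each local descriptor is assigned to its nearest centroid, the residuals (descriptor minus centroid) are accumulated per centroid, and the concatenated residuals are L2-normalised. The tiny codebook and descriptors below are made up, and refinements used in practice (for example intra-normalisation or PCA) are left out.

```python
import math


def nearest(vec, centroids):
    """Index of the centroid closest to `vec` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])))


def vlad(descriptors, centroids):
    """Concatenate per-centroid residual sums and L2-normalise the result."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        k = nearest(d, centroids)
        for j in range(dim):
            residuals[k][j] += d[j] - centroids[k][j]
    flat = [v for r in residuals for v in r]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm else flat


# Hypothetical codebook of 2 centroids and 3 local conv descriptors (2-D).
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
code = vlad(descriptors, centroids)
print(len(code))  # number of centroids times descriptor dimension
```

The output length is the number of centroids times the descriptor dimension, independent of how many local features the image produced.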
Objects Recognition Retrieval
[Figure: Resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector]
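One plausible reading of "25K centroids, 25K-D vector" is a bag-of-words histogram: each local conv feature is assigned to its nearest centroid and the image is described by the assignment counts. The slide does not state this explicitly, so the sketch below is an assumption, with a toy codebook of 3 centroids instead of 25K.

```python
def nearest(vec, centroids):
    """Index of the centroid closest to `vec` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])))


def bow_histogram(descriptors, centroids):
    """Count how many local descriptors fall on each centroid."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest(d, centroids)] += 1
    return hist


# Hypothetical codebook (3 centroids) and 4 local descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0], [1.0, 9.0]]
hist = bow_histogram(descriptors, centroids)
print(hist)
```

With a large codebook most bins are zero, so such vectors are sparse.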
Objects Recognition Retrieval
Query Representation
Global Search (GS)
Local Search (LS)
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.
Objects Segmentation Farabet
Pyramid of three spatial scales
Objects Segmentation Farabet
The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Objects Recognition Retrieval
Query representation
Global Search (GS)
Local Search (LS)
One lecture organized in four parts
Local analysis for Proposals, Detection, Recognition, Segmentation
[Figure: example images labelled "person", "bag", "me", "my bag"]
Objects Segmentation
Slide credit: Eduard Fontdevila
Semantic segmentation: assign a category label to every pixel in an image
Objects Segmentation Farabet
Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013
Pyramid of three spatial scales
The same parameters are used in the three convnets:
$\theta_i = \theta_0$ = filter weights ($H_l$) and biases ($b_l$)
Non-linearity: tanh. Pooling: max
Upsampling and concatenation
Pixel-wise soft-max classifier
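A minimal sketch of the pipeline on the last four slides (pyramid, shared parameters, upsampling and concatenation, pixel-wise softmax), assuming a single 3x3 filter per scale and a crude strided pyramid. Max pooling is left out for brevity, and every function name, the kernel size and the class count are illustrative choices, not taken from the paper.

```python
import numpy as np

def shared_layer(x, kernel, bias):
    """One convolution + tanh layer with edge padding; x has shape (H, W)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return np.tanh(out + bias)

def downsample(x, factor):
    """Strided subsampling (a crude stand-in for a proper image pyramid)."""
    return x[::factor, ::factor]

def upsample(x, factor, shape):
    """Nearest-neighbour upsampling, cropped back to the original shape."""
    up = np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)
    return up[:shape[0], :shape[1]]

def multiscale_features(image, kernel, bias, factors=(1, 2, 4)):
    """Apply the SAME kernel and bias at every scale, then upsample and stack."""
    maps = []
    for f in factors:
        feat = shared_layer(downsample(image, f), kernel, bias)
        maps.append(upsample(feat, f, image.shape))
    return np.stack(maps, axis=-1)  # shape (H, W, number of scales)

def pixelwise_softmax(features, weights):
    """Linear classifier followed by a softmax at every pixel."""
    logits = features @ weights  # shape (H, W, number of classes)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))
features = multiscale_features(image, kernel, bias=0.1)
probs = pixelwise_softmax(features, rng.standard_normal((3, 4)))
labels = probs.argmax(axis=-1)  # one class index per pixel
```

Each pixel is classified independently here, which is why the resulting label map has no built-in spatial consistency.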
Problem: no spatial consistency among labels
Three explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Solution 1: Superpixels
Prediction with a 2-layer network
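A small sketch of how superpixels can enforce consistency: the per-pixel class distributions are averaged inside each superpixel and every pixel of the superpixel receives the winning class. The slide does not spell out the aggregation rule, so the averaging is an assumption, and `label_superpixels` and the toy arrays are hypothetical.

```python
import numpy as np

def label_superpixels(pixel_probs, superpixels):
    """pixel_probs: (H, W, C) class distributions; superpixels: (H, W) integer ids."""
    labels = np.zeros(superpixels.shape, dtype=int)
    for sp_id in np.unique(superpixels):
        mask = superpixels == sp_id
        mean_dist = pixel_probs[mask].mean(axis=0)  # average distribution, shape (C,)
        labels[mask] = mean_dist.argmax()           # same label for the whole superpixel
    return labels

# Toy 2x3 image with two classes and two superpixels (made-up numbers)
pixel_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]],
])
superpixels = np.array([
    [0, 0, 1],
    [0, 0, 1],
])
labels = label_superpixels(pixel_probs, superpixels)
```

The pixel at row 0, column 1 would be assigned class 1 on its own, but it is relabelled as class 0 because its superpixel votes for class 0 on average.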
Solution 2: Superpixels + CRF
Prediction with a 2-layer network
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
Binary Partition Tree (BPT) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
Example: C2 will be labelled with the class of C5
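The selection rule can be sketched as a walk from each leaf up to the root, keeping the component with the lowest cost. The tree layout, cost values and class names below are made up for illustration (they are not the figure from the slide), and `optimal_cover` is a hypothetical helper.

```python
def optimal_cover(parent, cost, node_class, leaves):
    """For each leaf, return (best component, its class) along the leaf-to-root path."""
    result = {}
    for leaf in leaves:
        best = leaf
        node = leaf
        while node is not None:
            if cost[node] < cost[best]:  # strict: ties keep the smaller component
                best = node
            node = parent[node]
        result[leaf] = (best, node_class[best])
    return result

# Hypothetical segmentation tree: C5 = C1 + C2, C6 = C3 + C4, C7 = root
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.3,
        "C5": 0.3, "C6": 0.5, "C7": 0.9}
node_class = {"C1": "sky", "C2": "tree", "C3": "road", "C4": "road",
              "C5": "sky", "C6": "car", "C7": "building"}

cover = optimal_cover(parent, cost, node_class, ["C1", "C2", "C3", "C4"])
```

With these made-up costs, C2 is covered by its parent C5 and takes the class of C5, while C1, C3 and C4 are their own best components.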
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates:
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
vector A: concatenation [1 2] of two feature vectors from the same BBOX CNN
feature vector 1: computed on the bounding box
feature vector 2: computed on the region foreground, with the background masked out with the mean image
The network is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
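A toy sketch of the masking and concatenation step: pixels outside the region are replaced by the mean image, and the box and region feature vectors are concatenated. `toy_cnn` is a hypothetical stand-in for the BBOX CNN (it only averages each channel), and all array values are made up.

```python
import numpy as np

def crop(image, box):
    """Crop (y0, x0, y1, x1) from an (H, W, 3) image."""
    y0, x0, y1, x1 = box
    return image[y0:y1, x0:x1]

def mask_background(image, region_mask, mean_image):
    """Keep region pixels, replace every other pixel with the mean image."""
    return np.where(region_mask[..., None], image, mean_image)

def toy_cnn(patch):
    """Hypothetical feature extractor: per-channel mean of the patch."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)

image = np.full((4, 4, 3), 200.0)
mean_image = np.full((4, 4, 3), 100.0)
region_mask = np.zeros((4, 4), dtype=bool)
region_mask[1:3, 1:3] = True  # 2x2 foreground region
box = (0, 0, 4, 4)

box_features = toy_cnn(crop(image, box))  # feature vector 1
masked = mask_background(image, region_mask, mean_image)
region_features = toy_cnn(crop(masked, box))  # feature vector 2
vector_a = np.concatenate([box_features, region_features])  # [1 2]
```

Filling the background with the mean image means those pixels become zero after the usual mean subtraction, so only the region foreground contributes to the second feature vector.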
vector B: concatenation [1 2] of feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN
Training: the 2 networks are trained in isolation
Testing: results are combined
vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
The output of the penultimate fully connected layer is fed to an SVM
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
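For reference, pixel IU for one class is the number of pixels where prediction and ground truth both have that class, divided by the number of pixels where either has it. A minimal sketch with made-up label maps (`pixel_iu` is a hypothetical helper, and classes absent from both maps are skipped by assumption):

```python
import numpy as np

def pixel_iu(pred, truth, num_classes):
    """Per-class intersection over union (Jaccard index) of two label maps."""
    ious = {}
    for c in range(num_classes):
        p = pred == c
        t = truth == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skipped
        ious[c] = np.logical_and(p, t).sum() / union
    return ious

pred = np.array([[0, 0, 1],
                 [0, 1, 1]])
truth = np.array([[0, 0, 1],
                  [1, 1, 1]])

per_class = pixel_iu(pred, truth, num_classes=2)
mean_iu = sum(per_class.values()) / len(per_class)
```

In this toy case class 0 scores 2/3 and class 1 scores 3/4, so the mean IU is about 0.708.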
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
98
Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
99
Pyramid of three spatial scales
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
100
The same parameters in the three convnets
theta_i=theta_0=filters weights (H_l) and biases b_l)
Non-linear tanhPooling max
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
101
Upsampling and concatenation
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
102
Pixel-wise soft-max classifier
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
103
Problem No spatial consistency among labels
3 explored solutions
1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
104
Prediction with a 2-layer network
Solution 1 Superpixels
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
BPT [Garrido Salembier]
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
107
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
108
Solution 3 Multi-level parsing
Problems with Solutions 1 amp 2 Observation level
Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments not just bounding boxes
Multiscale combinational grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine
multiscale regions
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
BBOX CNNfeature vector
1
feature vector
2
[1 2]
Finetuned to classify bboxes (with background) so extracting features from the region foreground is
suboptimal
BBOX CNN
vector A
background masked out with the mean image
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Training 2 networks trained in isolation
Testing results are combined
BBOX CNNfeature vector
1
feature vector
2
[1 2]
REGION CNN
vector B
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Training as a whole (using segmentation overlap)
Testing results are combined (using the output of the penultimate layer)
vector C
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level
category labeling (using pasting scheme Carreira et al)
One lecture organized in four parts
Local analysis for: Proposals, Detection, Recognition, Segmentation
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto
Objects Segmentation Farabet
Pyramid of three spatial scales
The same parameters in the three convnets
theta_i = theta_0 = filter weights (H_l) and biases (b_l)
Non-linearity: tanh
Pooling: max
Upsampling and concatenation
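A minimal sketch of the upsampling and concatenation step, assuming each scale yields a grid of per-pixel feature vectors and that the coarser grids are brought to the finest resolution by nearest-neighbour repetition (the interpolation choice, the function names and the toy maps are assumptions).

```python
def upsample_nearest(fmap, factor):
    """Repeat every cell of an H x W grid of feature vectors factor x factor times."""
    out = []
    for row in fmap:
        wide = [vec for vec in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def concat_maps(maps):
    """Concatenate, pixel by pixel, the feature vectors of same-sized maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum((m[y][x] for m in maps), []) for x in range(width)]
        for y in range(height)
    ]


# Three scales with 1-D features: 4x4 (fine), 2x2 (mid), 1x1 (coarse).
fine = [[[y * 4 + x] for x in range(4)] for y in range(4)]
mid = [[[100], [101]], [[102], [103]]]
coarse = [[[200]]]
dense = concat_maps([fine, upsample_nearest(mid, 2), upsample_nearest(coarse, 4)])
print(dense[1][2])
```

Every pixel of the finest grid ends up with one descriptor that mixes local detail and wider context.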
Objects Segmentation Farabet
Pixel-wise softmax classifier
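A small sketch of pixel-wise softmax classification, assuming a grid in which every pixel already holds one raw score per class (how those scores are produced from the features is not shown; names and values are hypothetical).

```python
import math


def softmax(scores):
    """Numerically stable softmax of a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixels(score_map):
    """Return per-pixel class probabilities and the most probable class index."""
    probs = [[softmax(scores) for scores in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    return probs, labels


# One row of two pixels, three classes.
score_map = [[[2.0, 0.0, 0.0], [0.0, 0.0, 5.0]]]
probs, labels = classify_pixels(score_map)
print(labels)
```

Each pixel is classified independently of its neighbours, which is why the resulting labels can lack spatial consistency.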
Objects Segmentation Farabet
Problem: No spatial consistency among labels
3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 1: Superpixels
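One way to sketch the superpixel solution: accumulate the per-pixel class distributions inside each superpixel and give all its pixels the winning class. The inputs (a grid of class probabilities and a grid of superpixel ids) and the function name are assumptions for illustration.

```python
def superpixel_labels(pixel_probs, superpixels):
    """Assign one class per superpixel from the pixel-wise class distributions.

    pixel_probs: H x W grid of class-probability lists.
    superpixels: H x W grid of superpixel ids.
    """
    sums = {}
    for prob_row, sp_row in zip(pixel_probs, superpixels):
        for probs, sp in zip(prob_row, sp_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for c, p in enumerate(probs):
                acc[c] += p
    # The argmax of the sum equals the argmax of the average distribution.
    best = {sp: max(range(len(acc)), key=acc.__getitem__) for sp, acc in sums.items()}
    return [[best[sp] for sp in sp_row] for sp_row in superpixels]


# The middle pixel alone would be class 1, but its superpixel votes class 0.
pixel_probs = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
superpixels = [[0, 0, 1]]
smoothed = superpixel_labels(pixel_probs, superpixels)
print(smoothed)
```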
Objects Segmentation Farabet
Prediction with a 2-layer network
Solution 2: Superpixels + CRF
Objects Segmentation Farabet
Solution 3: Multi-level parsing
Problem with Solutions 1 & 2: the observation level
BPT (Binary Partition Tree) [Garrido, Salembier]
Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
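The selection rule can be sketched as a walk from each leaf up to the root that keeps the component with the lowest cost S. The tree below, its node names and its costs are made up for illustration and do not reproduce the figure on the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost component on its path to the root.

    parent: dict mapping every node to its parent (the root maps to None).
    cost: dict mapping every node to its cost S.
    Ties are resolved in favour of the component closest to the leaf.
    """
    chosen = {}
    for leaf in leaves:
        best, node = leaf, parent[leaf]
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        chosen[leaf] = best
    return chosen


# Hypothetical tree: leaves a and b merge into ab; ab and c merge into root.
parent = {"a": "ab", "b": "ab", "c": "root", "ab": "root", "root": None}
cost = {"a": 0.9, "b": 0.2, "c": 0.5, "ab": 0.4, "root": 0.7}
cover = optimal_cover(parent, cost, ["a", "b", "c"])
print(cover)
```

Each pixel is then labelled with the class predicted for its chosen component, so here leaf a would take the class of ab.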
Objects Segmentation SDS
Slide credit: Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation Farabet
105
Prediction with a 2-layer network
Solution 2 Superpixels + CRF
Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)
Objects Segmentation Farabet
106
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
BPT [Garrido, Salembier]
Objects Segmentation Farabet
107
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
Objects Segmentation Farabet
108
Solution 3: Multi-level parsing
Problems with Solutions 1 & 2: Observation level
Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image
C2 will be labelled with the class of C5
For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
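The optimal-cover rule above can be sketched in a few lines. This is only an illustration under assumptions: the tree, the node names (`p1`, `A`, `root`, ...) and the cost values are hypothetical and do not correspond to the components C1..C5 drawn on the slide. Ties are resolved in favour of the node closest to the leaf, which is also an assumption.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node with minimal cost on the leaf-to-root path."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            # Strict comparison: on ties, keep the node closest to the leaf.
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root maps to None, which ends the walk
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: three pixels (leaves) under two components.
parent = {"p1": "A", "p2": "A", "p3": "B", "A": "root", "B": "root", "root": None}
# Hypothetical cost S per node (lower means a better observation level).
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "A": 0.3, "B": 0.6, "root": 0.5}

cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
print(cover)  # {'p1': 'A', 'p2': 'A', 'p3': 'p3'}
```

Each pixel then takes the class predicted for its selected component, so pixels `p1` and `p2` would both be labelled with the class of `A` in this toy tree.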
Objects Segmentation SDS
109
Slide credit Eduard Fontdevila
Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)
Objects Segmentation SDS
110
Slide credit Eduard Fontdevila
Interest in obtaining segments, not just bounding boxes
Multiscale Combinatorial Grouping (MCG) to generate object candidates
Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions
Objects Segmentation SDS
111
Slide credit Eduard Fontdevila
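Vector A
Feature vector 1: BBOX CNN applied to the bounding box
Feature vector 2: BBOX CNN applied to the region foreground (background masked out with the mean image)
Concatenated feature vector: [1 2]
The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal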
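A minimal sketch of the vector A construction may help here. It makes several assumptions: the crop is a tiny single-channel image stored as nested lists, the mean image is reduced to one scalar `mean_value`, and `toy_features` is a hypothetical stand-in for the CNN (it only returns the mean and the maximum pixel value).

```python
def mask_background(crop, mask, mean_value):
    """Replace every pixel outside the region mask with the mean value."""
    return [
        [pix if inside else mean_value for pix, inside in zip(row, mask_row)]
        for row, mask_row in zip(crop, mask)
    ]


def toy_features(image):
    """Hypothetical stand-in for a CNN: returns [mean, max] of the pixels."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]


crop = [[10, 200], [30, 40]]          # bounding-box crop (toy values)
mask = [[True, False], [True, True]]  # region foreground inside the box
mean_value = 100                      # assumed scalar "mean image"

region_input = mask_background(crop, mask, mean_value)
feature_1 = toy_features(crop)          # features of the box
feature_2 = toy_features(region_input)  # features of the masked region
feature_a = feature_1 + feature_2       # concatenation [1 2]
```

The same extractor is used for both inputs here, which mirrors the point of the slide: a network tuned on boxes with background is reused on masked regions.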
Objects Segmentation SDS
112
Slide credit Eduard Fontdevila
Vector B
Training: 2 networks trained in isolation
Testing: results are combined
Feature vector 1: BBOX CNN
Feature vector 2: REGION CNN
Concatenated feature vector: [1 2]
Objects Segmentation SDS
113
Slide credit Eduard Fontdevila
Vector C
Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)
Objects Segmentation SDS
114
Slide credit Eduard Fontdevila
penultimate fully connected layer
SVM
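As a rough illustration of how an SVM sits on top of the penultimate-layer features, the sketch below scores one feature vector with one linear classifier per category and keeps the best one. The category names, weights, biases and feature values are all hypothetical, and the classifiers are assumed to be linear and already trained.

```python
def svm_score(weights, bias, features):
    """Decision value of a linear SVM: w . x + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Hypothetical, already-trained linear classifiers (one per category).
classifiers = {
    "person": ([0.5, -0.2, 0.1], -0.1),
    "bag": ([-0.3, 0.4, 0.2], 0.0),
}

# Hypothetical feature vector taken from the penultimate fully connected layer.
features = [1.0, 0.5, 2.0]

scores = {name: svm_score(w, b, features) for name, (w, b) in classifiers.items()}
best = max(scores, key=scores.get)
print(best)  # person
```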
Objects Segmentation SDS
115
Slide credit Eduard Fontdevila
Objects Segmentation SDS
116
Slide credit Eduard Fontdevila
Results on pixel IU (Jaccard index) to evaluate semantic segmentation
Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
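The two steps above can be sketched as follows. The pasting function is a simplified assumption rather than the exact scheme of Carreira et al.: detections are pasted in decreasing score order and a pixel keeps the first label it receives. The detections, masks and ground truth are toy values, and label 0 is assumed to mean background.

```python
def paste_regions(detections, height, width, background=0):
    """Paste (score, category, mask) detections into one label map."""
    labels = [[background] * width for _ in range(height)]
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    for _score, category, mask in ordered:
        for y in range(height):
            for x in range(width):
                # Higher-scoring detections are never overwritten.
                if mask[y][x] and labels[y][x] == background:
                    labels[y][x] = category
    return labels


def pixel_iu(pred, truth, category):
    """Pixel intersection over union (Jaccard index) for one category."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else float("nan")


detections = [
    (0.6, 2, [[1, 1, 0], [0, 0, 0]]),
    (0.9, 1, [[0, 1, 1], [0, 1, 1]]),
]
labels = paste_regions(detections, height=2, width=3)
truth = [[2, 2, 1], [0, 1, 1]]

iu_cat1 = pixel_iu(labels, truth, 1)  # 3 shared pixels out of 4 in the union
iu_cat2 = pixel_iu(labels, truth, 2)  # 1 shared pixel out of 2 in the union
```

Averaging `pixel_iu` over categories gives a mean IU figure of the kind usually reported for semantic segmentation.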
Objects Segmentation SDS
117
Slide credit Eduard Fontdevila
One lecture organized in four parts
118
Local analysis for: Proposals, Detection, Recognition, Segmentation
[Figure: example image labelled "person", "bag", "me" and "my bag"]
119
Thank you
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu
Xavier Giró-i-Nieto