Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016) @DocXavi Deep Learning for Computer Vision Object Analytics 5 May 2016 Xavier Giró-i-Nieto Master en Creació Multimedia

Upload: xavier-giro

Post on 07-Jan-2017


TRANSCRIPT

Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016) @DocXavi

Deep Learning for Computer Vision: Object Analytics. 5 May 2016

Xavier Giró-i-Nieto

Master en Creació Multimedia

One lecture organized in three parts

Deep ConvNets for Recognition for: Images (global), Objects (local), Video (2D+T)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]


Proposals Hand-crafted

Slides credit: Marc Bolaños

Hand-crafted proposal methods used to rely on bottom-up grouping: Selective Search (SS) and Multiscale Combinatorial Grouping (MCG).

[SS] Uijlings, Jasper R.R., Koen E.A. van de Sande, Theo Gevers, and Arnold W.M. Smeulders. Selective search for object recognition. International Journal of Computer Vision 104, no. 2 (2013): 154-171.

[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. CVPR 2014.

Proposals DeepBox

Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. DeepBox: Learning objectness with convolutional networks. ICCV 2015. [software]

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals.
2) Rerank them (and possibly discard them) by using DeepBox.
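The reranking step can be sketched in a few lines. This is only an illustration: the function name `rerank_proposals`, the `keep_top` parameter and the toy scores are assumptions, and the scores here stand in for whatever objectness values the network would output.

```python
import numpy as np

def rerank_proposals(boxes, scores, keep_top=None):
    """Sort proposals by descending objectness score; optionally keep only the best ones."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    if keep_top is not None:
        order = order[:keep_top]                # discard low-scoring proposals
    return boxes[order], scores[order]

# Toy proposals (x1, y1, x2, y2) with hypothetical objectness scores.
boxes = [[0, 0, 10, 10], [5, 5, 20, 20], [2, 2, 8, 8]]
scores = [0.2, 0.9, 0.5]
top_boxes, top_scores = rerank_proposals(boxes, scores, keep_top=2)
```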

Proposals DeepBox Architecture

Slides credit: Marc Bolaños

Results on PASCAL VOC:
AlexNet architecture (heavier): AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7

Small drop

Proposals DeepBox Training

Slides credit: Marc Bolaños

1) Initialize layers with AlexNet weights.
2) Train on sliding windows.
   Negative samples: extract windows by raster scanning.
   Positive samples: having GT bounding boxes, they generate perturbed samples per instance.
3) Train on hard negatives, by using bottom-up proposals from Edge Boxes.
   If overlap with GT <= 0.3 → negative sample
   If overlap with GT >= 0.7 → positive sample
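The overlap rule above can be sketched as follows. The helper names `iou` and `label_window` are hypothetical, and treating windows between the two thresholds as "ignored" is an assumption, since the slide does not say what happens to them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_window(window, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a window by its best overlap with any ground-truth box."""
    best = max((iou(window, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: windows in between are not used
```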

Proposals DeepBox Results

Slides credit: Marc Bolaños

[Figure: DeepBox vs. Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.

They increase detection capabilities not only for known classes but also for unknown ones (suitable for object discovery).


Detection Objects

[Figure: DPM (HOG features) [1] uses hand-crafted features; R-CNN [2] and SPPnet [3] use deep features; annotation: +60]

Slide credit: Amaia Salvador

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.

ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. Fast R-CNN. ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3] but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.

Slide credit: Joost van de Weijer

Size of pooling bins: h/H' x w/W'

[Figure: max pooling over an h x w region of CONV5, with bins of height h/H' and width w/W']
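A minimal NumPy sketch of max pooling over a region with bins of roughly h/H' x w/W' cells. The function name, the `(y0, x0, h, w)` region format and the rounding of bin borders (floor for the start, ceiling for the end) are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of a (C, H, W) feature map into a fixed (C, out_h, out_w) grid.

    roi is (y0, x0, h, w) in feature-map cells; each bin spans about
    h/out_h rows and w/out_w columns of the region.
    """
    y0, x0, h, w = roi
    channels = feature_map.shape[0]
    out = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ys = y0 + (i * h) // out_h
        ye = y0 + ceil_div((i + 1) * h, out_h)
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + ceil_div((j + 1) * w, out_w)
            out[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out

fmap = np.arange(16).reshape(1, 4, 4)
pooled = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
```

Whatever the size of the region, the output grid has a fixed size, which is what lets the following fully-connected layers accept regions of any shape.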

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
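The slide only names the multi-task loss. The sketch below follows the form given in the Fast R-CNN paper as I understand it: a log loss on the true class plus, for non-background regions, a smooth L1 loss on the box offsets weighted by lambda. The function names, argument layout and the convention that class 0 is background are assumptions for this example.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss for non-background RoIs."""
    cls_loss = -np.log(class_probs[true_class])
    if true_class == 0:  # assumption: class 0 is background, no box loss
        return float(cls_loss)
    diff = np.asarray(box_pred, dtype=float) - np.asarray(box_target, dtype=float)
    loc_loss = smooth_l1(diff).sum()
    return float(cls_loss + lam * loc_loss)

loss_bg = multitask_loss(np.array([0.5, 0.5]), 0, [0, 0, 0, 0], [1, 1, 1, 1])
loss_fg = multitask_loss(np.array([0.1, 0.9]), 1, [0, 0, 0, 0], [0.5, 0, 0, 2])
```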

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the use of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).

[Figure: Faster R-CNN diagram with labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios).
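A sketch of how k = 9 reference boxes can be generated from 3 scales and 3 aspect ratios. The concrete scale values (128, 256, 512) and ratios (0.5, 1, 2) are the ones I recall from the Faster R-CNN paper and should be read as assumptions here, as should the function name and the choice to centre the boxes at the origin.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at the origin.

    Each box has area scale**2 and height/width equal to the ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
```

At every position of the feature map, the same set of boxes is shifted to that position, and the network predicts an objectness score and a box refinement for each of them.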

[Figure: the same diagram, with the label Fast R-CNN]

4-step training to share features for RPN and Fast R-CNN

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Weights from Steps 2 & 3 (fixed).

[Table: detection accuracy (Pascal VOC)]

[Table: timing in ms (Pascal VOC)]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit: Míriam Bellver

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)
- o = feature vector from pre-trained CNN (fc6), 4096 dim
- h = history of taken actions, binary vector, dim 90
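A sketch of how the state (o, h) could be assembled. The split of the 90 dimensions into 10 past actions times 9 possible actions is how I recall the paper describing it and is an assumption here, as are the constant and function names.

```python
import numpy as np

NUM_ACTIONS = 9    # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumption: 10 past actions, so 10 * 9 = 90 dims

def build_state(features, past_actions):
    """Concatenate the 4096-d feature vector with a one-hot history of recent actions."""
    history = np.zeros((HISTORY_LEN, NUM_ACTIONS), dtype=np.float32)
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        history[slot, action] = 1.0
    return np.concatenate([np.asarray(features, dtype=np.float32), history.ravel()])

state = build_state(np.zeros(4096), [0, 3])
```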

Reward function R (computed with respect to the ground-truth bounding box)

Reward function R for the trigger action (slide annotations: 3; minimum IoU 0.6)

The reward function considers the number of steps as a cost.
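The formulas themselves did not survive in the transcript. The sketch below uses the form I recall from the Caicedo and Lazebnik paper, so treat it as an assumption: a transformation action is rewarded with the sign of the change in IoU with the ground truth, and the trigger is rewarded with +3 if the final IoU reaches 0.6 and -3 otherwise. The function and parameter names are hypothetical.

```python
def move_reward(iou_before, iou_after):
    """Sign of the IoU change produced by a transformation action."""
    diff = iou_after - iou_before
    return (diff > 0) - (diff < 0)

def trigger_reward(iou_final, eta=3.0, tau=0.6):
    """Reward for stopping: +eta if the box overlaps the ground truth enough, else -eta."""
    return eta if iou_final >= tau else -eta
```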

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. Category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
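The pieces named above (replay memory, greedy policy over the network outputs, Q-learning target) can be sketched as follows. The Q-network itself is left out; the class and function names, the memory capacity and the discount `gamma` are assumptions for illustration.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target used to train the network on sampled experiences."""
    if terminal:
        return reward
    return reward + gamma * float(np.max(next_q_values))
```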

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise.

Total samples: 200K positive and 20M negative.

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
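"3 times per octave" means that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A small sketch, where the function name and the range of one octave down to one octave up are assumptions:

```python
def pyramid_scales(min_octave=-1, max_octave=1, steps_per_octave=3):
    """Scale factors from 2**min_octave to 2**max_octave, steps_per_octave per octave."""
    first = min_octave * steps_per_octave
    last = max_octave * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]

scales = pyramid_scales()
```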

Sliding window of 227x227 over the test image.

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
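A toy NumPy check of why the conversion works: a fully-connected layer applied to a K x K window gives the same numbers as a convolution whose kernels are the reshaped FC weights, evaluated at that window position. The sizes (3 channels, 4x4 window, 5 outputs) and all names are made up for the example; the real network uses 227x227 inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, OUT = 3, 4, 5  # toy sizes: channels, window side, FC output units
fc_weights = rng.standard_normal((OUT, C * K * K))
fc_bias = rng.standard_normal(OUT)

def fc_layer(patch):
    """Fully-connected layer on one flattened (C, K, K) patch."""
    return fc_weights @ patch.ravel() + fc_bias

def fc_as_conv(image):
    """Same weights reshaped into OUT kernels of size (C, K, K), slid over a larger image."""
    kernels = fc_weights.reshape(OUT, C, K, K)
    _, height, width = image.shape
    out = np.empty((OUT, height - K + 1, width - K + 1))
    for y in range(height - K + 1):
        for x in range(width - K + 1):
            window = image[:, y:y + K, x:x + K]
            out[:, y, x] = np.tensordot(kernels, window, axes=3) + fc_bias
    return out

image = rng.standard_normal((C, 6, 7))
heat = fc_as_conv(image)  # one output vector per window position
```

Each spatial position of `heat` is the FC output for the window at that position, which is how a single forward pass yields a score map over the whole image.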

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
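A minimal greedy NMS sketch in NumPy: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The IoU threshold of 0.5 and the function name are assumptions; this is not the code from the cited post, and boxes are assumed to have positive area.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)  # best detection first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Overlap of the kept box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]  # discard boxes that overlap too much
    return keep

kept = nms([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], [0.9, 0.8, 0.7])
```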

Detection Faces DDFD Results

Precision vs. Recall curves:
- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training


Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
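The triplet objective behind this embedding can be sketched as below: the anchor should be closer to a face of the same person (positive) than to a face of another person (negative) by at least a margin. The margin of 0.2 and the L2 normalisation are what I recall from the FaceNet paper and are assumptions here; triplet selection is not shown.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), per triplet."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

a = l2_normalize(np.array([2.0, 0.0]))
easy = triplet_loss(a, a, -a)  # negative is far away: no loss
hard = triplet_loss(a, a, a)   # negative as close as the positive: loss equals the margin
```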

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

[Figure: visual query “A dog” against an image database, with the expected outcome]

[Figure: visual query “This dog” against an image database, with the expected outcome]

Instance retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:
v1 = (v11, …, v1n)
vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE
[Table: for each visual word, the IDs of the images that contain it]
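A small sketch of an inverted file: for each visual word, store the set of images that contain it, so a query only touches images sharing at least one word. The toy database, the function names and the ranking by number of shared words are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def candidates(index, query_words):
    """Rank images by how many distinct query words they share (ties broken by ID)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))

# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 12: [1, 3], 30: [2]}
index = build_inverted_file(database)
ranking = candidates(index, [1, 2])
```

Because bag-of-words vectors are highly sparse, most words point to few images, which keeps these lookups cheap.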

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
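Sum or max pooling of a convolutional layer collapses the spatial grid into one value per channel. A minimal sketch, where the function name and the final L2 normalisation are assumptions (the cited papers add further steps such as weighting or whitening that are not shown):

```python
import numpy as np

def global_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) activation volume into an L2-normalised C-dimensional vector."""
    flat = conv_features.reshape(conv_features.shape[0], -1)
    if mode == "sum":
        pooled = flat.sum(axis=1)
    elif mode == "max":
        pooled = flat.max(axis=1)
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

feats = np.array([[[1.0, 2.0], [3.0, 4.0]],
                  [[0.0, 0.0], [0.0, 5.0]]])
sum_desc = global_descriptor(feats, "sum")
max_desc = global_descriptor(feats, "max")
```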

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

[Figure: image at resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector]

Query representation: Global Search (GS), Local Search (LS)


Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier

Problem: no spatial consistency among labels.

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

Example from the figure: C2 will be labelled with the class of C5.
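The selection rule can be sketched on a toy tree: walk from each leaf to the root and keep the component with the lowest cost. The tree, the cost values and the function name are hypothetical and chosen so that C2 ends up covered by C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost component on the path from the leaf to the root."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = choice
    return best

# Hypothetical segmentation tree: leaves C1..C4, internal nodes C5 and C6, root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.20, "C2": 0.50, "C3": 0.30, "C4": 0.10,
        "C5": 0.25, "C6": 0.40, "C7": 0.60}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component.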

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: [Figure labels: BBOX CNN → feature vector 1; BBOX CNN on the region with the background masked out with the mean image → feature vector 2; concatenation [1 2]]

Fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Training: 2 networks trained in isolation.

Testing: results are combined.

[Figure: the BBOX CNN produces feature vector 1 and the REGION CNN produces feature vector 2; the concatenation [1 2] is vector B.]

Training: as a whole (using segmentation overlap).

Testing: results are combined (using the output of the penultimate layer).

[Figure: vector C]

[Figure: penultimate fully connected layer, followed by an SVM]

Results on pixel IU (Jaccard index) to evaluate semantic segmentation: convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
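As a reminder of what the pixel IU (Jaccard index) measures, here is a minimal sketch for one category. The label maps are hypothetical toy data, and the function name is chosen for illustration only.

```python
def pixel_iu(pred, gt, category):
    """Intersection over union of the pixels labelled `category` in two label maps."""
    inter = union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt   # pixel in both maps
            union += in_pred or in_gt    # pixel in at least one map
    return inter / union if union else float("nan")

# Hypothetical 2x3 label maps (1 = object category, 0 = background).
pred = [[1, 1, 0],
        [1, 0, 0]]
gt = [[1, 1, 1],
      [0, 0, 0]]
score = pixel_iu(pred, gt, 1)
```

Benchmarks usually average this score over all categories.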


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Proposals DeepBox

Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. "DeepBox: Learning objectness with convolutional networks." ICCV 2015. [software]

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals.
2) Rerank them (and possibly discard them) by using DeepBox.

Proposals DeepBox Architecture

AlexNet architecture (heavier), PASCAL VOC: AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7

DeepBox architecture (lighter), PASCAL VOC: AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7

Small drop

Proposals DeepBox Training

1) Initialize layers with AlexNet weights.

2) Train on Sliding Windows.
- Negative Samples: extract windows by raster scanning.
- Positive Samples: having GT bounding boxes, they generate perturbed samples per instance.

3) Train on Hard Negatives, by using bottom-up proposals from Edge Boxes.
- If GT overlap <= 0.3 → Negative Samples
- If GT overlap >= 0.7 → Positive Samples
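The overlap rule in step 3 can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples, the function names are hypothetical, and calling the in-between case "ignored" is an assumption of this sketch rather than something stated on the slide.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(box, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal from its best overlap with any ground-truth box."""
    best = max((iou(box, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: proposals between the thresholds are not used

gt = [(0, 0, 10, 10)]  # hypothetical ground-truth box
labels = [label_proposal(b, gt) for b in [(0, 0, 10, 9), (8, 8, 20, 20), (0, 0, 10, 5)]]
```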

Proposals DeepBox Results

[Figure: DeepBox vs. Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for Object Proposals.

DeepBox improves detection not only for known classes but also for unknown ones (suitable for Object Discovery).

Detection Objects

[Figure: from hand-crafted features (DPM, HOG features [1]) to deep features (R-CNN [2], SPPnet [3]); annotation: +60]

Slide credit: Amaia Salvador

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T. & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3] but single scale.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

[Figure: an h × w region of interest on the CONV5 feature map is max-pooled into an H' × W' grid.]

Size of pooling bins: h/H' × w/W'
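A minimal sketch of this pooling on a single-channel feature map follows. The RoI is assumed to be given as (top, left, h, w) in feature-map coordinates, and rounding the bin borders with floor and ceil is a choice made for this example, not necessarily the one used in the original implementation.

```python
import math

def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the RoI (top, left, h, w) of a 2-D feature map into out_h x out_w bins."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Each bin spans roughly h/out_h rows and w/out_w columns.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Hypothetical 6x6 feature map with increasing values.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(fmap, (1, 1, 4, 4), 2, 2)
```

Whatever the size of the RoI, the output always has out_h x out_w values, which is what lets the FC layers that follow work with a fixed input size.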

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
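The slide only names the multi-task loss. As a hedged illustration, the sketch below follows the formulation of the Fast R-CNN paper as understood here: a log loss on the true class plus a smooth-L1 box regression loss that is only counted for non-background classes. The function names and the example values are hypothetical.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss (class 0 = background)."""
    cls_loss = -math.log(class_probs[true_class])
    loc_loss = 0.0
    if true_class >= 1:  # no box regression for background RoIs
        loc_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred, box_target))
    return cls_loss + lam * loc_loss

background = multi_task_loss([0.5, 0.5], 0, [0, 0, 0, 0], [1, 1, 1, 1])
foreground = multi_task_loss([0.5, 0.5], 1, [0, 0, 0, 0], [0.5, 0, 0, 2.0])
```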

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R. and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object Proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T. & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J. & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F. & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external Object Proposals with a Region Proposal Network (RPN).

[Figure: Faster R-CNN architecture, with blocks labelled conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, class scores and class probabilities]

Objectness scores (object / no object)

Bounding Box Regression

In practice, k = 9 anchors (3 different scales and 3 aspect ratios).
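The k = 9 anchors can be enumerated as below. The concrete scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumptions taken from the Faster R-CNN paper as recalled here, not from the slide, and the function name is hypothetical.

```python
import math

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for every scale and aspect ratio (ratio = height / width)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s*s while changing the aspect ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((w, h))
    return anchors

anchors = make_anchors()
```

At every sliding position the RPN then predicts one objectness score pair and one box regression per anchor.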

[Figure: Faster R-CNN architecture, with the Fast R-CNN part highlighted]

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit: Míriam Bellver

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)
- o = feature vector from pre-trained CNN (fc6, 4096-dim)
- h = history of taken actions (binary vector, dim 90)

Reward function R: [Figure: equation defined with respect to the ground-truth bounding box]

Reward function R for the trigger action: [Figure: equation annotated with the value 3 and a minimum IoU of 0.6]

The reward function considers the number of steps as a cost.
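The reward equations did not survive in the transcript. The sketch below is a reconstruction under assumptions, following the cited Caicedo and Lazebnik paper as understood here: a transformation earns +1 if it improves the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 if the final IoU reaches 0.6 and -3 otherwise. The function names are hypothetical.

```python
def move_reward(iou_before, iou_after):
    """+1 if the transformation improved the overlap with the ground truth, else -1."""
    return 1 if iou_after > iou_before else -1

def trigger_reward(iou_final, eta=3.0, min_iou=0.6):
    """+eta if the box overlaps the ground truth enough when the search stops, else -eta."""
    return eta if iou_final >= min_iou else -eta

rewards = [move_reward(0.3, 0.5), move_reward(0.5, 0.5),
           trigger_reward(0.7), trigger_reward(0.4)]
```

Because every step that fails to improve the overlap is penalized, long searches accumulate less reward, which is one way to read the remark that the number of steps acts as a cost.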

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and the Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
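The pieces named above (greedy policy, replay memory, Q-learning target) can be sketched without any network. All names are hypothetical, the Q-values would come from the Q-network in practice, and the discount factor of 0.9 is an assumption.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def greedy_action(q_values):
    """Index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_target(reward, next_q_values, gamma=0.9):
    """Q-learning regression target for one experience."""
    return reward + gamma * max(next_q_values)

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(("state", step), 0, 1.0, ("state", step + 1))
best = greedy_action([0.1, 0.9, 0.3])
```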

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals.

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative sample otherwise.

Total samples: 200K positive and 20M negative.

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
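One way to read "3 times per octave" is that consecutive scales differ by a factor of 2^(1/3). The sketch below enumerates such scale factors; this interpretation, the function name and the scale range are assumptions of the example.

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) that fall within [min_scale, max_scale]."""
    k_min = math.ceil(steps_per_octave * math.log2(min_scale))
    k_max = math.floor(steps_per_octave * math.log2(max_scale))
    return [2 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]

# Hypothetical range: from half size to double size.
scales = pyramid_scales(0.5, 2.0)
```

The detector is then run on the image resized by each factor, so a fixed-size window can match faces of different sizes.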

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
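The conversion can be illustrated on a toy single-channel case: a fully connected unit defined on a k x k window is the same computation as a k x k convolution filter, so sliding it over a larger image gives one response per window position. The weights and the image are hypothetical.

```python
def fc(window, weights, bias):
    """Fully connected unit on a k x k window (dot product of the flattened inputs)."""
    return bias + sum(w * x for w_row, x_row in zip(weights, window)
                      for w, x in zip(w_row, x_row))

def fc_as_conv(image, weights, bias):
    """Apply the same weights at every position of an image of any size."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[fc([row[c:c + k] for row in image[r:r + k]], weights, bias)
             for c in range(cols)] for r in range(rows)]

weights, bias = [[1, 0], [0, -1]], 0.5    # hypothetical 2x2 "FC" weights
image = [[1, 2, 0], [4, 5, 6], [7, 1, 9]]  # larger than the 2x2 training window
heatmap = fc_as_conv(image, weights, bias)
```

Each entry of the heat-map equals the output the original FC unit would give on the corresponding crop, but shared computation makes the dense evaluation cheaper in a real network.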

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
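A minimal greedy NMS sketch follows. It is not the implementation from the cited tutorial: boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.5 overlap threshold is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]  # hypothetical detections
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```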

Detection Faces DDFD Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.


Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th annual international conference on machine learning, pp. 41-48. ACM, 2009.
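The triplet idea can be sketched as a loss on three embeddings: the anchor should be closer to a positive (same identity) than to a negative (different identity) by some margin. The squared Euclidean distance and the margin of 0.2 follow the FaceNet paper as recalled here and should be read as assumptions. The embeddings are hypothetical toy vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

easy = triplet_loss([0, 0], [0, 1], [3, 4])  # negative already far away
hard = triplet_loss([0, 0], [1, 0], [1, 0])  # negative as close as the positive
```

Easy triplets contribute no gradient, which is why the choice of triplets matters during training.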

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

[Figure: image database, visual query "A dog", expected outcome]

[Figure: image database, visual query "This dog", expected outcome]

Instance Retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

[Table: inverted file with columns "word" and "Image ID"]
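Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the images that contain it. The sketch below uses hypothetical word and image IDs and a simple vote count as the ranking score, which is a simplification of real systems (these typically weight words, for example with tf-idf).

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def query(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

# Hypothetical database: image ID -> visual words found in the image.
image_words = {1: [3, 7, 7], 2: [7, 9], 3: [4]}
index = build_inverted_file(image_words)
ranking = query(index, [7, 3])
```

Only the images that share at least one word with the query are ever touched, which is what makes the search scale.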

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Sum/max pooled conv features as global representation

Babenko, A. & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R. & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C. & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
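Sum or max pooling turns the C feature maps of a convolutional layer into a single C-dimensional descriptor, one value per channel. A minimal sketch with hypothetical 2x2 feature maps:

```python
def pool_conv_features(fmaps, mode="sum"):
    """fmaps: list of C feature maps (each H x W). Returns a C-dimensional descriptor."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in fmap for v in row) for fmap in fmaps]

# Hypothetical activations of a layer with C = 2 channels.
fmaps = [[[1, 2], [3, 4]],
         [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(fmaps, "sum")
max_descriptor = pool_conv_features(fmaps, "max")
```

The cited methods differ in details such as weighting, normalization and pooling regions, none of which are shown here.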

Conv features encoded with VLAD as global representation

Ng, J., Yang, F. & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

[Figure: image at resolution (336x256); conv5_1 from VGG16 [1] of size (42x32); 25K centroids; 25K-D vector]

Query representation:
- Global Search (GS)
- Local Search (LS)


Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales.

The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l).

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation.

Pixel-wise soft-max classifier.
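The last two steps (upsampling and concatenation, then a pixel-wise soft-max) can be sketched on toy data. Nearest-neighbour upsampling and the linear per-pixel classifier are assumptions made to keep the example short. All names, feature values and weights are hypothetical.

```python
import math

def upsample(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D map by an integer factor."""
    return [[v for v in row for _ in range(factor)] for row in fmap for _ in range(factor)]

def concat_scales(maps_per_scale, factors):
    """Bring every scale to the same resolution and stack the values per pixel."""
    ups = [upsample(m, f) for m, f in zip(maps_per_scale, factors)]
    h, w = len(ups[0]), len(ups[0][0])
    return [[[u[r][c] for u in ups] for c in range(w)] for r in range(h)]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pixels(features, weights):
    """weights: one weight vector per class (hypothetical linear classifier)."""
    return [[softmax([sum(wi * fi for wi, fi in zip(w, f)) for w in weights])
             for f in row] for row in features]

fine = [[1, 2], [3, 4]]   # hypothetical feature map at full resolution
coarse = [[10]]           # hypothetical feature map at half resolution
features = concat_scales([fine, coarse], [1, 2])
probs = classify_pixels(features, [[1, 0], [0, 1]])
```

Each pixel gets its own class distribution, independently of its neighbours.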

Problem: no spatial consistency among labels.

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S
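The optimal-cover rule above can be sketched on a toy segmentation tree. This is only an illustration: the tree, the costs and the class labels below are made-up values, and `optimal_cover` is a hypothetical helper name, not code from Farabet et al.

```python
def optimal_cover(parent, cost, label, leaves):
    """For every leaf, walk up to the root and keep the node with minimal cost."""
    assignment = {}
    for leaf in leaves:
        best = leaf
        node = leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent.get(node)  # None once the root has been visited
        assignment[leaf] = (best, label[best])
    return assignment


# Hypothetical tree: C7 is the root, C5 and C6 are intermediate components.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
# Hypothetical costs S (lower means a purer class distribution).
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.4, "C4": 0.5,
        "C5": 0.3, "C6": 0.7, "C7": 0.9}
# Hypothetical predicted class of each component.
label = {"C1": "sky", "C2": "tree", "C3": "road", "C4": "car",
         "C5": "building", "C6": "road", "C7": "scene"}

cover = optimal_cover(parent, cost, label, ["C1", "C2", "C3", "C4"])
```

In this toy setup the leaf C2 ends up labelled with the class of C5, because C5 has the lowest cost on the path from C2 to the root.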

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014

Objects Segmentation SDS

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Network A (vector A): a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2]

The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

Network B (vector B): a BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; the two are concatenated as [1 2]

Training: the 2 networks are trained in isolation

Testing: results are combined

Objects Segmentation SDS

Network C (vector C)

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Penultimate fully connected layer → SVM

Objects Segmentation SDS

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
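As a reminder of what the pixel IU (Jaccard index) measures, here is a minimal sketch for one class on two small label maps. The function name and the toy label maps are assumptions for illustration only; they are not taken from the SDS evaluation code.

```python
def pixel_iu(pred, truth, cls):
    """Intersection over union of the pixels labelled `cls` in two label maps."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == cls
            in_truth = t == cls
            inter += int(in_pred and in_truth)
            union += int(in_pred or in_truth)
    # The score is undefined when the class appears in neither map.
    return inter / union if union else float("nan")


# Hypothetical 2x3 label maps (0 = background, 1 = object class).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = pixel_iu(pred, truth, cls=1)
```

Benchmarks usually average this score over all classes; that averaging step is left out here.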

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 3: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]

Proposals Hand-crafted

Slides credit: Marc Bolaños

Hand-crafted proposals used to be based on bottom-up approaches:

Selective Search (SS), Multiscale Combinatorial Grouping (MCG)

[SS] Uijlings, Jasper R. R., Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. "Selective search for object recognition." International Journal of Computer Vision 104, no. 2 (2013): 154-171

[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. "Multiscale combinatorial grouping." CVPR 2014

Proposals DeepBox

Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. "DeepBox: Learning objectness with convolutional networks." ICCV 2015 [software]

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox

Proposals DeepBox Architecture

Slides credit: Marc Bolaños

AlexNet architecture (heavier): PASCAL VOC, AUC = 0.75 (IoU = 0.5), AUC = 0.62 (IoU = 0.7)

DeepBox architecture (lighter): PASCAL VOC, AUC = 0.74 (IoU = 0.5), AUC = 0.60 (IoU = 0.7)

Small drop

Proposals DeepBox Training

Slides credit: Marc Bolaños

1) Initialize layers with AlexNet weights

2) Train on Sliding Windows
Negative samples: extract windows by raster scanning
Positive samples: having GT bounding boxes, they generate samples per instance with a perturbation of …

3) Train on Hard Negatives
By using bottom-up proposals from Edge Boxes:
If GT overlap ≤ 0.3 → negative samples
If GT overlap ≥ 0.7 → positive samples
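The two overlap thresholds can be turned into a small labelling routine. The sketch below assumes that "GT overlap" means intersection over union (IoU) with the best-matching ground-truth box and that proposals falling between the two thresholds are left out of training; both points, like the function names, are assumptions rather than details stated on the slide.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_proposal(box, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal from its best overlap with any ground-truth box."""
    best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between proposals are not used


gt_boxes = [(0, 0, 10, 10)]  # hypothetical ground-truth box
labels = [label_proposal(b, gt_boxes)
          for b in [(0, 0, 10, 9), (8, 8, 18, 18), (0, 0, 10, 5)]]
```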

Proposals DeepBox Results

Slides credit: Marc Bolaños

[Figure: example proposals, DeepBox vs. Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals

Increasing not only detection capabilities of known classes, but also of unknown ones (suitable for Object Discovery)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]

Detection Objects

Hand-crafted features: DPM (HOG features) [1]

Deep features: R-CNN [2], SPPnet [3]

+60

Slide credit: Amaia Salvador

Detection Objects

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015

ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs)

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015

Slide credit: Amaia Salvador

Same as SPP [3], but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

RoI pooling on CONV5: a region of interest of size h x w is divided into a grid of H' x W' pooling bins

Size of pooling bins: h/H' x w/W'

Max pooling is applied within each bin
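RoI max pooling divides a region of size h x w on the CONV5 feature map into H' x W' bins of roughly h/H' x w/W' cells each and keeps the maximum of every bin. The sketch below works on a single-channel feature map stored as nested lists; the bin-rounding rule (floor for the start, ceil for the end) is one common choice and an assumption here, not necessarily the one used in the reference implementation.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region `roi` = (row0, col0, row1, col1), end-exclusive,
    into a fixed out_h x out_w grid."""
    row0, col0, row1, col1 = roi
    h, w = row1 - row0, col1 - col0
    pooled = []
    for i in range(out_h):
        r_start = row0 + math.floor(i * h / out_h)
        r_end = row0 + math.ceil((i + 1) * h / out_h)
        pooled_row = []
        for j in range(out_w):
            c_start = col0 + math.floor(j * w / out_w)
            c_end = col0 + math.ceil((j + 1) * w / out_w)
            pooled_row.append(max(feature_map[r][c]
                                  for r in range(r_start, r_end)
                                  for c in range(c_start, c_end)))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 single-channel feature map.
fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
pooled = roi_max_pool(fmap, roi=(0, 0, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the input region, the output always has H' x W' values, which is what lets the fully connected layers that follow accept proposals of any size.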

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Detection Objects Fast R-CNN

Multi-task loss
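For reference, the Fast R-CNN paper defines the multi-task loss roughly as follows. The notation comes from the paper, not from the slide: $p$ is the predicted class distribution, $u$ the ground-truth class ($u = 0$ for background), $t^u$ the predicted box offsets for class $u$, $v$ the regression targets and $\lambda$ a balancing weight.

```latex
\begin{align}
L(p, u, t^u, v) &= L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^u, v) \\
L_{\mathrm{cls}}(p, u) &= -\log p_u \\
L_{\mathrm{loc}}(t^u, v) &= \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i) \\
\mathrm{smooth}_{L_1}(x) &=
  \begin{cases}
    0.5\,x^2 & \text{if } |x| < 1 \\
    |x| - 0.5 & \text{otherwise}
  \end{cases}
\end{align}
```

The indicator $[u \geq 1]$ switches the localization term off for background regions, which have no ground-truth box.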

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99) [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation is the bottleneck in current state-of-the-art object detection systems

Proposal methods: Selective Search, CPMC, MCG

[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). "Segmentation as selective search for object recognition." In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE
[CPMC] Carreira, J., & Sminchisescu, C. (2010, June). "Constrained parametric min-cuts for automatic object segmentation." In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE
[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). "Multiscale combinatorial grouping." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)

Replace the usage of external object proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN pipeline, with the blocks "Conv layers", "Conv Layer 5", "RPN", "RPN Proposals", "RoI pooling layer", "FC layers", "Class scores" and "Class probabilities"]

RPN outputs: objectness scores (object / no object) and bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
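The k = 9 reference boxes (anchors) at one sliding-window position can be enumerated as below. The scale values (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults recalled from the Faster R-CNN paper and are assumptions here, since the slide only states the counts; `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy).

    Each box keeps an area of scale**2; `ratio` is height divided by width.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=0.0, cy=0.0)
```

For every anchor the RPN predicts an objectness score pair and four box-regression values, so each position yields 2k scores and 4k coordinates.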

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN pipeline, with the blocks "Conv layers", "Conv Layer 5", "RPN", "RPN Proposals", "RoI pooling layer", "FC layers", "Class scores", "Class probabilities" and "Fast R-CNN"]

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned)

Step 2: Train Fast R-CNN with the RPN proposals learned in 1. Conv layers: ImageNet weights (fine-tuned)

Step 3: The model trained in 2 is used to initialize RPN and train again. Conv layers: weights from Step 2 (fixed)

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in 3, using the same shared convolutional layers as in 3. Conv layers: weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015 [Slides by Míriam Bellver]

Object is localized based on visual features from AlexNet FC6

Slide credit: Míriam Bellver

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward function R [figure label: ground-truth bounding box]

Reward function R for the trigger action: reward value 3, minimum IoU 0.6

The reward function considers the number of steps as a cost
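A rough sketch of the two rewards, following one reading of Caicedo and Lazebnik rather than the slide itself: a transformation action earns +1 when it improves the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 when the final IoU reaches 0.6 and -3 otherwise. The function names, and the choice to penalize a move that leaves the IoU unchanged, are assumptions.

```python
def move_reward(iou_before, iou_after):
    """Reward for a transformation action: did the box get closer to the target?"""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """Reward for the trigger action, using the slide's values 3 and 0.6."""
    return eta if final_iou >= tau else -eta


# Hypothetical episode: IoU with the ground truth after each transformation.
ious = [0.10, 0.25, 0.22, 0.48, 0.65]
rewards = [move_reward(a, b) for a, b in zip(ious, ious[1:])]
rewards.append(trigger_reward(ious[-1]))
```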

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. Category-specific Q-network

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
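The agent's policy reduces to an argmax over the Q-network outputs, one value per action. The sketch below takes those outputs as a plain list; the optional epsilon-greedy exploration is a standard Q-learning ingredient added here as an assumption, and `select_action` is a hypothetical name.

```python
import random


def select_action(q_values, epsilon=0.0, rng=random):
    """Return the index of the action to take.

    With probability `epsilon` a random action is explored; otherwise the
    action with the highest estimated action value is chosen.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-network outputs for three actions.
action = select_action([0.1, 0.9, 0.3])
```

At test time epsilon would be 0, so the agent always follows the learnt action-value function.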

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015 [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Randomly sampled sub-windows (blocks): positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative example otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
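Rescaling 3 times per octave means that consecutive scales differ by a factor of 2^(1/3), so the image size doubles every three steps. A minimal sketch of the scale list, where the range limits and the function name are hypothetical:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scales from min_scale up to max_scale, multiplied by 2**(1/steps) each step."""
    scales = []
    k = 0
    while True:
        scale = min_scale * 2 ** (k / steps_per_octave)
        if scale > max_scale * (1 + 1e-9):  # small tolerance for float rounding
            break
        scales.append(scale)
        k += 1
    return scales


# Hypothetical range: from half size to double size (two octaves).
scales = pyramid_scales(0.5, 2.0)
```

The detector is then run on the image resized by each of these factors, so that a fixed-size window can match faces of different sizes.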

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
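A minimal greedy NMS can be sketched as follows: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. The IoU threshold of 0.5 and the function names are assumptions; DDFD and the tutorial cited above may use different criteria.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the detections kept, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two overlap heavily, the third is apart.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```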

Detection Faces DDFD Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244

...by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
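The triplets mentioned above feed a triplet loss: the anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below uses squared Euclidean distances and a margin of 0.2, as recalled from the FaceNet paper; the margin value, the toy embeddings and the function names are assumptions, not taken from the slides.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Hypothetical 2-D embeddings.
easy = triplet_loss((0.0, 0.0), (0.1, 0.0), (1.0, 0.0))  # already well separated
hard = triplet_loss((0.0, 0.0), (1.0, 0.0), (0.5, 0.0))  # negative is closer
```

Only triplets with a non-zero loss contribute to learning, which is why the choice of triplets matters.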

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016

Objects Recognition Retrieval

Visual query: "A dog" → Image database → Expected outcome

Objects Recognition Retrieval

Visual query: "This dog" → Image database → Expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT) → N-dimensional feature space → Bag of Visual Words

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

High-dimensional, highly sparse

[Table: INVERTED FILE, with columns "word" and "Image ID"]
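Because bag-of-visual-words vectors are high-dimensional and highly sparse, an inverted file that maps each visual word to the images containing it makes lookup cheap. A minimal sketch follows; the image IDs, word IDs and the simple vote-count ranking are made up for illustration, and real systems typically add TF-IDF weighting.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidates(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [3, 7, 7], 2: [7, 9], 3: [1]}
index = build_inverted_file(database)
ranking = candidates(index, query_words=[7, 9])
```

Only images sharing at least one word with the query are touched, which is what keeps search scalable.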

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (pp. 1097-1105)

Convolutional Neural Networks

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). "Neural codes for image retrieval." In ECCV 2014
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). "CNN features off-the-shelf: an astounding baseline for recognition." In DeepVision CVPRW 2014

Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). "Aggregating local deep features for image retrieval." ICCV 2015
Tolias, G., Sicre, R., & Jégou, H. (2015). "Particular object retrieval with integral max-pooling of CNN activations." ICLR 2015
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). "Cross-dimensional Weighting for Aggregated Deep Convolutional Features." arXiv preprint arXiv:1512.04065

Convolutional Neural Networks: sum/max pooled conv features as global representation
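Sum or max pooling collapses each channel of a convolutional feature map into a single number, giving a global descriptor with one dimension per channel. A toy sketch, where the feature maps are nested lists with made-up values and the function name is hypothetical; the cited papers add normalization and weighting steps that are omitted here.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel (a 2-D list) into one value: its sum or its maximum."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in channel for value in row)
            for channel in feature_maps]


# Hypothetical conv output with 2 channels of size 2x2.
feature_maps = [[[1, 2], [3, 4]],
                [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
```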

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). "Exploiting local features from deep networks for image retrieval." In DeepVision CVPRW 2015

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Objects Recognition Retrieval

Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
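The step from local conv features to a 25K-D vector can be read as a bag-of-words encoding: each spatial position of the conv5_1 map yields one local feature, which is assigned to its nearest centroid, and the assignments are counted. The sketch below uses 3 centroids and 2-D features instead of 25K centroids and conv5_1 features; all values and names are illustrative assumptions.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to `feature` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, centroids[k])))


def bow_histogram(local_features, centroids):
    """Count how many local features fall on each visual word."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Hypothetical codebook and local features (one per spatial position).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
local_features = [(0.1, 0.0), (0.9, 1.2), (4.0, 5.0), (0.0, 0.2)]
histogram = bow_histogram(local_features, centroids)
```

With 25K centroids and at most 42x32 = 1344 local features per image, most histogram entries are zero, so the representation stays sparse.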

Objects Recognition Retrieval

Query representation: Global Search (GS), Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels "person", "bag", "me" and "my bag"]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013

Pyramid of three spatial scales

The same parameters in the three convnets: $\theta_i = \theta_0$ = filter weights ($H_l$) and biases ($b_l$)

Non-linearity: tanh. Pooling: max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Objects Segmentation: Farabet

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing.

Problem with Solutions 1 & 2: the observation level.

Binary Partition Tree (BPT) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

In the slide's example, C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
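The optimal-cover selection described above can be sketched as a walk from each leaf to the root that keeps the component with the lowest cost. This is a minimal illustration, not the authors' implementation: the tree, the node names and the cost values are hypothetical, and the cost S is assumed to be already computed for every component.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the component on its leaf-to-root path with minimal cost.

    parent: dict mapping a node to its parent (the root maps to None).
    cost:   dict mapping a node to its (precomputed, hypothetical) cost S.
    """
    best = {}
    for leaf in leaves:
        node = leaf
        best_node = leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # climb one level towards the root
        best[leaf] = best_node
    return best


# Hypothetical toy tree: C5 is the root; C3 and C4 are its children;
# C1 and C2 are leaves below C4.
parent = {"C1": "C4", "C2": "C4", "C3": "C5", "C4": "C5", "C5": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.3, "C4": 0.5, "C5": 0.4}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

With these made-up costs, leaf C2 is covered by C5, so it would take the class predicted for C5, while C1 and C3 keep their own prediction.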

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1, 2]. The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: a BBOX CNN and a REGION CNN. Training: the 2 networks are trained in isolation. Testing: the results (feature vectors 1 and 2) are combined as [1, 2].

Vector C: training as a whole (using segmentation overlap). Testing: the results are combined (using the output of the penultimate layer).

Penultimate fully connected layer → SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
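The pixel IU (Jaccard index) used in this evaluation can be illustrated with a small sketch. It assumes label maps given as nested lists of integer category ids, which is a simplification chosen for this example and not the benchmark's actual tooling.

```python
def pixel_iu(pred, gt, category):
    """Intersection over union of the pixels labelled `category` in two label maps."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == category
            in_gt = g == category
            inter += in_pred and in_gt
            union += in_pred or in_gt
    # The category is absent from both maps: the score is undefined.
    return inter / union if union else float("nan")


# Hypothetical 2x3 label maps with categories 0 and 1.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
iu_cat1 = pixel_iu(pred, gt, 1)  # 2 shared pixels out of 4 in the union
```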

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Proposals: DeepBox

Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. "DeepBox: Learning objectness with convolutional networks." ICCV 2015. [software]

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:

1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox

Proposals: DeepBox: Architecture

- AlexNet architecture (heavier). PASCAL VOC: AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7
- DeepBox architecture (lighter). PASCAL VOC: AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7

Small drop.

Proposals: DeepBox: Training

1) Initialize layers with AlexNet weights
2) Train on sliding windows
   Negative samples: extract windows by raster scanning
   Positive samples: having GT bounding boxes, they generate samples per instance with a perturbation
3) Train on hard negatives, by using bottom-up proposals from Edge Boxes

If GT overlap <= 0.3 → negative samples

If GT overlap >= 0.7 → positive samples
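The overlap thresholds above can be made concrete with a short sketch that labels a candidate window by its best intersection-over-union (IoU) with the ground-truth boxes. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the "ignored" label for windows between the two thresholds is an assumption of this example, since the slide does not say what happens to them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a window from its best overlap with any ground-truth box."""
    best = max((iou(window, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: windows in between are not used


gt_boxes = [(0, 0, 10, 10)]
labels = [label_window(w, gt_boxes)
          for w in [(1, 0, 10, 10), (20, 20, 30, 30), (0, 0, 10, 5)]]
```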

Proposals: DeepBox: Results

Slides credit: Marc Bolaños

[Figures: DeepBox vs. Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.

They increase the detection capabilities not only for known classes but also for unknown ones (suitable for object discovery).

Detection Objects

Slide credit: Amaia Salvador

Hand-crafted features: DPM (HOG features) [1]. Deep features: R-CNN [2], SPPnet [3].

+60

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs).

Detection Objects: R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects: Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3], but single scale.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

RoI pooling on CONV5: an h x w region is divided into H' x W' bins, so the size of the pooling bins is h/H' x w/W'. Max pooling is applied within each bin.
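The RoI pooling step with bins of size h/H' x w/W' can be sketched on a single-channel feature map. This is only an illustration under simplifying assumptions: the feature map is a nested list, the region is given in feature-map cells as (top, left, h, w), and bin borders are rounded outwards with floor and ceil, which is one of several possible rounding choices.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region of a 2D feature map into an out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Rows covered by bin i (bin height is roughly h / out_h).
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            # Columns covered by bin j (bin width is roughly w / out_w).
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 feature map whose value encodes the cell position.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 6), 2, 2)
part = roi_max_pool(fmap, (1, 1, 3, 5), 2, 2)
```

Whatever the size of the input region, the output always has out_h x out_w values, which is what lets the following fully connected layers receive a fixed-length input.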

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
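The slide only names the multi-task loss. As a hedged sketch, the form below follows the Fast R-CNN paper: a log loss on the true class plus a smooth L1 loss on the box regression targets, the latter applied only to non-background regions. The argument names, the convention that class 0 is the background, and the default weight lam = 1.0 are choices made for this example.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus localization loss for one region of interest."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background regions have no box to regress.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


# Hypothetical region: 2 classes, true class 1, 4 box regression offsets.
loss = multitask_loss([0.1, 0.9], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
```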

Detection Objects: Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

[CPMC] Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN).

Pipeline: conv layers (up to Conv Layer 5) → RPN → RPN proposals → RoI pooling layer → FC layers → class scores → class probabilities.

RPN outputs: objectness scores (object / no object) and bounding box regression.

In practice k = 9 (3 different scales and 3 aspect ratios).
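The k = 9 reference boxes (called anchors in the Faster R-CNN paper) can be enumerated with a few lines. The concrete scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are taken from that paper rather than from the slide, and the function name and box format are choices made for this sketch.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2 and aspect ratio height / width = ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
```

At every position of the sliding window over the conv feature map, the RPN would predict one objectness score pair and one box regression per anchor.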

The stages after the RPN proposals (RoI pooling layer, FC layers, class scores) correspond to Fast R-CNN.

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine tuned).

Step 2: Train Fast R-CNN with the learned RPN proposals (learned in Step 1). Conv layers: ImageNet weights (fine tuned).

Step 3: The model trained in Step 2 is used to initialize RPN and train again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]

Detection Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]

Slide credit: Míriam Bellver

The object is localized based on visual features from AlexNet FC6.

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR)

Set of states S: (o, h)

- o = feature vector from a pre-trained CNN (fc6), 4096 dimensions
- h = history of taken actions, binary vector of 90 dimensions

Reward function R (ground-truth bounding box)

Reward function R for the trigger action: 3; minimum IoU: 0.6

The reward function considers the number of steps as a cost.
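One way to read the reward slides, following the paper by Caicedo and Lazebnik, is sketched below: transformation actions are rewarded by the sign of the change in IoU with the ground-truth box, and the trigger action receives +3 when the final IoU reaches the 0.6 minimum and -3 otherwise. The function names and the IoU values in the example are hypothetical.

```python
def move_reward(iou_before, iou_after):
    """Reward for a transformation action: +1 if the IoU improved, -1 if it got worse."""
    diff = iou_after - iou_before
    return (diff > 0) - (diff < 0)


def trigger_reward(iou_final, eta=3.0, tau=0.6):
    """Reward for the trigger action: +eta if the final IoU reaches tau, else -eta."""
    return eta if iou_final >= tau else -eta


# Hypothetical episode: the box improves, then degrades, then the agent stops.
rewards = [move_reward(0.2, 0.5), move_reward(0.5, 0.4), trigger_reward(0.4)]
```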

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces: DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz.

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise.

Total samples: 200K positive and 20M negative.

Detection Faces: DDFD: Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
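Rescaling 3 times per octave can be read as using scale factors that are consecutive powers of 2^(1/3), so that every third step doubles or halves the image size. The sketch below lists such factors; the scale range of 0.5 to 2.0 is an arbitrary choice for the example, not a value from the slides.

```python
import math


def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors 2**(k / per_octave) that fall within [min_scale, max_scale]."""
    k_lo = math.ceil(per_octave * math.log2(min_scale))
    k_hi = math.floor(per_octave * math.log2(max_scale))
    return [2.0 ** (k / per_octave) for k in range(k_lo, k_hi + 1)]


# One octave down and one octave up around the original resolution.
scales = pyramid_scales(0.5, 2.0)
```

A fixed-size detector applied to each rescaled copy of the image then responds to faces of different sizes in the original image.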

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
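Non-maximum suppression can be sketched as a greedy loop: keep the highest-scoring detection, drop every remaining detection that overlaps it too much, and repeat. This is the generic greedy variant, not necessarily the exact strategy used in DDFD, and the 0.5 overlap threshold and the example boxes are assumptions of this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the detections kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it does not overlap an already kept one too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two overlap heavily, the third is isolated.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```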

Detection Faces: DDFD: Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

Faces Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
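The triplets mentioned above are (anchor, positive, negative) embeddings, where the anchor and the positive show the same identity. A minimal sketch of the triplet loss from the FaceNet paper follows; the margin of 0.2 is the value reported in that paper, while the 2-D embeddings are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor = (0.0, 0.0)
positive = (0.1, 0.0)
easy = triplet_loss(anchor, positive, (1.0, 0.0))  # negative already far away
hard = triplet_loss(anchor, positive, (0.2, 0.0))  # negative almost as close as the positive
```

Easy triplets contribute no loss, which is why the choice of triplets matters for training.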

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition: FaceNet: Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition: FaceNet: Software

Software implementation: OpenFace

Faces Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition: Retrieval

E. Mohedano, Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image database + visual query "A dog" → expected outcome

Image database + visual query "This dog" → expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Bag of Visual Words: local hand-crafted features (e.g. SIFT) in an N-dimensional feature space give a high-dimensional, highly sparse representation:

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word | Image ID
1    | 1, 12
2    | 1, 30, 102
3    | 10, 12
4    | 23, 6, 10
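The inverted file maps each visual word to the images that contain it, so a query only touches the images that share at least one word with it. The sketch below is a minimal illustration: the image ids and word ids are made up, and ranking by the number of shared words is a simplification of the scoring used in real bag-of-words retrieval.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word ids."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def rank_candidates(index, query_words):
    """Rank the images by how many distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image id for determinism.
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Hypothetical database of four images described by visual word ids.
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]})
ranking = rank_candidates(index, [1, 3])
```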

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
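Sum or max pooling of conv features collapses each channel of a convolutional layer into a single number, which gives one global vector per image with as many dimensions as channels. The sketch below assumes the activations are a list of channels, each a nested list of H x W values; the post-processing used in the cited papers (normalization, whitening, weighting) is left out.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps of size H x W into a C-dimensional descriptor."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce = sum if mode == "sum" else max
    return [reduce(value for row in channel for value in row)
            for channel in feature_maps]


# Hypothetical activations: 2 channels of size 2x2.
maps = [[[1, 2], [3, 4]],
        [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(maps, "sum")
max_descriptor = pool_conv_features(maps, "max")
```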

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector

Query representation: Global Search (GS) and Local Search (LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters are shared by the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
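A pixel-wise soft-max classifier can be sketched as follows: every pixel carries one score per class, the scores are normalized independently at each pixel, and the label is the most probable class. The function name and the nested-list data layout are assumptions for illustration only.

```python
import math

def pixelwise_softmax(scores):
    """scores[y][x] is a list of per-class scores; returns (probs, labels)."""
    probs, labels = [], []
    for row in scores:
        prob_row, label_row = [], []
        for s in row:
            m = max(s)                           # subtract the max for numerical stability
            exps = [math.exp(v - m) for v in s]
            total = sum(exps)
            p = [e / total for e in exps]
            prob_row.append(p)
            label_row.append(p.index(max(p)))    # most likely class for this pixel
        probs.append(prob_row)
        labels.append(label_row)
    return probs, labels

# Hypothetical 1x2 image with three classes.
probs, labels = pixelwise_softmax([[[2.0, 1.0, 0.0], [0.0, 0.0, 5.0]]])
```

Because each pixel is classified on its own, nothing in this step enforces agreement between neighbouring labels.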

Problem: no spatial consistency among labels

Three explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
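The optimal-cover rule can be sketched on a toy segmentation tree: for every leaf, walk up to the root and keep the node with the lowest cost. The tree, the node names (p1, p2, p3, C2, C3, C5) and the cost values below are hypothetical; in practice the cost S would come from the classifier output for each component.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on the leaf-to-root path."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)   # None once the root has been visited
        best[leaf] = best_node
    return best

# Toy tree: C5 is the root, C2 and C3 are intermediate regions, p1..p3 are pixels.
parent = {"p1": "C2", "p2": "C2", "p3": "C3", "C2": "C5", "C3": "C5", "C5": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.1, "C2": 0.6, "C3": 0.4, "C5": 0.3}
cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

In this toy case the pixels under C2 select C5, so they would inherit the class of C5, while p3 is best explained at its own level.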

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

[Figure: vector A: BBOX CNN, feature vector 1, feature vector 2, concatenation [1 2]]

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Background masked out with the mean image

[Figure: vector B: BBOX CNN, REGION CNN, feature vector 1, feature vector 2, concatenation [1 2]]

Training: 2 networks trained in isolation

Testing: results are combined

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Penultimate fully connected layer

SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
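For reference, pixel IU (the Jaccard index) for one class is the number of pixels where prediction and ground truth agree on that class, divided by the number of pixels where either of them uses it. A minimal sketch follows, with label maps as lists of rows; the function name and the toy maps are assumptions.

```python
def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union between two label maps (lists of rows)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                inter[p] += 1
                union[p] += 1
            else:
                union[p] += 1
                union[g] += 1
    # Classes absent from both maps get None instead of a division by zero.
    return [i / u if u else None for i, u in zip(inter, union)]

# Hypothetical 2x2 prediction and ground truth with three possible classes.
ius = pixel_iu([[0, 0], [1, 1]], [[0, 1], [1, 1]], num_classes=3)
```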

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me", "my bag"]


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Proposals Hand-crafted

Slides credit: Marc Bolaños

Hand-crafted proposals used to be bottom-up:

Selective Search (SS), Multiscale Combinatorial Grouping (MCG)

[SS] Uijlings, Jasper R. R., Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. "Selective search for object recognition." International Journal of Computer Vision 104, no. 2 (2013): 154-171.

[MCG] Arbeláez, Pablo, Jordi Pont-Tuset, Jonathan Barron, Ferran Marques, and Jitendra Malik. "Multiscale combinatorial grouping." CVPR 2014.

Proposals DeepBox

Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. "DeepBox: Learning objectness with convolutional networks." ICCV 2015. [software]

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox

Proposals DeepBox Architecture

Slides credit: Marc Bolaños

AlexNet architecture (heavier), PASCAL VOC: AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7

DeepBox architecture (lighter), PASCAL VOC: AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7

Small drop

Proposals DeepBox Training

Slides credit: Marc Bolaños

1) Initialize layers with AlexNet weights

2) Train on sliding windows
- Negative samples: extract windows by raster scanning
- Positive samples: having GT bounding boxes, they generate perturbed samples per instance

3) Train on hard negatives
- By using bottom-up proposals from Edge Boxes
- If GT overlap ≤ 0.3 → negative samples
- If GT overlap ≥ 0.7 → positive samples
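The overlap thresholds above can be sketched as a small labeling routine based on intersection over union (IoU). The box format (x1, y1, x2, y2), the function names and the "ignored" label for proposals falling between the two thresholds are assumptions for illustration; the slide only states the 0.3 and 0.7 cut-offs.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(box, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal by its best overlap with any ground-truth box."""
    best = max((iou(box, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"   # assumption: in-between overlaps are not used for training

gt = [(0, 0, 10, 10)]
```

For example, `label_proposal((0, 0, 10, 5), gt)` has an IoU of 0.5 with the ground truth and therefore falls in the unused middle band.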

Proposals DeepBox Results

Slides credit: Marc Bolaños

[Figure: DeepBox vs. Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.

DeepBox improves detection not only of known classes but also of unknown ones (suitable for object discovery).

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me", "my bag"]


Detection Objects

Hand-crafted features: DPM (HOG features) [1]

Deep features: R-CNN [2], SPPnet [3]

+60

Slide credit: Amaia Salvador

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3] but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

[Figure: RoI pooling on the CONV5 feature map. A region of size h x w is divided into an H' x W' grid and max pooling is applied in each bin]

Size of pooling bins: h/H' x w/W'
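RoI max pooling can be sketched in a few lines: the h x w region on the feature map is split into an H' x W' grid of bins of roughly h/H' x w/W' cells, and the maximum is taken in each bin. The function name, the single-channel list-of-lists feature map and the floor/ceil bin boundaries are assumptions for this example.

```python
import math

def roi_max_pool(fmap, roi, out_h, out_w):
    """fmap: 2D list; roi: (top, left, h, w) in feature-map cells."""
    top, left, h, w = roi
    out = []
    for i in range(out_h):
        # Bin i covers rows [r0, r1); floor/ceil keeps every bin non-empty.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out

# Hypothetical 4x4 feature map with values 0..15 in row-major order.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
pooled_full = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
pooled_odd = roi_max_pool(fmap, (1, 1, 3, 3), 2, 2)
```

Whatever the size of the region, the output has a fixed H' x W' shape, which is what lets the fully connected layers that follow accept arbitrary proposals.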

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
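The slide does not spell out the loss. As a hedged sketch based on the Fast R-CNN paper as commonly described, the multi-task loss adds a classification term (log loss on the true class) and a box-regression term (smooth L1) that is only active for non-background RoIs. Function names, argument layout and the toy values in the note below are assumptions.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """L = L_cls + lam * [true_class >= 1] * L_loc, with class 0 as background."""
    l_cls = -math.log(class_probs[true_class])
    if true_class == 0:
        return l_cls            # background RoIs carry no box-regression term
    l_loc = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return l_cls + lam * l_loc
```

For instance, `multitask_loss([0.5, 0.5], 1, [0, 0, 0, 0], [0.5, 0, 0, 2.0])` adds 0.125 and 1.5 of localization loss to the classification loss of -log(0.5).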

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
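To make k = 9 concrete, the sketch below enumerates 3 scales x 3 aspect ratios around one sliding-window position. The specific scales (128, 256, 512 pixels) and ratios (0.5, 1, 2) are assumptions taken from the commonly reported Faster R-CNN setup, and the function name is hypothetical.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:              # r = height / width
            w = s / math.sqrt(r)      # keep the area close to s * s
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0.0, 0.0)
```

The RPN then predicts, for every position, an objectness score and a box refinement for each of these k reference boxes.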

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

4-step training to share features for RPN and Fast R-CNN

Step 1: Train the RPN, initialized with an ImageNet pre-trained model
- Conv layers: ImageNet weights (fine-tuned)

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1
- Conv layers: ImageNet weights (fine-tuned)

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again
- Conv layers: weights from Step 2 (fixed)

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3
- Conv layers: weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection accuracy (Pascal VOC)

Timing in ms (Pascal VOC)


Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit: Míriam Bellver

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR)

Set of states S: (o, h)
- o = feature vector from a pre-trained CNN (fc6), 4096 dimensions
- h = history of taken actions, binary vector of dimension 90

Reward function R

[Figure: ground-truth bounding box]

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

3

Minimum IoU: 0.6
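The reward equations did not survive in this transcript. The sketch below follows the reward scheme of the cited paper as it is usually summarized: a transformation action earns +1 when it improves the IoU with the ground-truth box and -1 otherwise, and the trigger earns +eta when the final IoU reaches the minimum tau, else -eta. Reading the slide annotations as eta = 3 and tau = 0.6 is an assumption, as are the function names.

```python
def move_reward(iou_before, iou_after):
    """+1 if the transformation improved IoU with the ground truth, -1 otherwise."""
    return 1.0 if iou_after > iou_before else -1.0

def trigger_reward(iou_final, eta=3.0, tau=0.6):
    """Reward for ending the search: +eta if the box is good enough, -eta otherwise."""
    return eta if iou_final >= tau else -eta
```

For example, moving from IoU 0.2 to 0.5 yields +1, and triggering at IoU 0.7 yields +3.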

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
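The greedy policy, an exploratory variant and the Q-learning target can be sketched as below, together with a bounded replay memory. The discount factor, the memory size, the epsilon-greedy exploration and all names are assumptions for illustration; the network that produces the Q-values is left out.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Regression target for Q(s, a): r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)

# Replay memory of (state, action, reward, next_state) tuples; old entries drop out.
replay_memory = deque(maxlen=1000)
replay_memory.append(("state", 2, 1.0, "next_state"))
```

During training, mini-batches would be sampled from `replay_memory` and the network regressed towards `q_target` for the stored actions.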

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals


Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

- Sub-windows (blocks) are sampled randomly
- A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
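Rescaling 3 times per octave means consecutive scale factors differ by 2^(1/3), so three steps double (or halve) the image size. A small sketch that enumerates such scales within a range follows; the range limits of 0.5 and 2.0 and the function name are assumptions.

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) lying within [min_scale, max_scale]."""
    k_min = math.ceil(steps_per_octave * math.log2(min_scale))
    k_max = math.floor(steps_per_octave * math.log2(max_scale))
    return [2 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]

scales = pyramid_scales(0.5, 2.0)
```

Each scale would be used to resize the test image before running the fixed-size detector over it.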

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
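The conversion of a fully-connected layer into a convolution can be checked on a toy case: a unit that takes a flattened k x k patch is equivalent to a k x k filter, and sliding that filter over a larger image yields one output per window, i.e. a heat map. The single-channel setup, the weights and the function names are assumptions; the "convolution" is a cross-correlation, as in most ConvNet frameworks.

```python
def fc_forward(patch, weights, bias):
    """Fully-connected unit applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias

def fc_as_conv(image, weights, bias, k):
    """The same unit applied as a k x k convolution (valid mode, stride 1)."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]   # reshape the FC weights
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(kernel[r][c] * image[y + r][x + c]
                 for r in range(k) for c in range(k)) + bias
             for x in range(out_w)]
            for y in range(out_h)]

# Hypothetical 3x3 image and a unit trained on 2x2 inputs.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weights, bias = [1, 0, 0, 1], 0.5
heat_map = fc_as_conv(image, weights, bias, k=2)
```

Each entry of `heat_map` equals `fc_forward` applied to the matching window, but the shared computation is done in one pass instead of once per window.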

Detection Faces DDFD Test

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapping detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
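A minimal greedy NMS sketch is given below: detections are visited from the highest to the lowest score, and a box is kept only if it does not overlap too much with any box already kept. The IoU criterion, the 0.5 threshold, the box format (x1, y1, x2, y2) and the names are assumptions for this example, not necessarily the variant used in the cited source.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Return the indices of the detections kept, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it is not a near-duplicate of a stronger one.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box survives.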

Detection Faces DDFD Results

Precision vs. recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me", "my bag"]


Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

By means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
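Learning from triplets is usually expressed with a triplet loss: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below uses squared Euclidean distances; the margin of 0.2, the toy embeddings and the function names are assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
                    - squared_distance(anchor, negative) + margin)

easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [2.0, 0.0])   # negative already far away
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])   # negative as close as positive
```

Only triplets with a non-zero loss contribute gradients, which is why the choice of triplets matters for training.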

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]


E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

[Figure: image database, visual query ("A dog"), expected outcome]

[Figure: image database, visual query ("This dog"), expected outcome]

Instance retrieval (instance: object, building, person, place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
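An inverted file maps each visual word to the images that contain it, so a query only touches images sharing at least one word with it. The sketch below builds such an index and ranks images by the number of shared words; the toy image IDs, the simple vote count (real systems typically weight words, e.g. with tf-idf) and the function names are assumptions.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: {image_id: [visual word ids]} -> {word: set of image ids}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def search(index, query_words):
    """Rank images by how many distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical database of four images described by their visual words.
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3]})
ranking = search(index, [1, 2])
```

Because bag-of-words vectors are highly sparse, each word's posting list stays short and lookups remain fast on large databases.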

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Sum/max-pooled conv features as global representation
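Sum or max pooling turns a stack of conv activation maps into one value per channel, giving a compact global descriptor. A minimal sketch follows; the channel-first nested-list layout, the final L2 normalization and the names are assumptions, and the cited methods add further steps (weighting, whitening, region pooling) that are not shown.

```python
import math

def pool_conv_features(fmap, mode="sum"):
    """fmap[c][y][x]: one 2D activation map per channel -> one value per channel."""
    reduce_fn = sum if mode == "sum" else max
    desc = [reduce_fn(v for row in channel for v in row) for channel in fmap]
    # L2-normalize so that descriptors can be compared with a dot product.
    norm = math.sqrt(sum(d * d for d in desc))
    return [d / norm for d in desc] if norm > 0 else desc

# Hypothetical feature map with two channels of size 2x2.
fmap = [[[1, 2], [3, 0]], [[0, 0], [0, 8]]]
sum_desc = pool_conv_features(fmap, "sum")
max_desc = pool_conv_features(fmap, "max")
```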

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0: filter weights (H_l) and biases (b_l)

Non-linearity: tanh
Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation
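A minimal sketch of the upsampling and concatenation step, assuming nearest-neighbour upsampling and single-channel maps for brevity. The three toy maps stand in for the outputs of the shared convnet at the three pyramid scales; all names and values are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an H x W map by an integer factor."""
    return [[fmap[y // factor][x // factor]
             for x in range(len(fmap[0]) * factor)]
            for y in range(len(fmap) * factor)]


def concat_scales(maps_per_scale, factors):
    """Upsample each scale to full resolution and stack the values per pixel."""
    upsampled = [upsample_nearest(m, f) for m, f in zip(maps_per_scale, factors)]
    height, width = len(upsampled[0]), len(upsampled[0][0])
    return [[[u[y][x] for u in upsampled] for x in range(width)]
            for y in range(height)]


fine = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
mid = [[20, 21], [22, 23]]
coarse = [[30]]
features = concat_scales([fine, mid, coarse], factors=[1, 2, 4])
# features[y][x] holds one value per scale for pixel (y, x).
```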

Objects Segmentation Farabet

Pixel-wise soft-max classifier
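A minimal sketch of a pixel-wise soft-max classifier: every pixel has its own vector of class scores, which is turned into probabilities and labelled independently. The score values are hypothetical. Because each pixel is decided on its own, neighbouring labels need not agree.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of class scores; return the most probable class per pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


scores = [[[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]],
          [[0.5, 4.0, 0.1], [1.0, 0.2, 0.3]]]
labels = label_pixels(scores)
```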

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S

C2 will be labelled with the class of C5
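A minimal sketch of the optimal cover selection: each leaf walks up to the root and keeps the component with the lowest cost S. The tree shape and the cost values are hypothetical, chosen so that leaf C2 ends up covered by C5.

```python
def optimal_cover(leaves, parent, cost):
    """For each leaf, pick the node with minimal cost on its path to the root."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # the root maps to None
        best[leaf] = choice
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.3, "C4": 0.1,
        "C5": 0.4, "C6": 0.5, "C7": 0.8}
cover = optimal_cover(["C1", "C2", "C3", "C4"], parent, cost)
```

Each leaf then takes the class predicted for its selected component.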

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region (background masked out with the mean image); both are concatenated as [1 2]

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: the BBOX CNN gives feature vector 1 and the REGION CNN gives feature vector 2; both are concatenated as [1 2]

Training: 2 networks trained in isolation

Testing: results are combined

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer, followed by an SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
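A minimal sketch of pixel IU (Jaccard index) for one category: the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. The two label maps are hypothetical.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index for one category between two label maps of equal size."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else 0.0


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = pixel_iu(pred, truth, category=1)  # 2 shared pixels out of 4 in the union
```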

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Proposals DeepBox

Kuo, Weicheng, Bharath Hariharan, and Jitendra Malik. DeepBox: Learning objectness with convolutional networks. ICCV 2015. [software]

Proposals DeepBox

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:
1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox

Proposals DeepBox Architecture

Slides credit: Marc Bolaños

AlexNet architecture (heavier), PASCAL VOC: AUC = 0.75 at IoU = 0.5; AUC = 0.62 at IoU = 0.7

DeepBox architecture (lighter), PASCAL VOC: AUC = 0.74 at IoU = 0.5; AUC = 0.60 at IoU = 0.7

Small drop

Proposals DeepBox Training

Slides credit: Marc Bolaños

1) Initialize layers with AlexNet weights

2) Train on Sliding Windows
Negative Samples: extract windows by raster scanning
Positive Samples: having GT bounding boxes, they generate perturbed samples per instance

3) Train on Hard Negatives, by using bottom-up proposals from Edge Boxes

If GT overlap <= 0.3 → Negative Samples

If GT overlap >= 0.7 → Positive Samples
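A minimal sketch of the overlap rule above: a window is labelled by its best IoU with any ground-truth box. The box format (x1, y1, x2, y2) and the "ignored" label for windows between the two thresholds are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def label_window(window, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Assign a training label from the best overlap with the ground truth."""
    best = max(iou(window, g) for g in gt_boxes)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: ambiguous windows are not used for training


gt = [(0, 0, 10, 10)]
labels = [label_window(w, gt) for w in [(0, 0, 10, 9), (8, 8, 18, 18), (0, 0, 10, 5)]]
```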

Proposals DeepBox Results

Slides credit: Marc Bolaños

Figure panels: DeepBox vs. Edge Boxes

Proposals DeepBox Results

With a rather simple approach ConvNets can obtain much better results than previous techniques for Object Proposals

Slides credit Marc Bolantildeos

Proposals DeepBox Results

Slides credit: Marc Bolaños

Proposals DeepBox Results

Slides credit: Marc Bolaños

DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for Object Discovery)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Detection Objects

Detection Objects

Slide credit: Amaia Salvador

Hand-crafted features: DPM (HOG features) [1]
Deep features: R-CNN [2], SPPnet [3]

+60

Detection Objects

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.

Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Detection Objects R-CNN

Slide credit: Joost van de Weijer

Detection Objects R-CNN

Slide credit: Joost van de Weijer

Detection Objects R-CNN

Detection Objects Fast R-CNN

Girshick, Ross. Fast R-CNN. ICCV 2015.

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Same as SPP [3], but single scale

Detection Objects Fast R-CNN

Slide credit: Joost van de Weijer

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

RoI pooling on CONV5: a region of size h x w is divided into an H' x W' grid

Size of pooling bins: h/H' x w/W'

Max pooling inside each bin
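A minimal sketch of RoI max pooling on a single-channel map, assuming integer bin edges obtained by flooring the start and ceiling the end of each bin (one common choice, not necessarily the one used in the original code). The CONV5 values are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the RoI (y0, x0, h, w) of a 2-D map into an out_h x out_w grid."""
    y0, x0, h, w = roi
    pooled = []
    for i in range(out_h):
        ys, ye = y0 + (i * h) // out_h, y0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            xs, xe = x0 + (j * w) // out_w, x0 + ceil_div((j + 1) * w, out_w)
            row.append(max(fmap[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 CONV5 map whose value encodes its position.
conv5 = [[6 * y + x for x in range(6)] for y in range(4)]
pooled = roi_max_pool(conv5, roi=(0, 0, 4, 6), out_h=2, out_w=2)
```

The output grid has the same size for every region, which is what lets regions of different shapes feed the same FC layers.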

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Multi-task loss
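A minimal sketch of a multi-task loss, assuming the form described in the Fast R-CNN paper: a log loss on the true class plus a smooth L1 loss on the box regression targets, applied only to non-background regions. The weight `lam` and the convention that class 0 is background are assumptions of this sketch.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus box regression loss for foreground classes."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:  # assumption: class 0 is background, no box loss
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


background = multi_task_loss([0.5, 0.5], 0, [0, 0, 0, 0], [1, 1, 1, 1])
foreground = multi_task_loss([0.5, 0.5], 1, [0, 0, 0, 0], [0.5, 0, 2, 0])
```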

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R. and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Object proposal methods: Selective Search, CPMC, MCG

Object Proposal computation is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T. & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J. & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F. & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Object proposal methods: Selective Search, CPMC, MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
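A minimal sketch of how k = 9 anchor boxes can be generated at one location from 3 scales and 3 aspect ratios. The specific scale and ratio values, and the choice to keep the area constant across ratios, are assumptions for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width, area kept close to s * s
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0, 0)
```

The RPN then predicts, for each of the k anchors at every location, one objectness score pair and one set of box regression offsets.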

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

4-step training to share features for RPN and Fast R-CNN

Diagram: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train RPN initialized with an ImageNet pre-trained model

Diagram: Conv layers with ImageNet weights (fine-tuned), Conv Layer 5, RPN, RPN Proposals

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 2: Train Fast R-CNN with learned RPN proposals

Diagram: Conv layers with ImageNet weights (fine-tuned), Conv Layer 5, RPN Proposals (learned in 1), Class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 3: The model trained in 2 is used to initialize RPN and train again

Diagram: Conv layers with weights from Step 2 (fixed), Conv Layer 5, RPN, RPN Proposals

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in 3

Diagram: Conv layers with weights from Steps 2 & 3 (fixed), Conv Layer 5, RPN Proposals (learned in 3), Class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]

Detection Objects Reinforcement Learning

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6), 4096 dimensions

h = history of taken actions, binary vector of dimension 90
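A minimal sketch of how such a state vector could be assembled. It assumes the 90-dimensional history is a one-hot encoding of the last 10 actions over 9 possible actions (10 x 9 = 90); that split, and all names below, are assumptions of this sketch.

```python
NUM_ACTIONS = 9    # assumption: 9 possible actions
HISTORY_LEN = 10   # assumption: last 10 actions, giving 10 * 9 = 90 bits


def encode_history(past_actions):
    """One-hot encode the most recent actions into a fixed-length binary vector."""
    recent = past_actions[-HISTORY_LEN:]
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """State = fc6 feature vector o followed by the action history h."""
    return list(o) + encode_history(past_actions)


state = build_state([0.0] * 4096, past_actions=[2, 8])
```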

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward Function R

Figure label: ground-truth bounding box

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward Function R for trigger action

The reward function considers the number of steps as a cost

Figure annotations: 3; minimum IoU 0.6
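A minimal sketch of one possible reward scheme for this agent. It assumes that a transformation action is rewarded by whether it improves the IoU with the ground-truth box, and that the trigger action earns plus or minus a fixed value depending on a minimum IoU; reading the slide's "3" and "0.6" as that value and that threshold is an assumption.

```python
def move_reward(iou_before, iou_after):
    """+1 if the transformation improved the overlap with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(final_iou, eta=3.0, min_iou=0.6):
    """Reward for ending the search: positive only if the box overlaps enough."""
    return eta if final_iou >= min_iou else -eta


rewards = [move_reward(0.30, 0.45), move_reward(0.45, 0.40),
           trigger_reward(0.72), trigger_reward(0.50)]
```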

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement Learning using Q-learning

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
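A minimal sketch of the three ingredients named above: a greedy policy over the network outputs, the Q-learning target used to train it, and a replay memory. The Q-network itself is left out; the discount factor, memory size, and tuple layout are assumptions of this sketch.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Pick the action whose estimated value is highest (one value per output unit)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: reward plus discounted best value of the next state."""
    return reward if terminal else reward + gamma * max(next_q_values)


# Replay memory of (state, action, reward, next_state, terminal) experiences.
replay_memory = deque(maxlen=1000)
replay_memory.append(("s0", 1, 1.0, "s1", False))
replay_memory.append(("s1", 8, 3.0, "s2", True))
batch = random.sample(list(replay_memory), k=2)  # experiences drawn for one update
```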

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
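A minimal sketch of the scale factors such a pyramid would use: with 3 steps per octave, consecutive scales differ by a factor of 2^(1/3). The number of octaves covered in each direction is an assumption of this sketch.

```python
def pyramid_scales(octaves_down=2, octaves_up=1, steps_per_octave=3):
    """Scale factors from 2**-octaves_down to 2**octaves_up, evenly spaced in log2."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2.0 ** (i / steps_per_octave) for i in range(first, last + 1)]


scales = pyramid_scales()
# Each test image would be resized by every factor before running the detector.
```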

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
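A minimal sketch of why this conversion works: a fully-connected unit is a dot product over a fixed-size window, so the same weights can be slid over a larger input as a convolution kernel, yielding one score per position. The 2 x 2 weights and the toy image are hypothetical.

```python
def fc_score(window, weights, bias):
    """Fully-connected unit: dot product between a fixed-size window and the weights."""
    return bias + sum(w * v
                      for w_row, v_row in zip(weights, window)
                      for w, v in zip(w_row, v_row))


def conv_heatmap(image, weights, bias):
    """The same weights applied as a kernel at every position of an image of any size."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[fc_score([r[x:x + k] for r in image[y:y + k]], weights, bias)
             for x in range(cols)]
            for y in range(rows)]


weights = [[1, 0], [0, 1]]
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
heatmap = conv_heatmap(image, weights, bias=0)
```

On an input of exactly the window size the heat-map has a single value, identical to the fully-connected output.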

Detection Faces DDFD Test

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
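A minimal sketch of greedy non-maximum suppression: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. The IoU threshold of 0.3 and the toy detections are assumptions of this sketch.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms(detections, iou_thr=0.3):
    """detections: list of (score, box). Keep the best box of each overlapping group."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(box_iou(box, kept_box) <= iou_thr for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [(0.8, (1, 1, 11, 11)), (0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
faces = nms(detections)
```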

Detection Faces DDFD Results

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

... by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
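A minimal sketch of a triplet loss on embeddings: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The margin value and the 2-D toy embeddings are assumptions of this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy = triplet_loss(anchor, positive, negative=[2.0, 0.0])  # already well separated
hard = triplet_loss(anchor, positive, negative=[1.0, 0.0])  # as close as the positive
```

Triplets with zero loss contribute no gradient, which is why the choice of triplets matters during training.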


Faces Recognition FaceNet

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)


Faces Recognition FaceNet

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.

Objects Recognition Retrieval

Image Database

Visual Query: "A dog"

Expected outcome

Objects Recognition Retrieval

Image Database

Visual Query: "This dog"

Expected outcome

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE: table with columns "word" and "Image ID"
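A minimal sketch of an inverted file over bags of visual words: each visual word points to the images that contain it, so a query only touches the images sharing at least one word. The image IDs, word IDs, and the simple vote-count ranking are hypothetical choices for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words maps image ID -> list of visual word IDs found in that image."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidates(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))


database = {1: [1, 2], 10: [3, 6], 12: [1, 3], 30: [2]}
index = build_inverted_file(database)
ranking = candidates(index, query_words=[1, 3])
```

Because the bag-of-words vectors are highly sparse, each word's list stays short and lookups remain cheap even for large databases.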

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision, CVPRW 2014.

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Pyramid of three spatial scales

The same parameters are used in the three convnets:

theta_i = theta_0 = (filter weights H_l and biases b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
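A minimal sketch of a pixel-wise softmax, assuming each pixel already carries one raw score per class. The input layout `scores[y][x][class]` and the function name are hypothetical.

```python
import math

def pixelwise_softmax(scores):
    """scores[y][x] is a list of per-class scores for one pixel.
    Returns (probabilities, labels) with the same spatial layout."""
    probs, labels = [], []
    for row in scores:
        prob_row, label_row = [], []
        for s in row:
            m = max(s)                          # subtract max for numerical stability
            e = [math.exp(v - m) for v in s]
            z = sum(e)
            p = [v / z for v in e]
            prob_row.append(p)
            label_row.append(p.index(max(p)))   # most probable class
        probs.append(prob_row)
        labels.append(label_row)
    return probs, labels

scores = [[[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]]   # 1 x 2 image, 3 classes
probs, labels = pixelwise_softmax(scores)
```

Each pixel is classified independently here, so nothing ties the label of a pixel to the labels of its neighbours.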

Problem: no spatial consistency among labels

Three explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Prediction with a 2-layer network

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
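The optimal cover rule can be sketched as a walk from each leaf to the root of the segmentation tree. The tree, costs and class names below are hypothetical; they are chosen so that C2 ends up with the class of C5, as in the slide. How the cost S is computed is not modelled here.

```python
def optimal_cover(leaves, parent, cost, label):
    """For each leaf, pick the component (leaf included) with minimal
    cost on the path to the root, and take that component's class."""
    result = {}
    for leaf in leaves:
        best, node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent.get(node)     # None once the root has been visited
        result[leaf] = (best, label[best])
    return result

# Hypothetical tree: C5 is the parent of C1 and C2; C6 is the root.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C5": 0.4, "C6": 0.9}
label = {"C1": "sky", "C2": "tree", "C3": "road", "C5": "building", "C6": "road"}
cover = optimal_cover(["C1", "C2", "C3"], parent, cost, label)
```

Different leaves can select components at different depths of the tree, which is the sense in which the observation level is chosen per pixel.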

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation. ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) is used to generate object candidates:

Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions

Vector A: both feature vectors come from the same BBOX CNN. Feature vector 1 is extracted from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN; the two are concatenated as [1 2].

Training: the 2 networks are trained in isolation.

Testing: results are combined.

Vector C:

Training: as a whole (using segmentation overlap).

Testing: results are combined (using the output of the penultimate layer).

[Diagram: penultimate fully connected layer → SVM]

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
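For reference, the pixel IU (Jaccard index) of one category can be computed from two label maps as intersection over union of the pixels carrying that category. The function name and the toy label maps are illustrative only.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index for one category: |pred AND truth| / |pred OR truth|,
    counted over the pixels of two label maps with identical shape."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_p, in_t = p == category, t == category
            inter += in_p and in_t
            union += in_p or in_t
    return inter / union if union else 0.0

pred = [["bg", "cat", "cat"],
        ["bg", "cat", "bg"]]
truth = [["bg", "bg", "cat"],
         ["bg", "cat", "cat"]]
iu_cat = pixel_iu(pred, truth, "cat")   # 2 shared pixels out of 4 in the union
```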

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Proposals DeepBox

Slides credit: Marc Bolaños

DeepBox proposes a very simple method:

1) Use a state-of-the-art method (Edge Boxes) to generate initial object proposals
2) Rerank them (and possibly discard them) by using DeepBox

Proposals DeepBox Architecture

AlexNet architecture (heavier): PASCAL VOC, AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7
DeepBox architecture (lighter): PASCAL VOC, AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7

Small drop

Proposals DeepBox Training

1) Initialize layers with AlexNet weights

2) Train on Sliding Windows
Negative samples: extract windows by raster scanning
Positive samples: having GT bounding boxes, they generate perturbed samples per instance

3) Train on Hard Negatives, by using bottom-up proposals from Edge Boxes
If GT overlap <= 0.3 → negative sample
If GT overlap >= 0.7 → positive sample
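The overlap thresholds above can be sketched as a labelling rule. This assumes "GT overlap" means IoU with the best-matching ground-truth box and that proposals between the two thresholds are ignored; both are assumptions, and the helper names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Return 'positive', 'negative', or None (ignored) for one proposal."""
    best = max((box_iou(proposal, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return None

gt = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 9), (0, 0, 10, 5), (20, 20, 30, 30)]   # IoU 0.9, 0.5, 0.0
labels = [label_proposal(p, gt) for p in proposals]
```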

Proposals DeepBox Results

[Figure: DeepBox vs Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.

DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for Object Discovery).

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Detection Objects

[Chart: hand-crafted features (DPM with HOG features [1]) vs deep features (R-CNN [2], SPPnet [3]); annotation on the chart: +60]

Slide credit: Amaia Salvador

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.

Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Parts-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. Fast R-CNN. ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3], but single scale.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.

Slide credit: Joost van de Weijer

[Diagram: RoI pooling on CONV5. A region of size h x w is divided into H' x W' pooling bins, each of size h/H' x w/W', and max pooling is applied within each bin.]

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
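For reference, the multi-task loss can be written as follows, using the notation of the Fast R-CNN paper cited above rather than anything shown in this transcript: p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the regression target, and lambda a balancing weight.

```latex
\begin{align}
L(p, u, t^{u}, v) &= L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^{u}, v) \\
L_{\mathrm{cls}}(p, u) &= -\log p_{u} \\
L_{\mathrm{loc}}(t^{u}, v) &= \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t^{u}_{i} - v_{i}\right) \\
\mathrm{smooth}_{L_1}(x) &=
  \begin{cases}
    0.5\,x^{2} & \text{if } |x| < 1 \\
    |x| - 0.5 & \text{otherwise}
  \end{cases}
\end{align}
```

The indicator [u >= 1] switches off the localization term for background regions, which have no ground-truth box.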

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R. and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T. & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J. & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F. & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).

[Diagram: Conv layers → Conv Layer 5 → RPN → RPN proposals → RoI pooling layer → FC layers → class scores → class probabilities]

RPN outputs at each location:

Objectness scores (object / no object)
Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios).
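The k = 9 reference boxes per location can be sketched as the product of 3 scales and 3 aspect ratios. The concrete values below (scales 128, 256, 512 and ratios 1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper and are an assumption with respect to this slide; `make_anchors` is a hypothetical helper.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred
    at (cx, cy). Each box has area scale**2 and height/width = ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)   # 9 boxes sharing the same centre
```

The RPN then predicts, for each of these k boxes, an objectness score and a set of regression offsets.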

[Diagram: same architecture, with the Fast R-CNN part highlighted]

4-step training to share features for RPN and Fast R-CNN

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit: Míriam Bellver

Set of actions A:

Transformation actions
Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Set of states S: (o, h)

o = feature vector from the fc6 layer of a pre-trained CNN (4096 dimensions)
h = history of taken actions, a binary vector of dimension 90
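A small sketch of how such a state could be assembled. It assumes the 90 dimensions come from one-hot encoding the last 10 actions over 9 possible actions (8 transformations plus the trigger), as described in the cited paper; the constant and function names are hypothetical.

```python
NUM_ACTIONS = 9     # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10    # assumption: last 10 actions, so 9 * 10 = 90 dimensions

def history_vector(past_actions):
    """One-hot encode the most recent actions into a binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        h[slot * NUM_ACTIONS + action] = 1
    return h

def make_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action history h."""
    return list(o) + history_vector(past_actions)

state = make_state([0.0] * 4096, [2, 8])   # placeholder features, two past actions
```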

Reward function R: [formula shown as an image on the slide, relating the current bounding box to the ground-truth bounding box]

Reward function R for the trigger action: reward magnitude 3, with a minimum IoU of 0.6.

The reward function considers the number of steps as a cost.
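The two rewards can be sketched as below, following the formulation in the cited paper as far as it can be matched to the slide: a movement is rewarded by the sign of the IoU change, and the trigger by +3 or -3 depending on whether the final IoU reaches 0.6. Treating an unchanged IoU as a negative reward is an assumption of this sketch.

```python
def movement_reward(iou_before, iou_after):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou_after > iou_before else -1.0

def trigger_reward(iou, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if iou >= tau else -eta

step_reward = movement_reward(0.3, 0.5)   # the box moved closer to the object
good_stop = trigger_reward(0.7)           # stopped with enough overlap
bad_stop = trigger_reward(0.5)            # stopped too early
```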

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and the Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
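A minimal sketch of the greedy policy and of the standard Q-learning target used to train an action-value network. The Q-values are hard-coded stand-ins for network outputs, and the discount factor 0.9 is an assumption.

```python
def greedy_action(q_values):
    """Policy: index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: r + gamma * max over a' of Q(s', a'); just r at the end."""
    return reward if terminal else reward + gamma * max(next_q_values)

q = [0.1, 0.7, -0.2, 0.4]          # hypothetical network outputs, one per action
action = greedy_action(q)
target = q_target(1.0, [0.5, 0.2])
```

In training, (state, action, reward, next state) tuples would be sampled from the replay memory and the network output for the taken action regressed towards this target.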

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz: 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise.

Total samples: 200K positive and 20M negative.

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
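Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3), since an octave doubles the image size. A minimal sketch, where the scale range 0.5 to 2.0 is an arbitrary assumption for the example:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, multiplying by
    2 ** (1 / steps_per_octave) at each step."""
    step = 2.0 ** (1.0 / steps_per_octave)
    scales, s = [], min_scale
    while s <= max_scale * (1 + 1e-9):   # small tolerance for float rounding
        scales.append(s)
        s *= step
    return scales

scales = pyramid_scales(0.5, 2.0)   # two octaves: 0.5, ~0.63, ~0.79, 1.0, ~1.26, ~1.59, 2.0
```

The detector is then run on each rescaled image, so a fixed-size window covers faces of different sizes.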

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

This makes it possible to:

Efficiently run the convnet on images of any size
Obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
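A generic greedy NMS sketch: keep the highest-scoring detection, drop every remaining detection that overlaps it too much, and repeat. This is the textbook variant, not necessarily the exact strategy used by DDFD or in the cited tutorial; the IoU threshold of 0.5 is an assumption.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first and is suppressed
```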

Detection Faces DDFD Results

Precision vs Recall curves:

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

[Diagram: Faces → FaceNet → Euclidean space where distances correspond to face similarity]

End-to-end learning of an embedding (distance metric learning), by means of well-chosen triplets, using curriculum learning.

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
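Learning from triplets can be sketched with the triplet loss from the FaceNet paper: an anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The 2-D embeddings and the margin 0.2 are toy assumptions; triplet selection and the curriculum are not modelled.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin) for one triplet."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

a, p = (1.0, 0.0), (0.0, 1.0)
easy = triplet_loss(a, p, (-1.0, 0.0))   # negative already far away: no loss
hard = triplet_loss(a, p, (1.0, 0.0))    # negative sits on the anchor: large loss
```

Triplets like `easy` contribute no gradient, which is why the choice of triplets matters for training.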

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.

[Figure: a visual query is searched in an image database; expected outcome when the query means "a dog"]

[Figure: a visual query is searched in an image database; expected outcome when the query means "this dog"]

Instance retrieval (instance: object, building, person, place…)

Bag of Visual Words with local hand-crafted features (e.g. SIFT):

Local features v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space
Bag of Visual Words: high-dimensional, highly sparse
[Table: INVERTED FILE, with columns "word" and "Image ID"]
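An inverted file maps each visual word to the IDs of the images that contain it, so a query only touches images sharing at least one word with it. A minimal sketch with made-up words and image IDs; ranking by the number of shared words is a simplification chosen for the example.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: {image_id: [visual word ids]} -> {word: set of image ids}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index

def candidates(index, query_words):
    """Rank images by how many distinct query words they contain."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]})
ranked = candidates(index, [1, 3])   # image 12 contains both query words
```

Because the bag-of-words vectors are highly sparse, each word's posting list stays short and the lookup is cheap.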

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters are shared by the three convnets

theta_i = theta_0: filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Objects Segmentation Farabet

Upsampling and concatenation
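The upsampling and concatenation step can be sketched as below: the coarser feature maps are brought back to the resolution of the finest one and the per-pixel feature vectors are stacked. This is only an illustrative sketch; nearest-neighbour upsampling, the scale factors 1, 2 and 4, and the function names are assumptions, not details taken from the slides.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every cell `factor` times along both spatial axes."""
    upsampled = []
    for row in feature_map:
        wide_row = [cell for cell in row for _ in range(factor)]
        upsampled.extend([list(wide_row) for _ in range(factor)])
    return upsampled


def concatenate_scales(maps_and_factors):
    """Upsample each (map, factor) pair and concatenate per-pixel features."""
    upsampled = [upsample_nearest(m, f) for m, f in maps_and_factors]
    height, width = len(upsampled[0]), len(upsampled[0][0])
    return [[sum((u[y][x] for u in upsampled), [])
             for x in range(width)]
            for y in range(height)]


# Toy pyramid (assumption): 1-D features at 4x4, 2x2 and 1x1 resolution.
fine = [[[float(4 * y + x)] for x in range(4)] for y in range(4)]
mid = [[[10.0], [11.0]], [[12.0], [13.0]]]
coarse = [[[99.0]]]
dense = concatenate_scales([(fine, 1), (mid, 2), (coarse, 4)])
```

Each pixel of `dense` now carries one feature per scale, so a classifier applied to it sees both local detail and wider context.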

Objects Segmentation Farabet

Pixel-wise soft-max classifier
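A pixel-wise soft-max classifier turns the class scores of every pixel into a probability distribution and labels the pixel with the most probable class. A minimal sketch, assuming the scores are already available as a grid of per-pixel lists (the function names and toy scores are hypothetical):

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Return (probabilities, labels) for a grid of per-pixel class scores."""
    probabilities = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probabilities]
    return probabilities, labels


# Toy 1x2 image with 3 classes (assumption).
toy_probabilities, toy_labels = label_pixels([[[2.0, 0.0, -1.0], [0.0, 3.0, 0.0]]])
```

Because each pixel is classified independently, neighbouring pixels can receive unrelated labels.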

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels
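One way to read the superpixel solution is sketched below: the per-pixel class distributions are pooled inside each superpixel and every pixel of the segment receives the winning class. This is a simplified sketch under that assumption, not the exact procedure of the paper; the superpixel ids and probabilities are toy values.

```python
def smooth_with_superpixels(probabilities, superpixels):
    """Label every pixel with the dominant class of its superpixel.

    probabilities[y][x] is a class distribution, superpixels[y][x] a segment id.
    """
    sums = {}
    for prob_row, id_row in zip(probabilities, superpixels):
        for probs, segment in zip(prob_row, id_row):
            acc = sums.setdefault(segment, [0.0] * len(probs))
            for k, p in enumerate(probs):
                acc[k] += p
    # The argmax of the summed distribution equals the argmax of its mean.
    winners = {s: max(range(len(acc)), key=acc.__getitem__)
               for s, acc in sums.items()}
    return [[winners[s] for s in id_row] for id_row in superpixels]


# Toy 1x3 image, 2 classes, two superpixels (assumption).
toy_probs = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
toy_segments = [[0, 0, 1]]
smoothed = smooth_with_superpixels(toy_probs, toy_segments)
```

The middle pixel would be labelled 1 on its own, but it follows its superpixel and becomes 0, which is the spatial consistency the slide asks for.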

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
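The selection rule can be sketched as a walk from each leaf up to the root, keeping the node with the lowest cost S. The tree layout and the cost values below are hypothetical, chosen only so that C2 ends up labelled through C5 as in the slide; the names `parent`, `cost` and `optimal_cover` are illustrative.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the lowest-cost node on its path to the root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root maps to None, ending the walk
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: C1..C4 are leaves, C7 is the root.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.3, "C4": 0.4,
        "C5": 0.3, "C6": 0.6, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so no observation level has to be fixed by hand.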

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; the two are concatenated as [1 2].

The network is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: a BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2; the two are concatenated as [1 2].

Training: the 2 networks are trained in isolation

Testing: results are combined

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer

SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
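Pixel IU (the Jaccard index) for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch with toy label maps (the function name and data are illustrative):

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category`."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    return intersection / union if union else float("nan")


# Toy 2x3 label maps (assumption).
toy_prediction = [[1, 1, 0],
                  [0, 1, 0]]
toy_truth = [[1, 0, 0],
             [0, 1, 1]]
toy_iu = pixel_iu(toy_prediction, toy_truth, category=1)
```

Here two pixels agree on category 1 and four pixels carry it in at least one map, giving an IU of 0.5.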

One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Proposals DeepBox Architecture

Slides credit: Marc Bolaños

AlexNet architecture (heavier): PASCAL VOC AUC = 0.75 at IoU = 0.5, AUC = 0.62 at IoU = 0.7

DeepBox architecture (lighter): PASCAL VOC AUC = 0.74 at IoU = 0.5, AUC = 0.60 at IoU = 0.7

Small drop

Proposals DeepBox Training

Slides credit: Marc Bolaños

1) Initialize layers with AlexNet weights

2) Train on Sliding Windows

Negative samples: extract windows by raster scanning

Positive samples: having GT bounding boxes, they generate [...] samples per instance with a perturbation of [...]

3) Train on Hard Negatives

By using bottom-up proposals from Edge Boxes:

If GT overlap <= 0.3 → negative samples

If GT overlap >= 0.7 → positive samples
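The overlap rule above can be sketched with an intersection-over-union (IoU) helper and the two thresholds from the slide. Treating proposals that fall between 0.3 and 0.7 as "ignored" is an assumption of this sketch, as are the function names and the toy boxes.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_proposal(proposal, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Label a proposal by its best overlap with any ground-truth box."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"
    if best <= neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between proposals are not used


toy_gt = [(0, 0, 10, 10)]
toy_labels = [label_proposal(p, toy_gt)
              for p in [(0, 0, 10, 8), (5, 5, 15, 15), (0, 0, 10, 5)]]
```

The three toy proposals overlap the ground truth with IoU 0.8, about 0.14 and 0.5 respectively.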

Proposals DeepBox Results

[Figure: DeepBox vs. Edge Boxes]

Slides credit: Marc Bolaños

Proposals DeepBox Results

With a rather simple approach, ConvNets can obtain much better results than previous techniques for Object Proposals.

Slides credit: Marc Bolaños

Proposals DeepBox Results

Increases not only the detection capabilities for known classes but also for unknown ones (suitable for Object Discovery).

Slides credit: Marc Bolaños

One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation

Detection Objects

Detection Objects

Hand-crafted features: DPM (HOG features) [1]

Deep features: R-CNN [2], SPPnet [3]

+60

Slide credit: Amaia Salvador

Detection Objects

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Same as SPP [3] but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

RoI pooling on CONV5: a region of interest of size h x w is divided into H' x W' bins.

Size of pooling bins: h/H' x w/W'

Max pooling within each bin
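RoI pooling can be sketched on a single-channel feature map: the h x w region is split into H' x W' bins of roughly h/H' x w/W' cells and the maximum of each bin is kept. The bin boundaries below use floor and ceiling rounding, which is one common choice and an assumption of this sketch; it also assumes h >= H' and w >= W'.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (top, left, h, w) into an out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        y0 = top + (i * h) // out_h
        y1 = top + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            x0 = left + (j * w) // out_w
            x1 = left + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled


# Toy 4x4 "CONV5" map (assumption) with values 0..15.
conv5 = [[4 * y + x for x in range(4)] for y in range(4)]
pooled_full = roi_max_pool(conv5, (0, 0, 4, 4), 2, 2)
pooled_crop = roi_max_pool(conv5, (1, 1, 3, 3), 2, 2)
```

Whatever the size of the region, the output always has H' x W' values, which is what lets regions of different shapes feed the same fully connected layers.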

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Multi-task loss
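A multi-task loss of this kind sums a classification term and a box-regression term that is only active for non-background regions. The sketch below follows the form described in the Fast R-CNN paper (log loss plus smooth L1 on the four box offsets); the balancing weight `lam`, the convention that class 0 is background, and the function names are assumptions of this example.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(class_probs, true_class, predicted_offsets,
                    target_offsets, lam=1.0):
    """Classification log loss plus box regression loss for foreground RoIs."""
    classification = -math.log(class_probs[true_class])
    if true_class == 0:  # assumption: class 0 is background, no box term
        return classification
    localization = sum(smooth_l1(p - t)
                       for p, t in zip(predicted_offsets, target_offsets))
    return classification + lam * localization


toy_loss = multi_task_loss([0.2, 0.8], 1, [0.5, 0.0, 0.0, 2.0], [0.0] * 4)
```

Training both heads with one loss is what allows the classifier and the box regressor to share the same network.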

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Object Proposal computation is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: conv layers → Conv Layer 5 → RPN → RPN proposals → RoI pooling layer → FC layers → class scores / class probabilities]


Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
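The k = 9 reference boxes at one sliding position can be sketched by crossing 3 scales with 3 aspect ratios. The concrete scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) used below are assumptions for illustration, as is the function name; each anchor keeps an area of scale squared while its shape changes.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    `ratio` is height / width; every anchor has area scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


toy_anchors = make_anchors(0.0, 0.0)
```

For every anchor the network then predicts an objectness score and a box refinement, which is where the objectness and regression outputs come from.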


Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train RPN initialized with an ImageNet pre-trained model

ImageNet weights (fine-tuned)

[Diagram: conv layers → Conv Layer 5 → RPN → RPN proposals]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 2: Train Fast R-CNN with learned RPN proposals

ImageNet weights (fine-tuned)

[Diagram: conv layers → Conv Layer 5, RPN proposals (learned in Step 1) → class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 3: The model trained in Step 2 is used to initialize RPN and train again

Weights from Step 2 (fixed)

[Diagram: conv layers → Conv Layer 5 → RPN → RPN proposals]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3

Weights from Steps 2 & 3 (fixed)

[Diagram: conv layers → Conv Layer 5, RPN proposals (learned in Step 3) → class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Detection Objects Reinforcement Learning

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A

Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of states S

(o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)
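The state (o, h) can be sketched as the concatenation of the 4096-D feature vector and a 90-D binary history. The split of the 90 dimensions into 10 past actions, each one-hot encoded over 9 possible actions, is an assumption of this sketch that matches the stated size; the function names are hypothetical.

```python
def encode_history(past_actions, num_actions=9, history_length=10):
    """One-hot encode the most recent actions into a flat binary vector."""
    recent = past_actions[-history_length:]
    vector = [0] * (num_actions * history_length)
    for slot, action in enumerate(recent):
        vector[slot * num_actions + action] = 1
    return vector


def build_state(features, past_actions):
    """Concatenate the visual features o with the action history h."""
    return list(features) + encode_history(past_actions)


toy_state = build_state([0.0] * 4096, [8, 0])
```

The resulting vector has 4096 + 90 = 4186 entries and is what the agent observes at every step.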

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward Function R

[Figure: ground-truth bounding box]

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward Function R for trigger action

The reward function considers the number of steps as a cost

Reward magnitude: 3

Minimum IoU: 0.6
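The two rewards can be sketched as follows. The movement reward is positive when a transformation improves the IoU with the ground-truth box, and the trigger reward is +3 when the final box reaches the minimum IoU of 0.6 and -3 otherwise. The exact form is an assumption based on the values shown on the slide, and a transformation that leaves the IoU unchanged is treated here as a penalty.

```python
def movement_reward(iou_before, iou_after):
    """+1 if the transformation improved the overlap, otherwise -1."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou_at_trigger, eta=3.0, min_iou=0.6):
    """+eta if the search stops on a good enough box, otherwise -eta."""
    return eta if iou_at_trigger >= min_iou else -eta


toy_rewards = [movement_reward(0.30, 0.45),
               movement_reward(0.45, 0.40),
               trigger_reward(0.65),
               trigger_reward(0.50)]
```

Because every non-improving step costs a reward unit, shorter search sequences are favoured.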

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
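The policy can be sketched as an argmax over the Q-network outputs, one value per action. The optional epsilon-greedy exploration is an assumption of this sketch (it is a common choice while training with Q-learning), and the Q-values below are made-up numbers rather than network outputs.

```python
import random


def select_action(q_values, epsilon=0.0, rng=random):
    """Pick the action with the highest estimated value.

    With probability `epsilon` a random action is taken instead (exploration).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Toy Q-values for 9 actions (assumption).
toy_q = [0.1, 0.7, 0.2, -0.3, 0.0, 0.4, 0.65, -1.0, 0.3]
greedy_action = select_action(toy_q)
```

At test time epsilon is zero, so the agent always follows the action the network currently rates highest.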

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015.

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative samples otherwise.

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
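Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch that lists the scale factors; the number of octaves covered in each direction is an assumption of the example.

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors from 2**-octaves_down to 2**octaves_up, inclusive."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2.0 ** (step / steps_per_octave) for step in range(first, last + 1)]


toy_scales = pyramid_scales(octaves_down=1, octaves_up=1)
```

Running a fixed-size detector on every rescaled copy lets it respond to faces that are larger or smaller than its input window.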

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
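Greedy non-maximum suppression can be sketched as follows: detections are visited from the highest to the lowest score, and a detection is kept only if it does not overlap too much with one already kept. This is a generic sketch rather than the code from the cited post; the IoU threshold of 0.5 and the toy boxes are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for index in order:
        if all(iou(boxes[index], boxes[k]) <= iou_threshold for k in kept):
            kept.append(index)
    return kept


toy_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
toy_scores = [0.9, 0.8, 0.7]
toy_kept = non_maximum_suppression(toy_boxes, toy_scores)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.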

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

By means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
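A triplet consists of an anchor face, a positive (same identity) and a negative (different identity). The triplet loss asks the anchor to be closer to the positive than to the negative by at least a margin in the embedding space. A minimal sketch on toy 2-D embeddings; the margin value 0.2 and the function names are assumptions of the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


easy_triplet = triplet_loss([0.0, 0.0], [0.0, 1.0], [2.0, 0.0])
hard_triplet = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Triplets whose loss is already zero carry no gradient, which is why the choice of triplets matters for training.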

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6.

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

[Figure: a visual query ("A dog") is matched against an image database to obtain the expected outcome]

Objects Recognition Retrieval

[Figure: a visual query ("This dog") is matched against an image database to obtain the expected outcome]

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place...)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

v1 = (v11, ..., v1n)

vk = (vk1, ..., vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

Columns: word, Image ID. Flattened entries: 1 1 12 2 1 30 1023 10 124 23 6 10
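An inverted file maps each visual word to the images that contain it, so a query only touches images sharing at least one word with it. A minimal sketch; the image IDs and word lists are invented toy data, and ranking by the number of shared words is a simplification of the scoring used in real systems.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map every visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def rank_candidates(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Toy database (assumption): image ID -> visual words found in the image.
toy_database = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 6]}
toy_index = build_inverted_file(toy_database)
toy_ranking = rank_candidates(toy_index, [1, 2])
```

Since bag-of-words vectors are highly sparse, each word lists only a few images, which keeps the lookup cheap even for large databases.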

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision, CVPRW 2014.

Convolutional Neural Networks

FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

Pyramid of three spatial scales

The same parameters are used in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels
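One way to picture how superpixels enforce spatial consistency is sketched below. It assumes that per-pixel class distributions are averaged inside each superpixel and that every pixel of the superpixel then takes the class with the highest average. The function name `label_superpixels` and the toy 2x4 image are hypothetical; this is a sketch, not the authors' code.

```python
import numpy as np

def label_superpixels(class_probs, superpixels):
    """class_probs: (H, W, C) per-pixel class distributions.
    superpixels: (H, W) integer superpixel ids."""
    labels = np.zeros(superpixels.shape, dtype=int)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        mean_dist = class_probs[mask].mean(axis=0)  # average distribution in the superpixel
        labels[mask] = mean_dist.argmax()           # one label for the whole superpixel
    return labels

# Toy example: two classes, two superpixels (left half and right half)
class_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]],
    [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.6, 0.4]],
])
superpixels = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1]])

pixel_labels = class_probs.argmax(axis=2)               # noisy per-pixel decision
smooth_labels = label_superpixels(class_probs, superpixels)
```

In this toy case the per-pixel decision disagrees with its neighbours at two pixels, while the superpixel labelling is constant inside each region.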

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
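The selection rule stated above (for each leaf, take the component with minimal cost on the path to the root) can be sketched as follows. The tree, the component names and the cost values are hypothetical and chosen only so that leaf C2 ends up labelled by C5, as in the example on the slide. In this toy tree the leaves are small components rather than single pixels.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on the leaf-to-root path.
    parent: dict node -> parent node (None for the root).
    cost: dict node -> cost S of that component."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best

# Hypothetical segmentation tree: C7 is the root, C5 and C6 are its children
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.5,
        "C5": 0.3, "C6": 0.4, "C7": 0.9}

cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

With these costs, C2 is explained best by its parent C5, while C1 keeps its own level.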

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image. Both are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: the BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2, concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.

Vector C: training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).

Penultimate fully connected layer → SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
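For reference, pixel IU (the Jaccard index) for a class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch follows; the function name `pixel_iu` and the toy label maps are hypothetical.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union between two label maps of equal shape.
    Classes absent from both maps are reported as NaN."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

pred = np.array([[0, 0, 1],
                 [0, 1, 1]])
gt = np.array([[0, 1, 1],
               [0, 1, 1]])

ious = pixel_iu(pred, gt, num_classes=2)  # class 0: 2/3, class 1: 3/4
mean_iu = float(np.nanmean(ious))
```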

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 9: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Proposals DeepBox Training

Slides credit: Marc Bolaños

1) Initialize layers with AlexNet weights

2) Train on Sliding Windows

- Negative Samples: extract windows by raster scanning
- Positive Samples: having GT bounding boxes, they generate samples per instance with a perturbation of

3) Train on Hard Negatives, by using bottom-up proposals from Edge Boxes

If GT overlap <= 0.3 → Negative Samples

If GT overlap >= 0.7 → Positive Samples
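The two overlap thresholds can be read as a labelling rule on the intersection over union (IoU) between a candidate window and the ground-truth boxes. The sketch below assumes boxes in (x1, y1, x2, y2) format and assumes that windows falling between the two thresholds are simply left unused. The names `iou` and `label_window` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, gt_boxes, neg_thr=0.3, pos_thr=0.7):
    """Return 1 (positive), 0 (negative) or None (ignored, an assumption)."""
    best = max((iou(window, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return 1
    if best <= neg_thr:
        return 0
    return None

gt_boxes = [(0, 0, 10, 10)]
labels = [label_window(w, gt_boxes) for w in
          [(0, 0, 10, 10),     # perfect overlap
           (20, 20, 30, 30),   # no overlap
           (0, 0, 10, 5)]]     # IoU = 0.5, between the thresholds
```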

Proposals DeepBox Results

Slides credit: Marc Bolaños

[Figure: DeepBox vs. Edge Boxes]

With a rather simple approach, ConvNets can obtain much better results than previous techniques for Object Proposals

DeepBox increases detection capabilities not only for known classes but also for unknown ones (suitable for Object Discovery)

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Detection Objects

Hand-crafted features: DPM (HOG features) [1]

Deep features: R-CNN [2], SPPnet [3]

+60

Slide credit: Amaia Salvador

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Same as SPP [3] but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

RoI pooling on CONV5: an h x w region of interest is max pooled into a fixed H' x W' output

Size of pooling bins: h/H' x w/W'
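The bin-size rule above (each pooling bin covers roughly h/H' x w/W' cells of the region of interest) can be sketched on a single-channel feature map. The function name `roi_max_pool`, the (y0, x0, h, w) region format and the floor/ceil rounding of bin borders are assumptions made for illustration.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """feature_map: (H, W) array; roi: (y0, x0, h, w) in feature-map cells.
    Returns an (out_h, out_w) array with the max of each pooling bin."""
    y0, x0, h, w = roi
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ys = y0 + int(np.floor(i * h / out_h))
        ye = y0 + int(np.ceil((i + 1) * h / out_h))
        for j in range(out_w):
            xs = x0 + int(np.floor(j * w / out_w))
            xe = x0 + int(np.ceil((j + 1) * w / out_w))
            out[i, j] = feature_map[ys:ye, xs:xe].max()
    return out

feature_map = np.arange(36).reshape(6, 6)
# 4x4 region starting at row 1, column 1, pooled to 2x2 (bins of 2x2 cells)
pooled = roi_max_pool(feature_map, roi=(1, 1, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the region, the output has the same fixed shape, which is what lets the following fully connected layers accept it.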

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
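For reference, the multi-task loss of Fast R-CNN combines a classification term and a box regression term. The formula below is written from the Fast R-CNN paper cited above, not from the slide, so the symbols are assumptions about notation: p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the regression target and lambda a balancing weight.

```latex
L(p, u, t^{u}, v) = L_{\text{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\text{loc}}(t^{u}, v),
\qquad
L_{\text{cls}}(p, u) = -\log p_{u}
```

The indicator [u >= 1] switches the localization term off for background regions, and L_loc is a smooth L1 loss over the four box coordinates.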

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Conv layers → Conv Layer 5 → RPN → RPN Proposals

Conv Layer 5 + RPN Proposals → RoI pooling layer → FC layers → Class scores → Class probabilities

RPN outputs:

- Objectness scores (object / no object)
- Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
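The k = 9 reference boxes per location can be enumerated as every combination of 3 scales and 3 aspect ratios. In the sketch below the scale values (128, 256, 512 pixels) and the ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper, not values stated on the slide. The function name `make_anchors` and the centred (x1, y1, x2, y2) format are hypothetical.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (len(scales) * len(ratios), 4) array of boxes centred at (0, 0).
    Each box has area scale**2 and height/width equal to the ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
```

At every position of the conv feature map, the RPN would then predict one objectness score pair and one box refinement for each of these k boxes.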

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the learned RPN proposals (learned in 1). Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in 2 is used to initialize RPN and train again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in 3, with the RPN proposals learned in 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]

Object is localized based on visual features from AlexNet FC6

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of states S: (o, h)

- o = feature vector from pre-trained CNN (fc6, 4096 dim)
- h = history of taken actions (binary vector, dim 90)

Reward function R, computed against the ground-truth bounding box

Reward function R for the trigger action: magnitude 3, minimum IoU 0.6

The reward function considers the number of steps as a cost
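A compact way to read the two reward functions is sketched below. It assumes, following the Caicedo and Lazebnik paper rather than the slide text, that a transformation action is rewarded by the sign of the IoU change with respect to the ground-truth box, and that the trigger action earns +3 when the final IoU reaches the 0.6 minimum and -3 otherwise. The function names and the handling of an unchanged IoU as a penalty are assumptions.

```python
def move_reward(iou_before, iou_after):
    """Reward for a transformation action: +1 if the box got closer to the
    ground truth, -1 otherwise (an unchanged IoU is treated as a penalty)."""
    return 1.0 if iou_after > iou_before else -1.0

def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """Reward for the trigger action: +eta if the final box overlaps the
    ground truth with IoU >= tau, -eta otherwise."""
    return eta if final_iou >= tau else -eta

rewards = [
    move_reward(0.20, 0.35),   # box improved
    move_reward(0.35, 0.30),   # box got worse
    trigger_reward(0.72),      # stopped on a good box
    trigger_reward(0.41),      # stopped too early
]
```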

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. Category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
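Why a fully connected layer can be run as a convolution is easiest to see in a small numerical sketch. A layer trained on k x k inputs is reshaped into k x k kernels, and sliding those kernels over a larger feature map yields one output vector per window position, which is the heat-map. The function name `fc_as_conv`, the toy sizes and the random weights are hypothetical, and the explicit loops favour clarity over speed.

```python
import numpy as np

def fc_as_conv(feature_map, fc_weights, fc_bias, k):
    """feature_map: (C, H, W); fc_weights: (n_out, C*k*k) for k x k inputs.
    Returns (n_out, H-k+1, W-k+1): the FC layer applied at every window."""
    C, H, W = feature_map.shape
    kernels = fc_weights.reshape(-1, C, k, k)  # same weights, viewed as conv kernels
    out = np.empty((kernels.shape[0], H - k + 1, W - k + 1))
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            patch = feature_map[:, y:y + k, x:x + k]
            out[:, y, x] = np.tensordot(kernels, patch, axes=3) + fc_bias
    return out

rng = np.random.default_rng(0)
C, k, n_out = 2, 3, 4
fc_weights = rng.normal(size=(n_out, C * k * k))
fc_bias = rng.normal(size=n_out)

feature_map = rng.normal(size=(C, 5, 7))          # larger than the k x k training size
heat_map = fc_as_conv(feature_map, fc_weights, fc_bias, k)

# The original FC layer applied to one window gives the same values
window = feature_map[:, 1:1 + k, 2:2 + k]
fc_output = fc_weights @ window.ravel() + fc_bias
```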

Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
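A common greedy formulation of NMS keeps the highest-scoring box, discards every remaining box that overlaps it beyond an IoU threshold, and repeats. The sketch below follows that formulation. It is not the code from the cited blog post, and the function name `nms`, the 0.5 threshold and the toy boxes are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Overlap between the kept box and every remaining box
        w = np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0])
        h = np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1])
        inter = np.clip(w, 0, None) * np.clip(h, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]  # drop boxes that overlap too much
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box is far away and survives.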

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
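Learning an embedding from triplets is usually expressed as a triplet loss: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below assumes squared Euclidean distances and a margin of 0.2. The function name `triplet_loss` and the 2-D toy embeddings are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embeddings of shape (..., D)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor to same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor to other identity
    return np.maximum(d_pos - d_neg + margin, 0.0)

anchor = np.array([1.0, 0.0])
same_person = np.array([1.0, 0.0])
other_person = np.array([0.0, 1.0])

easy = triplet_loss(anchor, same_person, other_person)   # already well separated
hard = triplet_loss(anchor, other_person, same_person)   # roles swapped: violates the margin
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of triplets matters for training.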

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image Database + Visual Query → Expected outcome

“A dog”

“This dog”

Instance Retrieval (Instance: Object, Building, Person, Place…)

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE (word → Image IDs)
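Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. A query then touches just the lists of its own words. The sketch below uses hypothetical names (`build_inverted_file`, `search`) and toy data, and ranks images by the number of shared words, which is a simplification of real scoring schemes.

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word ids."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def search(index, query_words):
    """Rank images by how many distinct query words they contain."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()

database = {"a": [1, 2, 3], "b": [2, 4], "c": [5]}
index = build_inverted_file(database)
ranking = search(index, query_words=[2, 3])
```

Image "c" is never visited during the query because it shares no word with it, which is the source of the efficiency.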

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1: Superpixels

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
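The selection rule above can be sketched as a walk from each leaf to the root of the segmentation tree. The tree shape, the node names and the cost values below are hypothetical, chosen only so that C2 ends up covered by C5 as in the slide; how the cost S is computed is not shown here.

```python
def optimal_cover(parent, cost, leaves):
    """Map each leaf to the minimum-cost node on its leaf-to-root path."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root has no parent, which ends the walk
        best[leaf] = best_node
    return best

# Hypothetical tree: C7 is the root, C5 and C6 are its children.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
# Hypothetical costs S for each component.
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.5, "C5": 0.3, "C6": 0.4, "C7": 0.8}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so here C2 would be labelled with the class of C5.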

Objects Segmentation SDS

109

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)

Objects Segmentation SDS

110

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit: Eduard Fontdevila

[Figure: the bounding box and the region foreground (background masked out with the mean image) are each passed through the BBOX CNN, giving feature vector 1 and feature vector 2, which are concatenated as [1 2] (vector A)]

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal

Background masked out with the mean image
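Masking the background with the mean image can be illustrated with a small sketch. It assumes single-channel images stored as nested lists and a binary region mask; the function name and the toy values are hypothetical.

```python
def mask_background(image, mask, mean_image):
    """Keep pixels inside the region mask; replace the rest with the mean-image value."""
    return [
        [pix if keep else mean for pix, keep, mean in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]

# Toy 2x2 crop, region mask (1 = foreground) and mean image.
image = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]
mean_image = [[5, 5], [5, 5]]
masked = mask_background(image, mask, mean_image)
```

If the network subtracts the mean image before the first layer, the masked pixels become zero at the input, so the background carries no signal.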

Objects Segmentation SDS

112

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Figure: the bounding box goes through the BBOX CNN (feature vector 1) and the region through the REGION CNN (feature vector 2); both are concatenated as [1 2] (vector B)]

Objects Segmentation SDS

113

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Objects Segmentation SDS

114

Slide credit: Eduard Fontdevila

[Figure: penultimate fully connected layer, SVM]

Objects Segmentation SDS

115

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
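As a reminder of what pixel IU measures, here is a minimal sketch of the Jaccard index for one category between a predicted and a ground-truth label map. The nested-list layout and the toy label maps are assumptions made for the example.

```python
def pixel_iu(pred, gt, label):
    """Intersection over union of the pixels assigned to `label` in two label maps."""
    inter = 0
    union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            in_pred, in_gt = (p == label), (g == label)
            inter += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return inter / union if union else float("nan")

# Toy 2x3 label maps with categories 0, 1 and 2.
pred = [[1, 1, 0], [0, 2, 2]]
gt = [[1, 0, 0], [0, 2, 2]]
iu_per_class = {c: pixel_iu(pred, gt, c) for c in (0, 1, 2)}
mean_iu = sum(iu_per_class.values()) / len(iu_per_class)
```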

Objects Segmentation SDS

117

Slide credit: Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

119

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Proposals DeepBox Results

10

[Figure panels: DeepBox, Edge Boxes, DeepBox, Edge Boxes]

Slides credit: Marc Bolaños

Proposals DeepBox Results

11

With a rather simple approach, ConvNets can obtain much better results than previous techniques for Object Proposals

Slides credit: Marc Bolaños

Proposals DeepBox Results

12

Slides credit: Marc Bolaños

Proposals DeepBox Results

13

Improves detection not only for known classes but also for unknown ones (suitable for Object Discovery)

Slides credit: Marc Bolaños

One lecture organized in four parts

14

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

Detection Objects

15

Detection Objects

16

[Figure: DPM (HOG features) [1] with hand-crafted features versus R-CNN [2] and SPPnet [3] with deep features; annotation: +60]

Slide credit: Amaia Salvador

Detection Objects

17

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Parts-based Models (DPMs)

Detection Objects R-CNN

18

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Detection Objects R-CNN

19

Slide credit: Joost van de Weijer

Detection Objects R-CNN

20

Slide credit: Joost van de Weijer

Detection Objects R-CNN

21

Detection Objects Fast R-CNN

22

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Detection Objects Fast R-CNN

23

Slide credit: Amaia Salvador

Detection Objects Fast R-CNN

24

Slide credit: Amaia Salvador

Same as SPP [3], but single scale

Detection Objects Fast R-CNN

25

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

26

Slide credit: Amaia Salvador

[Figure: RoI pooling on CONV5. An h x w region of interest is divided into an H' x W' grid of pooling bins, each of size h/H' x w/W', and max pooling is applied inside each bin]

CONV5
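The pooling-bin arithmetic can be sketched in plain Python for a single-channel feature map. This is an illustrative version, not the Fast R-CNN code: the `(top, left, h, w)` RoI layout and the floor/ceil rounding of bin borders are assumptions made for the example.

```python
def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the region roi = (top, left, h, w) of a 2-D map into out_h x out_w bins."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin borders: floor for the start, ceil for the end, so the bins cover the whole RoI.
        r0 = top + (i * h) // out_h
        r1 = top - ((-(i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left - ((-(j + 1) * w) // out_w)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 4x4 feature map.
fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
full = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
crop = roi_max_pool(fmap, (1, 1, 3, 3), 2, 2)
```

Whatever the RoI size, the output always has H' x W' values per channel, which is what lets RoIs of different sizes feed the same fully connected layers.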

Detection Objects Fast R-CNN

27

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Detection Objects Fast R-CNN

28

Slide credit: Amaia Salvador

Multi-task loss
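The slide only names the multi-task loss, so the sketch below follows the form given in the Fast R-CNN paper: a log loss on the true class plus, for non-background RoIs, a smooth L1 loss on the four box regression targets. The convention that class 0 is background, the argument names and the toy values are assumptions made for the example.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5

def multi_task_loss(probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus a box regression term for foreground RoIs."""
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:  # assumed convention: class 0 is background, no box term
        return cls_loss
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss

# Toy foreground and background RoIs.
loss_fg = multi_task_loss([0.1, 0.9], 1, [0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 2.0])
loss_bg = multi_task_loss([0.8, 0.2], 0, [0.0] * 4, [0.0] * 4)
```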

Detection Objects Faster R-CNN

29

Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

30

Slide credit: Amaia Salvador

[Figure: Selective Search, CPMC, MCG]

Object Proposal computation is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

31

Slide credit: Amaia Salvador

[Figure: Selective Search, CPMC, MCG]

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

32

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture. Conv layers up to Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

33

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture. Conv layers up to Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

34

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
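The k = 9 anchors at one sliding-window position can be enumerated as below. The scale values (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper; the function name and the `(x1, y1, x2, y2)` box layout are assumptions made for this sketch.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchor boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width, the box area stays close to s * s
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0.0, 0.0)
```

For each of these anchors the RPN predicts an objectness score and a bounding box regression, as listed on the slide.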

Detection Objects Faster R-CNN

35

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture. Conv layers up to Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

36

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

37

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture. Conv layers up to Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit: Amaia Salvador

[Figure: conv layers up to Conv Layer 5, RPN, RPN proposals]

Step 1: Train the RPN, initialized with an ImageNet pre-trained model

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

39

Slide credit: Amaia Salvador

[Figure: conv layers up to Conv Layer 5, RPN proposals (learned in Step 1), class probabilities]

Step 2: Train Fast R-CNN with the learned RPN proposals

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

40

Slide credit: Amaia Salvador

[Figure: conv layers up to Conv Layer 5, RPN, RPN proposals]

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again

Weights from Step 2 (fixed)

Detection Objects Faster R-CNN

41

Slide credit: Amaia Salvador

[Figure: conv layers up to Conv Layer 5, RPN proposals (learned in Step 3), class probabilities]

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3

Weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

42

Slide credit: Amaia Salvador

[Table: Detection accuracy (Pascal VOC)]

[Table: Timing in ms (Pascal VOC)]

Detection Objects Faster R-CNN

43

Slide credit: Amaia Salvador

Detection Objects Faster R-CNN

44

Slide credit: Amaia Salvador

Detection Objects Faster R-CNN

45

Slide credit: Amaia Salvador

46

Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]

47

Detection Objects Reinforcement L

Object is localized based on visual features from AlexNet FC6

48

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

49

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with inhibition-of-return (IoR)

50

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of states S

(o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)

51

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R

[Figure: ground-truth bounding box]

52

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R for trigger action

The reward function considers the number of steps as a cost

[Figure annotations: 3; minimum IoU 0.6]
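The two rewards can be sketched as follows, using the form described in the Caicedo and Lazebnik paper: a transformation action is rewarded with the sign of the change in IoU with the ground-truth box, and the trigger action receives +3 when the final IoU reaches the minimum of 0.6 and -3 otherwise. The function names and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation increased the IoU with the ground truth, -1 otherwise."""
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1

def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the box overlaps the ground truth by at least tau when triggering, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta

gt = (0, 0, 10, 10)
far, near = (5, 5, 15, 15), (2, 2, 12, 12)
```

In this sketch `move_reward(far, near, gt)` is positive because moving from `far` to `near` increases the overlap with the ground truth.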

53

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

54

Detection Objects Reinforcement

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
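The greedy policy amounts to an argmax over the outputs of the Q-network. In the sketch below the Q-values are hypothetical numbers standing in for the network output, and the nine action labels are illustrative names for the eight transformation actions plus the trigger.

```python
# Illustrative labels: eight box transformations plus the trigger action.
ACTIONS = ["right", "left", "up", "down", "bigger", "smaller", "fatter", "taller", "trigger"]

def greedy_action(q_values):
    """Return the action whose estimated action value is the largest."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return ACTIONS[best]

# Hypothetical Q-network outputs for two states.
q_searching = [0.10, 0.00, 0.30, -1.00, 0.20, 0.00, 0.00, 0.05, 0.25]
q_on_target = [-0.50, -0.40, -0.60, -0.30, -0.20, -0.10, -0.30, -0.20, 2.10]
```

With these values the agent would move the box up in the first state and trigger in the second.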

55

Detection Objects Reinforcement

Slide credit: Míriam Bellver

56

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

57

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Best performance with few region proposals

58

Detection Objects Reinforcement

Slide credit: Míriam Bellver

59

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train

63

Randomly sampled sub-windows (blocks): a sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes
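Rescaling 3 times per octave means that consecutive scale factors differ by 2^(1/3), so three steps double (or halve) the image size. The sketch below lists such factors; the range of one octave down and one octave up is an assumption made for the example.

```python
def pyramid_scales(octaves_down=1, octaves_up=1, per_octave=3):
    """Scale factors spaced by 2 ** (1 / per_octave), from small to large."""
    first = -octaves_down * per_octave
    last = octaves_up * per_octave
    return [2.0 ** (k / per_octave) for k in range(first, last + 1)]

scales = pyramid_scales()
```

Each rescaled image is then scanned with the same fixed-size detector, so a face of any size matches the detector window at some level of the pyramid.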

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

Detection Faces DDFD Test

67

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
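A common greedy form of non-maximum suppression is sketched below: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap an already kept one by more than a threshold. This is a generic illustration, not necessarily the NMS variant used by DDFD; the 0.5 threshold and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Return the indices of the detections kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

# Two overlapping detections of the same face and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```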

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

72

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

73

Faces Recognition FaceNet

[Figure: faces are mapped by FaceNet into a Euclidean space where distances correspond to face similarity]

74

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

75

Faces Recognition FaceNet

By means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
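The triplets mentioned above feed a triplet loss, which the slides do not spell out. The sketch below follows the form in the FaceNet paper: the squared distance between an anchor and a positive (same identity) should be smaller than the distance to a negative (different identity) by at least a margin. The 2-D embeddings are toy values, and the margin of 0.2 is the value reported in the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

anchor, positive = (0.0, 0.0), (0.1, 0.0)
easy_negative = (1.0, 0.0)   # already far from the anchor, contributes no loss
hard_negative = (0.2, 0.0)   # almost as close as the positive, contributes a loss
easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Easy triplets give a zero loss and therefore no gradient, which is why the choice of triplets matters for training.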

76

Faces Recognition FaceNet

77

Faces Recognition FaceNet

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 1 (NN1): ZF

78

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

79

Faces Recognition FaceNet

80

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

81

Faces Recognition FaceNet Software

Software implementation: OpenFace

82

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

83

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

84

Objects Recognition Retrieval

[Figure: a visual query ("A dog") is matched against an image database; expected outcome]

85

Objects Recognition Retrieval

[Figure: a visual query ("This dog") is matched against an image database; expected outcome]

86

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

87

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
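An inverted file maps each visual word to the images that contain it, so a query only touches the images that share at least one word with it. The sketch below is a minimal illustration; the image IDs, the word assignments and the simple vote-count ranking are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted

def search(inverted, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

# Hypothetical database: image ID -> visual words found in the image.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3]}
inverted = build_inverted_file(image_words)
ranking = search(inverted, [1, 3])
```

Because bag-of-visual-words vectors are highly sparse, each posting list is short and the search skips most of the database.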

88

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

89

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

90

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max pooled conv features as global representation
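Sum or max pooling collapses each channel of a convolutional feature map into one number, giving one global descriptor per image. The sketch below assumes a channels x height x width nested list and adds an L2 normalisation step, which is a common choice but an assumption here.

```python
import math

def pool_conv_features(fmaps, mode="sum"):
    """Collapse each channel (a 2-D map) to one value, then L2-normalise the vector."""
    reduce_fn = sum if mode == "sum" else max
    vec = [reduce_fn(v for row in fmap for v in row) for fmap in fmaps]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

# Toy activations: 2 channels of size 2x2.
fmaps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 5.0]]]
sum_desc = pool_conv_features(fmaps, "sum")
max_desc = pool_conv_features(fmaps, "max")
```

The descriptor length equals the number of channels, regardless of the spatial size of the input image.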

91

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation

92

Objects Recognition Retrieval

93

Objects Recognition Retrieval

[Figure: an image at resolution (336x256) is passed through conv5_1 from VGG16 [1], giving a (42x32) feature map; the local features are assigned to 25K centroids to form a 25K-D vector]
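The encoding step can be sketched as nearest-centroid assignment followed by counting: each local convolutional feature becomes one visual word, and the image is described by the histogram of words. The 2-D features and the three centroids below are toy values standing in for the conv5_1 features and the 25K centroids.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )

def bow_histogram(local_features, centroids):
    """Count how many local features are assigned to each visual word."""
    hist = [0] * len(centroids)
    for feature in local_features:
        hist[nearest_centroid(feature, centroids)] += 1
    return hist

# Toy codebook and local features (one feature per spatial location of the map).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
local_features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
hist = bow_histogram(local_features, centroids)
```

With a (42x32) feature map there are 1344 local features per image, so a 25K-D histogram is mostly zeros.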

94

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

95

Objects Recognition Retrieval

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Proposals DeepBox Results

With a rather simple approach, ConvNets can obtain much better results than previous techniques for object proposals.

They increase not only the detection capabilities for known classes but also for unknown ones (suitable for object discovery).

Slides credit: Marc Bolaños

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Detection Objects

[Figure: hand-crafted features (DPM with HOG features [1]) vs. deep features (R-CNN [2], SPPnet [3]); annotation: +60]

Slide credit: Amaia Salvador

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

ConvNets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T. & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3], but single scale.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

[Figure: RoI pooling over the CONV5 features. An h x w region is max-pooled into an H' x W' grid; size of pooling bins: h/H' x w/W'.]
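The bin arithmetic on this slide can be sketched in a few lines. This is a minimal single-channel illustration, not the Fast R-CNN implementation: the function name `roi_max_pool`, the `(top, left, h, w)` RoI convention and the rounding of bin edges are assumptions made for the example, and the region is assumed to be at least as large as the output grid.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of interest into a fixed out_h x out_w grid.

    feature_map: 2-D array (one channel of a conv feature map).
    roi: (top, left, h, w) in feature-map cells (hypothetical convention).
    Assumes h >= out_h and w >= out_w.
    """
    top, left, h, w = roi
    region = feature_map[top:top + h, left:left + w]
    # Each bin covers roughly h/out_h rows and w/out_w columns.
    row_edges = np.linspace(0, h, out_h + 1).round().astype(int)
    col_edges = np.linspace(0, w, out_w + 1).round().astype(int)
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Guarantee at least one cell per bin.
            r0, r1 = row_edges[i], max(row_edges[i + 1], row_edges[i] + 1)
            c0, c1 = col_edges[j], max(col_edges[j + 1], col_edges[j] + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max()
    return pooled

fmap = np.arange(36).reshape(6, 6)
pooled = roi_max_pool(fmap, roi=(1, 1, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the input region, the output grid has a fixed size, which is what lets the following FC layers accept proposals of any shape.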

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R. and Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T. & Smeulders, A. W. (2011, November). "Segmentation as selective search for object recognition." In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J. & Sminchisescu, C. (2010, June). "Constrained parametric min-cuts for automatic object segmentation." In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F. & Malik, J. (2014). "Multiscale combinatorial grouping." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN).

[Figure: Faster R-CNN architecture. Labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities.]

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios).
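In the Faster R-CNN paper, k is the number of reference boxes (anchors) evaluated at each sliding-window position of the RPN. The sketch below enumerates k = 9 anchors from 3 scales and 3 aspect ratios. The helper `make_anchors` and its default values (a base size of 16 with scale multipliers 8, 16, 32, giving box areas of 128², 256² and 512² pixels) are assumptions chosen for illustration, not values taken from the slides.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred on the origin.

    Each anchor is (x1, y1, x2, y2); ratio is height / width.
    All default values are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2
        for ratio in ratios:
            w = np.sqrt(area / ratio)   # width that preserves the area
            h = w * ratio               # height follows from the ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
```

For every anchor the RPN then predicts the two outputs listed above: an objectness score and a bounding-box regression.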

[Figure: Faster R-CNN architecture, in which the RPN proposals feed Fast R-CNN (RoI pooling layer, FC layers, class scores).]

4-step training to share features between the RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

[Figures: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC).]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

- o = feature vector from a pre-trained CNN (fc6), 4096 dimensions
- h = history of taken actions, binary vector of dimension 90

Reward function R, computed with respect to the ground-truth bounding box.

Reward function R for the trigger action: magnitude 3, minimum IoU 0.6.

The reward function considers the number of steps as a cost.
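A minimal sketch of the two rewards, under stated assumptions. The values 3 and 0.6 come from the slide; the rule for transformation actions (+1 when the IoU with the ground truth improves, -1 otherwise) follows my reading of the cited paper and is not spelled out on the slides. The function names and the `(x1, y1, x2, y2)` box format are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt_box):
    """Assumed rule: +1 if the transformation improved the IoU, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0

def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta

gt = (0, 0, 10, 10)
```

Because each transformation earns at most +1 while a successful trigger earns +3, the agent has an incentive to stop as soon as the box is good enough.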

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
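The policy reduces to an argmax over the Q-network outputs. The sketch below assumes the network has already produced one value per action; the optional epsilon-greedy exploration is a common training-time addition in Q-learning and an assumption here, as is the helper name `select_action`.

```python
import numpy as np

def select_action(q_values, epsilon=0.0, rng=None):
    """Pick the action with the highest estimated value.

    With probability epsilon a random action is taken instead
    (exploration; epsilon = 0 gives the purely greedy policy).
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Hypothetical Q-values for three actions.
action = select_action(np.array([0.1, 2.0, -1.0]))
```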

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz.

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Training samples are randomly sampled sub-windows (blocks). A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.

Total samples: 200K positive and 20M negative.
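The labelling rule above can be sketched as follows. The helper names and the `(x1, y1, x2, y2)` box format are assumptions for the example; only the 50% threshold comes from the slide.

```python
import numpy as np

def iou_with_faces(window, faces):
    """IoU between one window and an array of face boxes (x1, y1, x2, y2)."""
    faces = np.asarray(faces, dtype=float).reshape(-1, 4)
    x1 = np.maximum(window[0], faces[:, 0])
    y1 = np.maximum(window[1], faces[:, 1])
    x2 = np.minimum(window[2], faces[:, 2])
    y2 = np.minimum(window[3], faces[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_w = (window[2] - window[0]) * (window[3] - window[1])
    area_f = (faces[:, 2] - faces[:, 0]) * (faces[:, 3] - faces[:, 1])
    return inter / (area_w + area_f - inter)

def label_window(window, faces, threshold=0.5):
    """1 (positive) if the IoU with any annotated face exceeds the threshold."""
    if len(faces) == 0:
        return 0
    return int(iou_with_faces(window, faces).max() > threshold)

faces = [(10, 10, 50, 50)]  # hypothetical annotation
```

Most random windows overlap no face at all, which is consistent with the strong imbalance between positive and negative samples reported above.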

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
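One octave is a factor of 2 in scale, so 3 steps per octave means consecutive scales differ by a factor of 2^(1/3). The sketch below lists the resulting scale factors; the range from 0.25 to 2.0 is an arbitrary assumption for the example, not a value from the slides.

```python
import math

def pyramid_scales(min_scale=0.25, max_scale=2.0, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    octaves = math.log2(max_scale / min_scale)
    n_steps = int(math.floor(octaves * steps_per_octave + 1e-9))
    return [min_scale * 2.0 ** (k / steps_per_octave) for k in range(n_steps + 1)]

scales = pyramid_scales()
```

Each rescaled image is then scanned with the same fixed-size detector, so a face of any size matches the detector's window at some level of the pyramid.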

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
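A minimal sketch of greedy NMS, assuming boxes in `(x1, y1, x2, y2)` format with one confidence score each. This is the generic textbook procedure rather than the exact variant used in DDFD, and the 0.3 overlap threshold is an illustrative assumption.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU between the best box and every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep

kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])
```

In the example, the second box overlaps the first heavily and is suppressed, while the distant third box survives.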

Detection Faces DDFD Results

Precision vs. recall curves:

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
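The triplet objective behind this embedding can be sketched as below: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The margin value 0.2 and averaging over the batch are assumptions for the example; the embeddings are random stand-ins, not network outputs.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared Euclidean distances, averaged over the batch."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

a = l2_normalize(np.array([[1.0, 0.0]]))
same = l2_normalize(np.array([[1.0, 0.0]]))
other = l2_normalize(np.array([[0.0, 1.0]]))
easy = triplet_loss(a, same, other)   # constraint already satisfied
hard = triplet_loss(a, other, same)   # constraint violated
```

Triplets that already satisfy the margin contribute zero loss, which is why the choice of triplets matters for training.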

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record). YouTube Faces DB: 95.12%.

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

[Figure: a visual query against an image database, with the expected outcome for the query "A dog".]

[Figure: a visual query against an image database, with the expected outcome for the query "This dog".]

Instance retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

[Figure: inverted file, a table mapping each visual word to the IDs of the images that contain it.]
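Because bag-of-visual-words vectors are highly sparse, an inverted file only has to visit the images that share at least one word with the query. A minimal sketch follows; the image IDs, word IDs and the vote-counting ranking are made up for illustration.

```python
from collections import Counter, defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def query(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return [image_id for image_id, _ in votes.most_common()]

# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 12: [1, 3], 30: [2]}
index = build_inverted_file(database)
ranked = query(index, [1, 2])
```

A real system would typically weight the votes (for example with tf-idf) instead of counting shared words.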

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). "Imagenet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). "Neural codes for image retrieval." In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). "CNN features off-the-shelf: an astounding baseline for recognition." In DeepVision CVPRW 2014.

Babenko, A. & Lempitsky, V. (2015). "Aggregating local deep features for image retrieval." ICCV 2015.

Tolias, G., Sicre, R. & Jégou, H. (2015). "Particular object retrieval with integral max-pooling of CNN activations." ICLR 2015.

Kalantidis, Y., Mellina, C. & Osindero, S. (2015). "Cross-dimensional Weighting for Aggregated Deep Convolutional Features." arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
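Sum or max pooling collapses the spatial grid of a convolutional layer into one vector per image. A minimal sketch, assuming features stored as an `(H, W, C)` array; the final L2 normalization is a common choice for retrieval and an assumption here, not something stated on the slide.

```python
import numpy as np

def global_descriptor(conv_features, mode="sum"):
    """Pool an (H, W, C) conv feature map into one C-dimensional vector."""
    flat = conv_features.reshape(-1, conv_features.shape[-1])
    pooled = flat.sum(axis=0) if mode == "sum" else flat.max(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled  # assumed L2 normalization

# Stand-in for real activations: a 2 x 3 grid with 4 channels.
features = np.ones((2, 3, 4))
sum_desc = global_descriptor(features, mode="sum")
max_desc = global_descriptor(features, mode="max")
```

Two images can then be compared with a dot product between their descriptors.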

Ng, J., Yang, F. & Davis, L. (2015). "Exploiting local features from deep networks for image retrieval." In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

[Figure: an image at (336x256) resolution is passed through conv5_1 from VGG16 [1], giving a (42x32) feature map; the local features are assigned to 25K centroids, giving a 25K-D vector.]

Query representation: Global Search (GS) and Local Search (LS)

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales.

The same parameters are shared by the three convnets: theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}.

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation.

Pixel-wise soft-max classifier.

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels (prediction with a 2-layer network)

Solution 2: Superpixels + CRF (prediction with a 2-layer network)

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Example: C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
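The selection rule can be sketched by walking from each leaf up to the root and keeping the cheapest component seen. The tree and the cost values below are made up so that the outcome matches the slide's example (C2 takes the label of C5); they are not taken from the paper.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]  # move one level up; the root has parent None
        best[leaf] = best_node
    return best

# Hypothetical segmentation tree: C5 merges C2 and C3, C6 (root) merges C1 and C5.
parent = {"C1": "C6", "C2": "C5", "C3": "C5", "C5": "C6", "C6": None}
cost = {"C1": 0.3, "C2": 0.6, "C3": 0.4, "C5": 0.2, "C6": 0.5}
cover = optimal_cover(parent, cost, leaves=["C1", "C2", "C3"])
```

Each leaf is then labelled with the class predicted for its selected component, so different pixels can be explained at different levels of the tree.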

Objects Segmentation SDS

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes.

Multiscale combinatorial grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the same BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; both are concatenated as [1 2]. The network is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
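Masking with the mean image can be sketched in one line of NumPy. As I understand the cited paper, the point is that masked pixels become zero once the mean is subtracted at the network input; the function name and array shapes below are assumptions for the example.

```python
import numpy as np

def mask_background(crop, region_mask, mean_image):
    """Keep the region's pixels and replace the background with the mean image.

    crop, mean_image: (H, W, 3) arrays; region_mask: (H, W) boolean array.
    """
    return np.where(region_mask[..., None], crop, mean_image)

crop = np.full((2, 2, 3), 200.0)
mean_image = np.full((2, 2, 3), 120.0)
region_mask = np.array([[True, False], [False, True]])
masked = mask_background(crop, region_mask, mean_image)
```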

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
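As a rough illustration of the pixel IU (Jaccard index) metric named above, the following minimal Python sketch computes the per-class intersection over union between a predicted and a ground-truth label map. The function name `pixel_iu` and the toy label maps are hypothetical; this is not the evaluation code used by the authors.

```python
def pixel_iu(pred, truth, num_classes):
    """Per-class pixel IU between two label maps given as lists of rows."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                # Pixel counted once in both intersection and union of class p
                inter[p] += 1
                union[p] += 1
            else:
                # Disagreement: the pixel enlarges the union of both classes
                union[p] += 1
                union[t] += 1
    # Classes absent from both maps have an empty union and yield None
    return [i / u if u else None for i, u in zip(inter, union)]


# Toy 2x3 label maps with two classes (hypothetical data)
pred = [[0, 0, 1], [0, 1, 1]]
truth = [[0, 1, 1], [0, 1, 1]]
ious = pixel_iu(pred, truth, num_classes=2)
```

Averaging the non-empty entries of `ious` gives a mean IU over classes.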

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Example figure labels: person, bag, me, my bag]

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 12: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Proposals DeepBox Results

Slides credit: Marc Bolaños

Increasing not only detection capabilities of known classes but also of unknown ones (suitable for Object Discovery)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Example figure labels: person, bag, me, my bag]

Detection Objects

DPM (HOG features) [1], R-CNN [2], SPPnet [3]

Hand-crafted features; deep features

+60

Slide credit: Amaia Salvador

Detection Objects

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.

Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs)

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. Fast R-CNN. ICCV 2015.

Slide credit: Amaia Salvador

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Same as SPP [3] but single scale

Detection Objects Fast R-CNN

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

[Figure: a region of interest of size h x w on the CONV5 feature map is divided into H' x W' pooling bins]

Size of pooling bins: h/H' x w/W'

Max pooling within each bin
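The pooling-bin arithmetic above can be sketched in a few lines of plain Python. This is only an illustrative single-channel version under stated assumptions: the function name `roi_max_pool`, the `(top, left, h, w)` layout of the region, and the floor/ceil rounding of bin borders are choices made here, not details taken from the slide.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region of a 2D feature map into out_h x out_w bins.

    feature_map: list of rows (one channel of the CONV5 output).
    roi: (top, left, h, w) in feature-map cells.
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i spans rows [floor(i*h/H'), ceil((i+1)*h/H')) inside the RoI
        r0 = top + (i * h) // out_h
        r1 = top + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map whose value at (r, c) is 4*r + c
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
full = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)
part = roi_max_pool(fmap, (1, 1, 3, 3), 2, 2)
```

Whatever the size of the region, the output always has `out_h x out_w` values, which is what lets regions of different sizes feed the same fully connected layers.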

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Multi-task loss

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Replace the usage of external object proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: conv layers up to Conv Layer 5 feed the RPN, which outputs RPN proposals; the RPN proposals and the conv features go through the RoI pooling layer and the FC layers to give class scores and class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
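To make the k = 9 count concrete, here is a minimal sketch that enumerates 3 scales x 3 aspect ratios around one feature-map location. The specific scale values (128, 256, 512 pixels) and ratios (0.5, 1, 2) are assumptions taken from the cited Faster R-CNN paper rather than from the slide, and `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # r is height / width; the area stays close to s * s for every ratio
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0, 0)
```

For each of these k reference boxes, the RPN predicts one objectness score pair and one set of box-regression offsets at every sliding-window position.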

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: conv layers up to Conv Layer 5 feed the RPN, which outputs RPN proposals; the RPN proposals and the conv features go through the RoI pooling layer and the FC layers to give class scores and class probabilities]

Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: conv layers up to Conv Layer 5 feed the RPN, which outputs RPN proposals; the RPN proposals and the conv features go through the RoI pooling layer and the FC layers to give class scores and class probabilities]

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train RPN, initialized with an ImageNet pre-trained model

ImageNet weights (fine-tuned)

[Diagram: conv layers up to Conv Layer 5, RPN, RPN proposals]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 2: Train Fast R-CNN with learned RPN proposals

ImageNet weights (fine-tuned)

[Diagram: conv layers up to Conv Layer 5, RPN proposals (learned in 1), class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 3: The model trained in 2 is used to initialize RPN and train again

Weights from Step 2 (fixed)

[Diagram: conv layers up to Conv Layer 5, RPN, RPN proposals]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 4: Fine-tune FC layers of Fast R-CNN using the same shared convolutional layers as in 3

Weights from Steps 2 & 3 (fixed)

[Diagram: conv layers up to Conv Layer 5, RPN proposals (learned in 3), class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]

Detection Objects Reinforcement L

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R

[Figure label: ground-truth bounding box]

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R for trigger action

The reward function considers the number of steps as a cost

[Figure values: 3; minimum IoU 0.6]
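The two rewards can be sketched as follows. The values 3 and 0.6 come from the slide; the sign-of-IoU-change form of the movement reward and the reading of 3 as the magnitude of the trigger reward are assumptions based on the cited Caicedo and Lazebnik paper. The names `iou`, `move_reward` and `trigger_reward` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(prev_box, new_box, gt_box):
    """+1 if a transformation action improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(prev_box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta


gt = (0, 0, 10, 10)          # hypothetical ground-truth box
far = (5, 5, 15, 15)         # poor overlap with gt
near = (2, 2, 12, 12)        # better overlap with gt
```

Because every non-terminal step can be penalised, the agent is pushed to reach the trigger with few transformations.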

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. Category-specific Q-network.

Policy of the agent: select the action A with maximum estimated value of the learnt action-value function
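A minimal sketch of the pieces named above, with the Q-network itself left out: a greedy policy over the network's output units, an epsilon-greedy variant for exploration during training, a replay memory, and the standard Q-learning target. The discount `gamma=0.9`, the memory size and all function names are assumptions for illustration, not values stated on the slide.

```python
import random
from collections import deque

# Replay memory of (state, action, reward, next_state) experiences
replay_memory = deque(maxlen=1000)


def greedy_action(q_values):
    """Policy: index of the action with maximum estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target for one stored experience."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def sample_batch(memory, batch_size, rng=random):
    """Draw a random mini-batch of stored experiences for a training step."""
    return rng.sample(list(memory), min(batch_size, len(memory)))


replay_memory.append(("s0", 1, 1.0, "s1"))
batch = sample_batch(replay_memory, 32)
```

In the category-specific setting, one such network (and policy) would be trained per object class.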

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)

2) Terminal regions (TR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks

Detection Faces DDFD Train

Randomly sample sub-windows (blocks). A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise

Total samples: 200K positive and 20M negative
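The labelling rule above can be sketched as below. The box format `(x1, y1, x2, y2)` and the helper names `iou` and `label_window` are assumptions made for this example; the 50% threshold is the one given on the slide.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """1 (positive) if the window overlaps any annotated face by more than threshold, else 0."""
    return 1 if any(iou(window, face) > threshold for face in faces) else 0


faces = [(10, 10, 60, 60)]                        # hypothetical annotation
positive = label_window((12, 12, 62, 62), faces)  # IoU is about 0.85
negative = label_window((50, 50, 100, 100), faces)  # IoU is about 0.02
```

Since most random windows miss every face, the rule naturally produces far more negatives than positives, consistent with the 200K versus 20M totals.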

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
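One way to read "3 times per octave" is that consecutive scale factors differ by 2^(1/3), so three steps double (or halve) the image size. The sketch below lists such factors between two bounds; the bounds 0.5 and 2.0 and the name `pyramid_scales` are hypothetical.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) lying in [min_scale, max_scale]."""
    scales = []
    # Smallest integer k whose factor is not below min_scale
    k = math.ceil(steps_per_octave * math.log2(min_scale) - 1e-9)
    while 2 ** (k / steps_per_octave) <= max_scale + 1e-9:
        scales.append(2 ** (k / steps_per_octave))
        k += 1
    return scales


scales = pyramid_scales(0.5, 2.0)
```

Running a fixed-size detector on every rescaled copy lets it respond to faces that are smaller or larger than its input window in the original image.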

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Detection Faces DDFD Test

This makes it possible to:

Efficiently run the convnet on images of any size

Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
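A minimal greedy NMS sketch in plain Python follows. It is a generic version of the technique, not the implementation from the cited tutorial or from DDFD; the 0.5 overlap threshold and the helper names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]  # hypothetical detections
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so only the higher-scoring one survives, while the distant third box is kept.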

Detection Faces DDFD Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models

- OpenCV face detector is an implementation of Viola & Jones

- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Example figure labels: person, bag, me, my bag]

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces Recognition FaceNet

[Diagram: Faces, FaceNet, Euclidean space where distances correspond to face similarity]

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

by means of well-chosen triplets using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
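The triplet idea can be sketched as a hinge on squared Euclidean distances between embeddings: the anchor should be closer to a positive (same identity) than to a negative (different identity) by some margin. The margin value 0.2 and the function names are assumptions for illustration; the embeddings below are toy 2-D vectors, not real FaceNet outputs.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


anchor = [0.0, 0.0]
same_person = [0.0, 1.0]    # squared distance 1 from the anchor
other_person = [2.0, 0.0]   # squared distance 4 from the anchor
easy = triplet_loss(anchor, same_person, other_person)
hard = triplet_loss(anchor, other_person, same_person)
```

Triplets with zero loss, like `easy`, contribute no gradient, which is why the choice of triplets matters for training.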

Faces Recognition FaceNet

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 1 (NN1): ZF

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N., and Marqués F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

Objects Recognition Retrieval

[Figure: image database; visual query “A dog”; expected outcome]

Objects Recognition Retrieval

[Figure: image database; visual query “This dog”; expected outcome]

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

[Table: INVERTED FILE, with columns word and Image ID]
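Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. The sketch below is a toy version with hypothetical image IDs and word IDs, and it ranks by the number of shared words rather than by any weighting scheme the authors may have used.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a stable order
    return sorted(votes, key=lambda i: (-votes[i], i))


# Hypothetical database: image ID -> visual words found in that image
database = {1: [1, 2], 2: [2, 3], 3: [4]}
index = build_inverted_file(database)
ranking = query(index, [2, 3])
```

Only images sharing at least one word with the query are ever touched, which is what makes the search scale to large databases.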

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
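Sum or max pooling turns a conv feature map into one value per channel, giving a global descriptor whose length equals the number of channels. A minimal sketch, assuming the feature map is stored as a list of spatial locations with one activation per channel (a hypothetical layout chosen for this example):

```python
def pool_conv_features(feature_map, mode="sum"):
    """Aggregate per-location channel activations into one global vector.

    feature_map: list of spatial locations, each a list of C channel activations.
    mode: "sum" or "max".
    """
    aggregate = sum if mode == "sum" else max
    # zip(*...) regroups the activations channel by channel
    return [aggregate(channel) for channel in zip(*feature_map)]


# Toy map with 3 spatial locations and 3 channels
toy_map = [[1, 0, 2], [3, 1, 0], [0, 4, 1]]
sum_descriptor = pool_conv_features(toy_map, "sum")
max_descriptor = pool_conv_features(toy_map, "max")
```

The resulting vectors are typically normalised before being compared, a step left out here.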

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Objects Recognition Retrieval

Resolution (336x256)

conv5_1 from VGG16 [1]

(42x32)

25K centroids, 25K-D vector

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Example figure labels: person, bag, me, my bag]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
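The selection rule above amounts to walking from a leaf to the root of the segmentation tree and keeping the node with the lowest cost. The sketch below uses a hypothetical five-node tree, made-up cost values and hypothetical names (`parent`, `cost`, `optimal_component`); how the cost S is computed is not covered here.

```python
def optimal_component(parent, cost, leaf):
    """Return the node with minimal cost on the path from leaf to root."""
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)  # the root has no parent, which ends the walk
    return best


# Hypothetical tree: pixels p1 and p2 are leaves, C5 is the root
parent = {"p1": "C1", "p2": "C2", "C1": "C5", "C2": "C5", "C5": None}
cost = {"p1": 0.9, "p2": 0.8, "C1": 0.2, "C2": 0.7, "C5": 0.4}

cover_p1 = optimal_component(parent, cost, "p1")
cover_p2 = optimal_component(parent, cost, "p2")
```

In this toy tree the pixel under C2 ends up covered by C5, since C5 has a lower cost than C2, which mirrors the situation described on the slide.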

Page 13: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Proposals DeepBox Results

13

Increasing not only Detection capabilities of known classes but also of unknown ones (suitable for Object Discovery)

Slides credit Marc Bolantildeos

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

14

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Detection Objects

Hand-crafted features: DPM (HOG features) [1]

Deep features: R-CNN [2], SPPnet [3]

[Figure annotation: +60]

Slide credit: Amaia Salvador

Detection Objects

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Part-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

The RoI pooling layer is the same as SPP [3], but at a single scale.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

RoI pooling on CONV5: a region of size h x w on the feature map is divided into an H' x W' grid of pooling bins. Size of pooling bins: h/H' x w/W'. Max pooling is applied inside each bin.
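The binning rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the layer from the paper: the function name `roi_max_pool`, the `(y0, x0, h, w)` RoI layout and the floor/ceil rounding of bin edges are all hypothetical choices.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    roi = (y0, x0, h, w) in feature-map coordinates (hypothetical layout).
    """
    y0, x0, h, w = roi
    # Bin edges: each bin spans roughly h/out_h rows and w/out_w columns.
    ys = np.linspace(y0, y0 + h, out_h + 1)
    xs = np.linspace(x0, x0 + w, out_w + 1)
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            pooled[i, j] = feature_map[r0:r1, c0:c1].max()
    return pooled

# Toy 6x6 "CONV5" channel and a 4x4 RoI pooled to a fixed 2x2 output.
conv5 = np.arange(36, dtype=np.float32).reshape(6, 6)
pooled = roi_max_pool(conv5, roi=(0, 0, 4, 4), out_h=2, out_w=2)
```

Whatever the RoI size, the output grid is fixed, which is what lets the following fully-connected layers accept regions of arbitrary shape.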

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
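As a rough sketch of what a multi-task loss of this kind looks like, the snippet below adds a classification term (negative log-probability of the true class) and a box regression term (smooth L1) that is only active for non-background regions. The function names, the weight `lam` and the convention that class 0 is background are assumptions made for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5 (elementwise)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """L = L_cls + lam * [true_class >= 1] * L_loc (class 0 assumed to be background)."""
    loss_cls = -np.log(class_probs[true_class])
    loss_loc = smooth_l1(np.asarray(box_pred) - np.asarray(box_target)).sum()
    return float(loss_cls + lam * (true_class >= 1) * loss_loc)

probs = np.array([0.1, 0.7, 0.2])
box_pred, box_target = [0.5, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0]
loss_fg = multitask_loss(probs, 1, box_pred, box_target)  # object: both terms
loss_bg = multitask_loss(probs, 0, box_pred, box_target)  # background: classification only
```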

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Faster R-CNN replaces the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).

[Figure: the conv layers, up to Conv Layer 5, feed the RPN, which outputs RPN proposals. The proposals go through the RoI pooling layer and the FC layers to produce class scores and class probabilities.]

For each of k anchor boxes at every location, the RPN predicts objectness scores (object / no object) and a bounding box regression. In practice k = 9 (3 different scales and 3 aspect ratios).
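The k = 9 anchors can be enumerated as every combination of 3 scales and 3 aspect ratios. The sketch below assumes side lengths of 128, 256 and 512 pixels and ratios of 1:2, 1:1 and 2:1; the slide only states the counts, so these particular values and the helper name `make_anchors` are illustrative assumptions.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor shapes as (w, h) pairs.

    Each anchor has area scale**2 and aspect ratio h / w = ratio.
    The default scales and ratios are assumed values, not taken from the slides.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((w, h))
    return np.array(anchors)

anchors = make_anchors()  # shape (9, 2): one (w, h) row per anchor
```

At training and test time the same 9 shapes are centred on every position of the conv feature map, so the number of candidate boxes grows with the image size while the number of RPN outputs per position stays fixed.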

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: the RPN proposals feed the Fast R-CNN branch (RoI pooling layer, FC layers, class scores, class probabilities) on top of the shared conv layers.]

4-step training to share features between the RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model (ImageNet weights, fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1 (ImageNet weights, fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again (weights from Step 2, fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3 (weights from Steps 2 & 3, fixed).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC).]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions.
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark.

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of states S: each state is a pair (o, h)

o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward function R for transformation actions: based on the overlap of the current box with the ground-truth bounding box.

Reward function R for the trigger action: a reward of magnitude 3, which requires a minimum IoU of 0.6 with the ground truth to be positive.

The reward function considers the number of steps as a cost.
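The two rewards can be sketched as follows. The magnitude 3 and the 0.6 IoU threshold come from the slide; the exact form of the transformation reward (the sign of the change in IoU, with ties counted as negative) and the `(x0, y0, x1, y1)` box layout are assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt):
    """+1 if the transformation improved the overlap with the ground truth, else -1 (assumed form)."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """+eta if the final box reaches IoU >= tau with the ground truth, else -eta."""
    return eta if iou(box, gt) >= tau else -eta

gt = (0, 0, 10, 10)
good_stop = trigger_reward((0, 0, 10, 8), gt)      # IoU 0.8
bad_stop = trigger_reward((0, 0, 5, 5), gt)        # IoU 0.25
step = move_reward((0, 0, 5, 5), (0, 0, 10, 8), gt)
```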

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

The policy is learned with reinforcement learning, using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and uses a category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
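A minimal sketch of this policy, assuming the Q-network has already produced one value per action: the agent takes the arg-max action, optionally exploring with probability epsilon during training, and stores each transition in a bounded replay memory. The 9-entry Q vector, the memory size and the helper name `select_action` are hypothetical.

```python
import random
from collections import deque

import numpy as np

def select_action(q_values, epsilon=0.0, rng=random):
    """Greedy policy over the estimated Q-values, with optional epsilon exploration."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))

# Replay memory of (state, action, reward, next_state) tuples; oldest entries are dropped.
replay_memory = deque(maxlen=1000)

# One output unit per action (9 values here is an assumption for the example).
q_values = np.array([0.1, 0.4, -0.2, 0.0, 0.3, 0.9, 0.2, -0.5, 0.05])
action = select_action(q_values)
replay_memory.append(("state_t", action, 1.0, "state_t_plus_1"))
```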

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Dataset for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.

A sliding window of 227x227 is run over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
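Rescaling "3 times per octave" means that three consecutive rescaling steps double (or halve) the image size, so consecutive scale factors differ by 2^(1/3). The sketch below lists such factors; the range from 0.5x to 2x and the function name `pyramid_scales` are assumptions for the example.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, with steps_per_octave steps per doubling."""
    scales = []
    k = 0
    while True:
        s = min_scale * 2.0 ** (k / steps_per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for floating-point rounding
            break
        scales.append(s)
        k += 1
    return scales

# Two octaves (0.5x to 2x) at 3 steps per octave give 7 scale factors.
scales = pyramid_scales(0.5, 2.0)
```

The fixed 227x227 window is then slid over every rescaled copy, so a face that is too large or too small at the original resolution fits the window at some other level.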

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
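The conversion can be checked numerically: a fully-connected layer applied to a flattened C x K x K input gives the same numbers as a convolution whose filters are the reshaped FC weights, and the same filters slide over a larger input to give a grid of scores. The sizes below and the plain-NumPy `conv_valid` helper are hypothetical, chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, n_out = 4, 3, 5                             # hypothetical sizes
W_fc = rng.standard_normal((n_out, C * K * K))    # FC weights on a flattened C x K x K input
W_conv = W_fc.reshape(n_out, C, K, K)             # the same weights viewed as conv filters

def conv_valid(x, w):
    """Plain 'valid' cross-correlation: x is (C, H, W), w is (n_out, C, K, K)."""
    n, _, k, _ = w.shape
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.empty((n, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[:, i, j] = np.tensordot(w, x[:, i:i + k, j:j + k], axes=3)
    return out

patch = rng.standard_normal((C, K, K))
fc_out = W_fc @ patch.ravel()                     # FC layer on one window
conv_out = conv_valid(patch, W_conv)              # same result, shape (n_out, 1, 1)
heatmap = conv_valid(rng.standard_normal((C, 8, 10)), W_conv)  # larger input: a map of scores
```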

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) is applied to avoid overlapping detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
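A common greedy form of NMS keeps the highest-scoring box, discards every remaining box that overlaps it by more than a threshold, and repeats. The sketch below is a generic version under assumed conventions (boxes as `(x0, y0, x1, y1)` with positive area, an IoU threshold of 0.5); it is not necessarily the variant used in DDFD or in the cited tutorial.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) as (x0, y0, x1, y1); returns the kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]              # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Overlap of the kept box with all remaining boxes.
        iw = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                        - np.maximum(boxes[i, 0], boxes[rest, 0]))
        ih = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                        - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = iw * ih
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_thresh]       # drop boxes that overlap too much
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```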

Detection Faces DDFD Results

Precision vs. recall curves:

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning):

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

The embedding is learned by means of well-chosen triplets, using curriculum learning:

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
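A triplet consists of an anchor face, a positive (same identity) and a negative (different identity). A common way to write the corresponding loss is a hinge on squared Euclidean distances, sketched below; the margin of 0.2 and the toy 2-D embeddings are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin) for one triplet of embeddings."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

a = np.array([1.0, 0.0])   # anchor
p = np.array([0.8, 0.6])   # same identity, close to the anchor
n = np.array([0.0, 1.0])   # different identity, far from the anchor

easy = triplet_loss(a, p, n)   # already separated by more than the margin
hard = triplet_loss(a, n, p)   # roles swapped: violates the margin
```

Triplets like `easy` contribute zero loss, which is why the choice of triplets matters: only those that violate the margin produce a training signal.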

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

Given an image database and a visual query, the expected outcome depends on how the query is read:

- "A dog": any image in the database that shows a dog.
- "This dog": images of the same particular dog.

Instance retrieval (instance: object, building, person, place…)

Objects Recognition Retrieval

Bag of Visual Words: local hand-crafted features (e.g. SIFT) v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space are turned into a bag of visual words. The resulting representation is high-dimensional and highly sparse, and is indexed with an inverted file.

[Table: INVERTED FILE, listing for each visual word the IDs of the images that contain it.]
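An inverted file stores, for each visual word, the set of images in which it appears, so a query only touches the images that share at least one word with it. The sketch below uses a toy vocabulary and a simple vote count as the ranking score; the image IDs, word IDs and scoring rule are made up for the example.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index

def query(index, query_words):
    """Rank images by the number of distinct query words they share (ties broken by ID)."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

# Toy database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 6]}
index = build_inverted_file(image_words)
ranking = query(index, [1, 2])
```

Because the bag-of-words vectors are highly sparse, most word lists are short and the lookup stays cheap even for large databases.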

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

FC layers as global feature representation:

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Sum/max pooled conv features as global representation:

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Conv features encoded with VLAD as global representation:

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
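The sum/max pooling option listed above reduces a stack of conv feature maps to one value per channel, giving a compact global descriptor. A minimal sketch, assuming activations of shape (C, H, W) and a final L2 normalisation (the normalisation step and the function name are assumptions, and the cited methods add further weighting or region schemes on top):

```python
import numpy as np

def pooled_descriptor(conv_maps, mode="sum"):
    """conv_maps: (C, H, W) activations -> L2-normalised C-dimensional descriptor."""
    flat = conv_maps.reshape(conv_maps.shape[0], -1)
    desc = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

# Toy activations: 2 channels on a 2x2 spatial grid.
maps = np.array([[[1.0, 2.0], [3.0, 4.0]],
                 [[0.0, 0.0], [0.0, 5.0]]])
sum_desc = pooled_descriptor(maps, "sum")
max_desc = pooled_descriptor(maps, "max")
```

Two images can then be compared with a dot product of their descriptors, which is the cosine similarity after normalisation.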

Objects Recognition Retrieval

Bags of local convolutional features: an image at (336x256) resolution is passed through VGG16 [1] and the conv5_1 feature maps (42x32) are extracted. Local features are assigned to one of 25K centroids, which gives a 25K-D vector per image.

Query representation:

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales.

The same parameters are used in the three convnets: theta_i = theta_0 = (filter weights H_l and biases b_l). Non-linearity: tanh. Pooling: max.

Upsampling and concatenation.

Pixel-wise soft-max classifier.

Objects Segmentation Farabet

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

Example: C2 will be labelled with the class of C5.
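The optimal-cover rule can be sketched on a small hand-made tree: for every leaf, walk up to the root and keep the component with the lowest cost. The tree layout, the node names and the cost values below are invented for the example; only the selection rule comes from the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root has no parent, which ends the walk
        best[leaf] = best_node
    return best

# Hypothetical segmentation tree: C6 is the root, C5 groups the leaves C2 and C3.
parent = {"C1": "C6", "C2": "C5", "C3": "C5", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.4, "C5": 0.1, "C6": 0.7}
cover = optimal_cover(parent, cost, leaves=["C1", "C2", "C3"])
```

With these costs, the leaf C2 is covered by C5 and therefore takes the class predicted for C5, while C1 keeps its own prediction.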

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) is used to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; the two are concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.

Vector C: Training: as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation: the output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
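Pixel IU for a class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch, averaging over classes and skipping classes absent from both maps (the averaging and skipping conventions are assumptions; benchmarks differ on these details):

```python
import numpy as np

def mean_pixel_iu(pred, gt, num_classes):
    """Mean over classes of |pred == c AND gt == c| / |pred == c OR gt == c|."""
    ius = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skipped
        ius.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ius))

# Toy 2x3 label maps with two classes.
gt = np.array([[0, 0, 1],
               [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 1]])
score = mean_pixel_iu(pred, gt, num_classes=2)  # class 0: 2/3, class 1: 3/4
```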

One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 14: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

14

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects

15

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects

16

DPM (HOG features)[1] R-CNN [2] SPPnet [3]

Hand-crafted features Deep features

+60

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects

17

Girshick Ross Forrest Iandola Trevor Darrell and Jitendra Malik Deformable Part Models are Convolutional Neural Networks CVPR 2015

Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

18

Girshick R Donahue J Darrell T amp Malik J Rich feature hierarchies for accurate object detection and semantic segmentation CVPR 2014

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

19

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

20

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

21

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

22

Girshick Ross Fast R-CNN ICCV 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

23

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

24

Slide credit Amaia Salvador

Same as SPP[3] but single scale

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

25

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

H

h

w

h

w

Size of pooling binsh Hrsquo x w Wrsquo

wWrsquo

hHrsquomax pooling

CONV5

Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss
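
As a rough sketch of what a multi-task detection loss looks like, the snippet below adds a classification term (negative log-probability of the true class) to a box-regression term (smooth L1 over the four box offsets), with the regression term switched off for background. The function names, the convention that class 0 is background, and the weight `lam` are assumptions for illustration; the slide itself only names the loss.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers than L2)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification loss plus (for non-background RoIs) localization loss.

    class_probs: softmax output over classes, index 0 assumed to be background.
    pred_box / target_box: 4 regression offsets each (e.g. x, y, w, h).
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only classify.
        return cls_loss
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss
```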

Detection Objects Faster R-CNN

29

Ren, S., He, K., Girshick, R. & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems

[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T. & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

[CPMC] Carreira, J. & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F. & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search, CPMC, MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

[Figure: Faster R-CNN architecture, with labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

[Figure: Faster R-CNN architecture, with labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
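
To make "k = 9" concrete, here is a small sketch that builds 3 scales x 3 aspect ratios of anchor boxes around one position. The helper name `make_anchors`, the default scale values (128, 256, 512 pixels) and the ratio values (0.5, 1, 2) are assumptions chosen for illustration; only the 3 x 3 = 9 structure comes from the slide.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor keeps an area of scale**2 while its height/width ratio varies.
    """
    anchors = []
    for s in scales:
        area = float(s * s)
        for r in ratios:          # r = height / width
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

At every sliding-window position of the RPN, each of these k anchors receives its own objectness score and its own box-regression offsets.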

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

[Figure: Faster R-CNN architecture, with labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

[Figure: Faster R-CNN architecture, with labels: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

[Figure labels: conv layers, Conv Layer 5, RPN, RPN proposals]

Step 1: Train the RPN, initialized with an ImageNet pre-trained model

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

[Figure labels: conv layers, Conv Layer 5, RPN proposals (learned in Step 1), class probabilities]

Step 2: Train Fast R-CNN with the learned RPN proposals

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

[Figure labels: conv layers, Conv Layer 5, RPN, RPN proposals]

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again

Weights from Step 2 (fixed)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

[Figure labels: conv layers, Conv Layer 5, RPN proposals (learned in Step 3), class probabilities]

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3

Weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

46

Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]

47

Detection Objects Reinforcement L

Object is localized based on visual features from AlexNet FC6

48

Detection Objects Reinforcement

Slide credit Míriam Bellver

Set of actions A

Transformation actions

49

Detection Objects Reinforcement

Slide credit Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with an inhibition-of-return (IoR) mark

50

Detection Objects Reinforcement

Slide credit Míriam Bellver

Set of states S

(o, h)

o = feature vector from the pre-trained CNN (fc6), 4096-dim

h = history of taken actions, binary vector, 90-dim
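
One way to obtain a 90-dimensional binary history is to one-hot encode the most recent actions. The sketch below assumes 9 possible actions and a window of the last 10 actions (9 x 10 = 90); those two constants, and the helper names, are assumptions chosen to match the dimension on the slide.

```python
NUM_ACTIONS = 9     # assumed number of distinct actions
HISTORY_LEN = 10    # assumed number of past actions kept (9 * 10 = 90 dims)


def encode_history(past_actions):
    """One-hot encode the last HISTORY_LEN action indices into a flat binary vector."""
    recent = past_actions[-HISTORY_LEN:]
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def make_state(o, past_actions):
    """State (o, h): CNN feature vector concatenated with the action history."""
    return list(o) + encode_history(past_actions)
```

With a 4096-dimensional `o`, `make_state` yields a 4186-dimensional state vector.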

51

Detection Objects Reinforcement

Slide credit Míriam Bellver

Reward Function R

[Figure label: ground-truth bounding box]

52

Detection Objects Reinforcement

Slide credit Míriam Bellver

Reward Function R for trigger action

The reward function considers the number of steps as a cost

Trigger reward magnitude: 3

Minimum IoU: 0.6
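
A hedged sketch of how such rewards could be computed from IoU values with the ground-truth box. The function names are illustrative; the movement reward (+1 when a transformation improves the IoU, -1 otherwise) is an assumption about the non-trigger actions, while the magnitude 3 and the threshold 0.6 are the values shown on the slide.

```python
TRIGGER_REWARD = 3.0   # magnitude shown on the slide
MIN_IOU = 0.6          # minimum IoU shown on the slide


def move_reward(iou_before, iou_after):
    """Reward for a transformation action: did the box get closer to the target?"""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(iou_final, eta=TRIGGER_REWARD, tau=MIN_IOU):
    """Reward for the trigger action: positive only if the final box is good enough."""
    return eta if iou_final >= tau else -eta
```

Making the trigger reward larger than a single movement reward encourages the agent to stop as soon as the box overlaps the object well enough.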

53

Detection Objects Reinforcement

Slide credit Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

54

Detection Objects Reinforcement

Slide credit Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions

The algorithm incorporates a replay memory to collect experiences

Category-specific Q-network

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
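
The pieces named above (greedy policy over the network outputs, replay memory, Q-learning) can be sketched without any deep learning library. Everything here is illustrative: `select_action`, `ReplayMemory` and `q_target` are hypothetical names, and the epsilon-greedy exploration and the discount `gamma` are assumptions, not values taken from the slides.

```python
import random
from collections import deque


def select_action(q_values, epsilon=0.0, rng=random):
    """Pick the action with the highest estimated value (explore with prob. epsilon)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target for the output unit of the action taken."""
    return reward if terminal else reward + gamma * max(next_q_values)
```

In training, `q_values` would be the output units of the Q-network for the current state, and mini-batches drawn from the replay memory would be regressed towards `q_target`.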

55

Detection Objects Reinforcement

Slide credit Míriam Bellver

56

Detection Objects Reinforcement

Slide credit Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)

2) Terminal Regions (TR)

57

Detection Objects Reinforcement

Slide credit Míriam Bellver

Best performance with few region proposals

58

Detection Objects Reinforcement

Slide credit Míriam Bellver

59

Detection Objects Reinforcement

Slide credit Míriam Bellver

Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Detection Faces DDFD Train

63

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes
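
"3 times per octave" means consecutive pyramid levels differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch, assuming a hypothetical helper `pyramid_scales` and caller-chosen minimum and maximum scale factors:

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) lying within [min_scale, max_scale]."""
    # Small tolerances keep exact powers of two inside the range despite rounding.
    k_min = math.ceil(math.log2(min_scale) * steps_per_octave - 1e-9)
    k_max = math.floor(math.log2(max_scale) * steps_per_octave + 1e-9)
    return [2.0 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]
```

With a fixed detector window, an image rescaled by a factor s reveals faces that are about (window size / s) pixels wide in the original image.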

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Detection Faces DDFD Test

67

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
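
A minimal greedy version of non-maximum suppression, written in plain Python for illustration. It is not the code from the cited source; the `(x1, y1, x2, y2)` box layout, the function names and the 0.5 overlap threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one.

    Returns the indices of the kept boxes, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping detections of the same face collapse into the higher-scoring one, while a detection elsewhere in the image is kept.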

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

72

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

73

Faces Recognition FaceNet

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

74

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

75

Faces Recognition FaceNet

...by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
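
Learning an embedding from triplets can be illustrated with a triplet loss: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below is a plain-Python illustration, not FaceNet's implementation; the function names, the margin value 0.2 and the "semi-hard" selection rule are assumptions used to show what "well-chosen triplets" might mean.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Negative is farther than the positive, but still inside the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return d_ap < d_an < d_ap + margin
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters for training speed.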

76

Faces Recognition FaceNet

77

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

78

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

79

Faces Recognition FaceNet

80

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

81

Faces Recognition FaceNet Software

Software implementation: OpenFace

82

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

E. Mohedano, Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

83

Objects Recognition Retrieval

84

Objects Recognition Retrieval

Image Database

Visual Query

“A dog”

Expected outcome

85

Objects Recognition Retrieval

Image Database

Visual Query

“This dog”

Expected outcome

86

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

87

Objects Recognition Retrieval

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse
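
Because bag-of-visual-words vectors are highly sparse, they are usually stored in an inverted file: for each visual word, the list of images that contain it. A minimal sketch with hypothetical helper names and made-up image IDs; ranking by the number of shared words is a simplification (real systems typically weight words, for example with tf-idf).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it.

    image_words: dict of image_id -> iterable of visual word IDs.
    """
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index


def search(index, query_words):
    """Rank images by how many distinct query words they share (ties by image ID)."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only the images that share at least one visual word with the query are ever touched, which is what makes the search scale to large databases.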

88

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

89

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

90

Objects Recognition Retrieval

Babenko, A. & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R. & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C. & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max pooled conv features as global representation
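
Sum or max pooling turns a stack of C convolutional feature maps into a single C-dimensional image descriptor by collapsing the spatial dimensions. A plain-Python sketch with illustrative names; the L2 normalization step is an assumption commonly paired with this kind of descriptor, not something stated on the slide.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each H x W feature map to one number (sum or max over positions).

    feature_maps: list of C channels, each a list of rows.
    Returns a C-dimensional global descriptor.
    """
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale the descriptor to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)
```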

91

Objects Recognition Retrieval

Ng, J., Yang, F. & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation

92

Objects Recognition Retrieval

93

Objects Recognition Retrieval

Resolution (336x256)

conv5_1 from VGG16 [1]

(42x32)

25K centroids, 25K-D vector

94

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

95

Objects Recognition Retrieval

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Objects Segmentation Farabet

103

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1: Superpixels

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
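
The selection rule above amounts to walking from a leaf up to the root of the segmentation tree and keeping the node with the lowest cost. A small sketch under stated assumptions: the tree is given as a hypothetical `parent` dictionary, `cost` holds a precomputed cost S for every node, and the node names and numbers in the usage example are made up.

```python
def optimal_component(leaf, parent, cost):
    """Return the node on the leaf-to-root path with minimal cost.

    parent: dict node -> parent node (the root maps to None or is absent).
    cost: dict node -> cost S of labelling that component as a whole.
    """
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)
    return best


# Hypothetical tree: pixel -> C2 -> C5 -> C7 (root), with made-up costs.
parent = {"pixel": "C2", "C2": "C5", "C5": "C7", "C7": None}
cost = {"pixel": 0.9, "C2": 0.6, "C5": 0.2, "C7": 0.5}
best = optimal_component("pixel", parent, cost)   # "C5" has the lowest cost
```

Every pixel whose best ancestor is the same component then receives that component's class, which is how a smaller region can end up labelled with the class of a larger one.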

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

[Figure labels: BBOX CNN, feature vector 1, feature vector 2, [1 2], vector A, background masked out with the mean image]

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Figure labels: BBOX CNN, feature vector 1, REGION CNN, feature vector 2, [1 2], vector B]

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
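
Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on label maps stored as nested lists; the function name and the choice to return NaN when the category is absent from both maps are assumptions.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == category
            in_truth = t == category
            inter += in_pred and in_truth   # booleans count as 0 or 1
            union += in_pred or in_truth
    return inter / union if union else float("nan")
```

Averaging this value over all categories gives a single score for a semantic segmentation result.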

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

119

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 15: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects

15

Detection Objects

16

Hand-crafted features: DPM (HOG features) [1]

Deep features: R-CNN [2], SPPnet [3]

+60

Slide credit Amaia Salvador

Detection Objects

17

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.

Convnets (CNNs) actually learn similar detectors to the ones learned by Deformable Parts-based Models (DPMs)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

18

Girshick R Donahue J Darrell T amp Malik J Rich feature hierarchies for accurate object detection and semantic segmentation CVPR 2014

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

19

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

20

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects R-CNN

21

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

22

Girshick Ross Fast R-CNN ICCV 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

23

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

24

Slide credit Amaia Salvador

Same as SPP[3] but single scale

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

25

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

H

h

w

h

w

Size of pooling binsh Hrsquo x w Wrsquo

wWrsquo

hHrsquomax pooling

CONV5

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

29

Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Object Proposal computation is the bottleneck in current state of the art object detection systems

Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 anchors per position (3 different scales and 3 aspect ratios).
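A minimal sketch of how k = 9 anchors (3 scales x 3 aspect ratios) could be enumerated at one position. The scale and ratio values and the name `make_anchors` are assumptions for illustration and are not taken from the slide.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:              # r is height / width (assumed convention)
            w = s / math.sqrt(r)      # width and height chosen so that w * h == s * s
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)      # k = 9 boxes for this position
```

Each anchor would then receive one objectness score and one set of box regression offsets from the RPN.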

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

[Diagram: Conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN


Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

[Diagram: Conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

4-step training to share features between the RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

[Diagram: Conv layers, Conv Layer 5, RPN, RPN proposals]

Step 1: Train the RPN, initialized with an ImageNet pre-trained model.

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

[Diagram: Conv layers, Conv Layer 5, RPN proposals (learned in Step 1), class probabilities]

Step 2: Train Fast R-CNN with the learned RPN proposals.

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

[Diagram: Conv layers, Conv Layer 5, RPN, RPN proposals]

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again.

Weights from Step 2 (fixed)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

[Diagram: Conv layers, Conv Layer 5, RPN proposals (learned in Step 3), class probabilities]

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3.

Weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

46

Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]

47

Detection Objects Reinforcement L

The object is localized based on visual features from AlexNet FC6.

48

Detection Objects Reinforcement

Slide credit Míriam Bellver

Set of actions A

Transformation actions

49

Detection Objects Reinforcement

Slide credit Míriam Bellver

Set of actions A

Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark.

50

Detection Objects Reinforcement

Slide credit Míriam Bellver

Set of states S

(o, h)

o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)
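A small sketch of how the state (o, h) could be assembled. The split of the 90 history dimensions into 10 past actions of 9 one-hot entries each is an assumption recalled from the paper, and all names are hypothetical.

```python
NUM_ACTIONS = 9     # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10    # assumption: 10 past actions, so 9 * 10 = 90 dimensions

def history_vector(past_actions):
    """One-hot encode the most recent actions into a 90-dim binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        h[slot * NUM_ACTIONS + action] = 1
    return h

def build_state(o, past_actions):
    """Concatenate the CNN descriptor o with the action history h."""
    return list(o) + history_vector(past_actions)

state = build_state([0.0] * 4096, [0, 3, 8])   # dummy fc6 descriptor
```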

51

Detection Objects Reinforcement

Slide credit Míriam Bellver

Reward function R, computed against the ground-truth bounding box

52

Detection Objects Reinforcement

Slide credit Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

Trigger reward magnitude: 3

Minimum IoU: 0.6
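The two rewards can be sketched as follows, assuming boxes are (x1, y1, x2, y2) tuples. The values 3 and 0.6 come from the slide; the sign-based reward for transformation actions and the function names are assumptions recalled from the paper.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(prev_box, new_box, gt_box):
    """+1 if the transformation increased IoU with the ground truth, else -1.

    Steps that do not improve the box are penalized, so extra steps act as a cost.
    """
    return 1 if iou(new_box, gt_box) > iou(prev_box, gt_box) else -1

def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, otherwise -eta."""
    return eta if iou(box, gt_box) >= tau else -eta
```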

53

Detection Objects Reinforcement

Slide credit Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

54

Detection Objects Reinforcement

Slide credit Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and the Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
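A minimal sketch of the policy and of the Q-learning target, with the Q-network replaced by a plain list of estimated action values. The epsilon-greedy exploration, the discount factor and the replay capacity are assumptions for illustration; the slide only states the greedy policy and the use of a replay memory.

```python
import random
from collections import deque

replay_memory = deque(maxlen=1000)   # assumption: arbitrary capacity

def select_action(q_values, epsilon=0.0):
    """Pick the action with maximum estimated value; explore with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def remember(state, action, reward, next_state):
    """Store one experience tuple for later replay."""
    replay_memory.append((state, action, reward, next_state))

def td_target(reward, next_q_values, gamma=0.9):
    """Q-learning target: r + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)
```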

55

Detection Objects Reinforcement

Slide credit Míriam Bellver

56

Detection Objects Reinforcement

Slide credit Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

57

Detection Objects Reinforcement

Slide credit Míriam Bellver

Best performance with few region proposals

58

Detection Objects Reinforcement

Slide credit Míriam Bellver

59

Detection Objects Reinforcement

Slide credit Míriam Bellver

Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR 2015. [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Detection Faces DDFD Train

63

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
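The labelling rule above can be sketched as follows, assuming boxes are (x1, y1, x2, y2) tuples and square sub-windows. The function names and the sampling scheme are illustrative assumptions, not the authors' code.

```python
import random

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_window(img_w, img_h, size, rng):
    """Draw one random square sub-window that lies inside the image."""
    x = rng.randint(0, img_w - size)
    y = rng.randint(0, img_h - size)
    return (x, y, x + size, y + size)

def label_window(window, faces, threshold=0.5):
    """1 (positive) if IoU with any annotated face exceeds the threshold, else 0."""
    return 1 if any(iou(window, f) > threshold for f in faces) else 0
```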

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
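One way to read "3 times per octave" is that consecutive scale factors differ by 2^(1/3), so three steps double or halve the image size. The sketch below lists such factors under that assumption; the range limits and the function name are hypothetical.

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2 ** (k / steps_per_octave) lying within [min_scale, max_scale]."""
    k_min = math.ceil(steps_per_octave * math.log2(min_scale) - 1e-9)
    k_max = math.floor(steps_per_octave * math.log2(max_scale) + 1e-9)
    return [2 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]

scales = pyramid_scales(0.5, 2.0)   # one octave down and one octave up
```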

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
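A minimal sketch of enumerating 227x227 windows over an image. The stride of 32 pixels is an assumption chosen for illustration; the slide does not state it.

```python
def sliding_windows(img_w, img_h, win=227, stride=32):
    """Yield (x1, y1, x2, y2) for every window position that fits inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)

windows = list(sliding_windows(291, 259))
```

Each window would be scored by the face classifier, at every scale of the image pyramid.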

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Detection Faces DDFD Test

67

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
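A toy illustration of the conversion, under the assumption of a single fully-connected unit trained on k x k inputs: reusing its weights as a k x k filter over a larger input yields one score per position, that is, a heat-map. Sizes and names are hypothetical.

```python
def fc(patch, weights, bias):
    """Fully-connected unit applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias

def fc_as_conv(image, weights, bias, k):
    """Slide the same weights over the image (stride 1): one score per position."""
    rows, cols = len(image), len(image[0])
    heat = []
    for y in range(rows - k + 1):
        heat.append([fc([r[x:x + k] for r in image[y:y + k]], weights, bias)
                     for x in range(cols - k + 1)])
    return heat

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
heat = fc_as_conv(image, weights=[1, 0, 0, 1], bias=0, k=2)
```

On a k x k input the result is a single value equal to the original fully-connected output; on larger inputs the output grows with the image.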

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapping detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
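A minimal greedy NMS sketch, assuming detections are (score, box) pairs with boxes as (x1, y1, x2, y2). The overlap threshold of 0.3 is an assumption; this is a generic variant, not necessarily the one used in DDFD or in the cited tutorial.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.3):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

detections = [(0.9, (10, 10, 60, 60)),
              (0.8, (12, 12, 62, 62)),
              (0.7, (100, 100, 150, 150))]
kept = nms(detections)
```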

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

72

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

73

Faces Recognition FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity

74

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

75

Faces Recognition FaceNet

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
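The triplets mentioned above feed a triplet loss on the embedding. A minimal sketch for one triplet, assuming plain Python lists as embeddings; the margin value 0.2 is an assumption, and triplet selection is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin) for one triplet."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is why the choice of triplets matters for training.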

76

Faces Recognition FaceNet

77

Faces Recognition FaceNet

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 1 (NN1): ZF

78

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

79

Faces Recognition FaceNet

80

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

81

Faces Recognition FaceNet Software

Software implementation: OpenFace

82

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

83

Objects Recognition Retrieval

84

Objects Recognition Retrieval

Image Database

Visual Query: “A dog”

Expected outcome

85

Objects Recognition Retrieval

Image Database

Visual Query: “This dog”

Expected outcome

86

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

87

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

[Table: each visual word mapped to the IDs of the images that contain it]
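A minimal sketch of an inverted file for bag-of-visual-words retrieval, assuming each image is already described by a list of visual word IDs. The toy data and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}

def candidates(index, query_words):
    """Rank images by the number of visual words they share with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

index = build_inverted_file({1: [1, 2], 10: [3, 6], 12: [1, 3]})
ranking = candidates(index, [1, 3])
```

Because the representation is sparse, only the images listed under the query's words need to be visited.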

88

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

89

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

90

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
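A minimal sketch of sum and max pooling of a convolutional feature map into one global descriptor, assuming the map is stored as nested lists indexed [channel][y][x]. Normalization and weighting steps used by the cited methods are not shown.

```python
def pool_conv_features(fmap, mode="sum"):
    """Reduce fmap[c][y][x] to one value per channel (sum or max over all locations)."""
    aggregate = sum if mode == "sum" else max
    return [aggregate(v for row in channel for v in row) for channel in fmap]

fmap = [[[1, 2], [3, 4]],    # channel 0
        [[0, 5], [1, 1]]]    # channel 1
sum_descriptor = pool_conv_features(fmap, "sum")
max_descriptor = pool_conv_features(fmap, "max")
```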

91

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

92

Objects Recognition Retrieval

93

Objects Recognition Retrieval

Image resolution: 336x256

conv5_1 from VGG16 [1] (42x32)

25K centroids, 25K-D vector

94

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

95

Objects Recognition Retrieval

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters are shared by the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier
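A minimal sketch of a pixel-wise soft-max labelling, assuming the network output is available as a nested list `score_map[y][x]` holding one score per class. The names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_pixels(score_map):
    """Return, for every pixel, the index of the class with the highest probability."""
    labels = []
    for row in score_map:
        out = []
        for scores in row:
            probs = softmax(scores)
            out.append(probs.index(max(probs)))
        labels.append(out)
    return labels
```

Each pixel is labelled independently here, with no constraint tying neighbouring pixels together.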

Objects Segmentation Farabet

103

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1: Superpixels

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
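The selection rule above can be sketched on a toy segmentation tree. The tree, the costs and the class labels below are hypothetical values chosen so that leaf C2 ends up with the class of C5.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from the leaf up to the root."""
    best, node = leaf, leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)
    return best

# Hypothetical tree: leaf C2 -> C5 -> root C7
parent = {"C2": "C5", "C5": "C7", "C7": None}
cost = {"C2": 0.8, "C5": 0.2, "C7": 0.5}
label = {"C2": "road", "C5": "car", "C7": "sky"}

best = optimal_component("C2", parent, cost)
assigned_label = label[best]
```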


Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

[Diagram: BBOX CNN → feature vector 1; BBOX CNN → feature vector 2 (background masked out with the mean image); concatenation [1 2] = vector A]

The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Diagram: BBOX CNN → feature vector 1; REGION CNN → feature vector 2; concatenation [1 2] = vector B]

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Vector C

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

Penultimate fully connected layer → SVM

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
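A minimal sketch of the pixel IU (Jaccard index) for one class, assuming the predicted and ground-truth labelings are given as flat lists of class IDs of equal length. The function name is hypothetical.

```python
def pixel_iu(pred, gt, cls):
    """Intersection over union of the pixels labelled cls in pred and in gt."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else float("nan")

pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
iu_class_1 = pixel_iu(pred, gt, 1)   # 2 shared pixels out of 3 in the union
```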

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

119

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects

16

DPM (HOG features) [1], R-CNN [2], SPPnet [3]

Hand-crafted features vs. deep features

+60

Slide credit Amaia Salvador

Detection Objects

17

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015.

Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Parts-based Models (DPMs).

Detection Objects R-CNN

18

Girshick, R., Donahue, J., Darrell, T., & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Detection Objects R-CNN

19

Slide credit Joost van de Weijer

Detection Objects R-CNN

20

Slide credit Joost van de Weijer

Detection Objects R-CNN

21

Detection Objects Fast R-CNN

22

Girshick, Ross. Fast R-CNN. ICCV 2015.

Detection Objects Fast R-CNN

23

Slide credit Amaia Salvador

Detection Objects Fast R-CNN

24

Slide credit Amaia Salvador

Same as SPP [3], but single scale

Detection Objects Fast R-CNN

25

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.

Slide credit Joost van de Weijer

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

RoI pooling on the CONV5 feature map: a region of interest of size h x w is divided into H' x W' bins.

Size of pooling bins: h/H' x w/W'

Max pooling is applied within each bin.
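A minimal sketch of RoI max pooling on a single-channel feature map stored as nested lists. The bin boundaries use floor and ceiling rounding, which is one common choice and an assumption here; the function name is hypothetical.

```python
def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool roi = (y0, x0, h, w) of a 2-D feature map into out_h x out_w bins.

    Assumes h >= out_h and w >= out_w so that no bin is empty.
    """
    y0, x0, h, w = roi
    pooled = []
    for i in range(out_h):
        ys = y0 + (i * h) // out_h              # floor(i * h / H')
        ye = y0 + -(-(i + 1) * h // out_h)      # ceil((i + 1) * h / H')
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + -(-(j + 1) * w // out_w)
            row.append(max(v for r in fmap[ys:ye] for v in r[xs:xe]))
        pooled.append(row)
    return pooled

fmap = [[4 * r + c for c in range(4)] for r in range(4)]   # values 0..15
pooled = roi_max_pool(fmap, roi=(0, 0, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the region, the output has a fixed H' x W' size, which is what the FC layers require.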

Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss
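For reference, the multi-task loss as I recall it from the Fast R-CNN paper combines a classification term and a box regression term. The notation below (p for class probabilities, u for the true class, t^u for the predicted box offsets, v for the regression targets, lambda for the balancing weight) follows that paper and is not shown on the slide.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v)

L_{\mathrm{cls}}(p, u) = -\log p_{u}

L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t^{u}_{i} - v_{i}\right)

\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\, x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```

The indicator [u >= 1] switches off the regression term for background regions (u = 0).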

Detection Objects Faster R-CNN

29

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Object Proposal computation is the bottleneck in current state of the art object detection systems

Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores(objectno object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Diagram: Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5 → RPN → RPN proposals.

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Diagram: Conv layers (ImageNet weights, fine-tuned) → Conv Layer 5 → class probabilities.

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Diagram: Conv layers (weights from Step 2, fixed) → Conv Layer 5 → RPN → RPN proposals.

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3. Diagram: Conv layers (weights from Steps 2 & 3, fixed) → Conv Layer 5 → class probabilities, with the RPN proposals learned in Step 3.

[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions.
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark.

Set of states S: each state is a pair (o, h)

o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)
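The state described above can be assembled as a plain concatenation. The following is a minimal sketch, assuming that the 90-dimensional history is a one-hot encoding of the last 10 actions over 9 possible actions (10 x 9 = 90). That factorization and the names `encode_history` and `build_state` are assumptions for illustration; the slide only gives the two dimensions.

```python
HISTORY_LENGTH = 10  # assumption: number of past actions remembered
NUM_ACTIONS = 9      # assumption: number of distinct actions


def encode_history(past_actions):
    """One-hot encode the most recent actions into a flat binary vector h."""
    h = [0] * (HISTORY_LENGTH * NUM_ACTIONS)
    recent = past_actions[-HISTORY_LENGTH:]
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """Concatenate the CNN feature vector o with the action history h."""
    return list(o) + encode_history(past_actions)


o = [0.0] * 4096                      # placeholder for the fc6 descriptor
state = build_state(o, [0, 3, 8])     # three actions taken so far
print(len(state))                     # 4096 + 90 = 4186
```

Slots for actions that have not been taken yet stay at zero, so short histories need no special handling.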

Reward function R: measured against the ground-truth bounding box.

Reward function R for the trigger action:

The reward function considers the number of steps as a cost.

Trigger reward magnitude: 3

Minimum IoU: 0.6
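A minimal sketch of the two rewards, under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, a transformation action earns +1 when it increases the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 when the final IoU reaches 0.6 and -3 otherwise. The values 3 and 0.6 come from the slide; the +1/-1 rule and all function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def transformation_reward(old_box, new_box, gt_box):
    """+1 if the move improved the overlap with the ground truth, else -1 (assumed rule)."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the box is good enough when the search stops, else -eta."""
    return eta if iou(box, gt_box) >= min_iou else -eta
```

Because every non-improving step is penalized, long searches accumulate negative reward, which is one way to read the remark that the number of steps acts as a cost.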

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and uses a category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
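The two ingredients named here, a greedy policy over the network outputs and a replay memory, can be sketched in a few lines. This is an illustrative sketch only: the Q-network itself is left out, `q_values` stands for its output vector, and the class and function names are hypothetical.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Return the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch to train the Q-network on."""
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))


print(greedy_action([0.1, 0.9, 0.3]))  # 1
```

Sampling past experiences at random, rather than training on consecutive steps, is the usual motivation for a replay memory in Q-learning.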

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Training randomly samples sub-windows (blocks). A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
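The labelling rule for sampled sub-windows can be sketched as follows, assuming boxes are `(x1, y1, x2, y2)` tuples. The function names are illustrative and not taken from the DDFD software.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_positive(window, annotated_faces, threshold=0.5):
    """A sub-window is positive if it overlaps any annotated face by more than 50% IoU."""
    return any(iou(window, face) > threshold for face in annotated_faces)


faces = [(0, 0, 10, 12)]
print(is_positive((0, 0, 10, 10), faces))    # True  (IoU = 100 / 120)
print(is_positive((8, 8, 20, 20), faces))    # False (IoU = 8 / 256)
```

Most random windows miss every face, which is consistent with the strong imbalance between the 200K positive and 20M negative samples.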

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
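An octave is a factor of 2 in size, so 3 scales per octave means consecutive scales differ by a factor of 2^(1/3). A small sketch of the resulting scale list, where the number of octaves covered up and down is a free parameter chosen here only for illustration:

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors for an image pyramid with a fixed number of steps per octave."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]


scales = pyramid_scales(octaves_down=1, octaves_up=1)
print([round(s, 2) for s in scales])  # [0.5, 0.63, 0.79, 1.0, 1.26, 1.59, 2.0]
```

Each rescaled image is then scanned by the same fixed-size detector, so a small scale factor finds large faces and a large one finds small faces.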

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
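Why the conversion works can be checked on a toy case: a fully-connected unit that sees a k x k window is the same computation as one k x k convolution filter, and sliding that filter over a larger image yields one score per window position, i.e. a heat-map. The sketch below uses a single-channel image and a single output unit; all names and sizes are illustrative assumptions.

```python
def fc_unit(window, weights, bias):
    """Fully-connected unit applied to one flattened k x k window."""
    flat = [v for row in window for v in row]
    return sum(x * w for x, w in zip(flat, weights)) + bias


def fc_as_conv(image, weights, bias, k):
    """Apply the same weights as a k x k convolution over an image of any size."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    height, width = len(image), len(image[0])
    heat_map = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            acc = bias
            for dy in range(k):
                for dx in range(k):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        heat_map.append(row)
    return heat_map


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weights, bias = [1, 0, 0, 1], 0
print(fc_as_conv(image, weights, bias, k=2))           # [[6, 8], [12, 14]]
print(fc_unit([[1, 2], [4, 5]], weights, bias))        # 6, same as heat_map[0][0]
```

The convolutional form shares the computation between overlapping windows, which is where the efficiency gain over running the classifier window by window comes from.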

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
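A minimal sketch of greedy NMS, assuming boxes are `(x1, y1, x2, y2)` tuples with one confidence score each. This is the common score-sorted variant and is not necessarily the exact procedure of the cited source; the 0.5 overlap threshold is an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box and drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first (IoU = 81/119)
```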

Detection Faces DDFD Results

Precision vs. recall curves:

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.


Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
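The slides mention triplets without spelling out the objective. A minimal sketch of a triplet loss on embeddings follows, assuming squared Euclidean distance and a margin of 0.2; the margin value and the function names are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive by the margin."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
print(triplet_loss(anchor, [0.0, 1.0], [3.0, 0.0]))  # 0.0: already well separated
print(triplet_loss(anchor, [1.0, 0.0], [1.0, 0.0]))  # 0.2: the triplet is still informative
```

Triplets that already give zero loss contribute nothing to training, which is why the choice of triplets matters.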

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." British Machine Vision Conference (BMVC), 2015. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image database + visual query "A dog" → expected outcome [figure]

Image database + visual query "This dog" → expected outcome [figure]

Instance Retrieval (instance: object, building, person, place…)

Pipeline: local hand-crafted features (e.g. SIFT) → N-dimensional feature space → Bag of Visual Words (high-dimensional, highly sparse) → inverted file

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE: table with columns "word" and "Image ID", listing for each visual word the IDs of the images that contain it.
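Because Bag of Visual Words vectors are highly sparse, an inverted file only stores, for each visual word, the images in which it occurs. A minimal sketch with made-up image IDs and word IDs; the voting score (number of shared words) is a simplification chosen for illustration, not the scoring used in the cited work.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def search(inverted, query_words):
    """Rank images by how many query words they share (ties broken by image ID)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


database = {1: [1, 2], 10: [3, 6], 12: [1, 3]}   # hypothetical image ID -> visual words
inverted = build_inverted_file(database)
print(sorted(inverted[1]))        # [1, 12]
print(search(inverted, [1, 3]))   # [12, 1, 10]
```

Only the lists of the words present in the query are visited, so images sharing no word with the query are never touched.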

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
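Sum or max pooling collapses each channel of a convolutional layer to a single number, so a layer with C channels gives a C-dimensional global descriptor regardless of image size. A minimal sketch on nested lists, with feature maps laid out as channels x height x width (an assumed layout for illustration):

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each H x W feature map to one value; returns one value per channel."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in fmap for v in row) for fmap in feature_maps]


maps = [
    [[1, 2], [3, 4]],   # channel 0
    [[0, 5], [1, 1]],   # channel 1
]
print(pool_conv_features(maps, "sum"))  # [10, 7]
print(pool_conv_features(maps, "max"))  # [4, 5]
```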

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Bags of local convolutional features: image at (336x256) resolution → conv5_1 from VGG16 [1] → (42x32) feature map → 25K centroids → 25K-D vector

Query representation: Global Search (GS) and Local Search (LS)


Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales.

The same parameters are used in the three convnets:

theta_i = theta_0 = (filter weights H_l and biases b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation.

Pixel-wise soft-max classifier.

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing.

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

Example: C2 will be labelled with the class of C5.
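The selection rule amounts to walking from a leaf up to the root of the segmentation tree and keeping the cheapest node on the way. A minimal sketch on a hypothetical tree, with made-up costs and a `parent` dictionary as the tree representation (both are assumptions for illustration):

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from a leaf to the root."""
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)   # the root has no parent, which ends the walk
    return best


# Hypothetical tree: pixel -> C2 -> C5 -> root
parent = {"pixel": "C2", "C2": "C5", "C5": "root", "root": None}
cost = {"pixel": 0.9, "C2": 0.6, "C5": 0.2, "root": 0.7}
print(optimal_component("pixel", parent, cost))  # C5
```

In this toy tree every pixel under C2 picks C5 as its component, which mirrors the slide's example of C2 being labelled with the class of C5.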

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the BBOX CNN produces feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; both are concatenated as [1 2]. The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: the BBOX CNN produces feature vector 1 and a REGION CNN produces feature vector 2, concatenated as [1 2].

- Training: the 2 networks are trained in isolation.
- Testing: results are combined.

Vector C:

- Training: as a whole (using segmentation overlap).
- Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Results on pixel IU (Jaccard index) to evaluate semantic segmentation: the output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
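Pixel IU for one category is the Jaccard index between the set of pixels predicted as that category and the set labelled as that category in the ground truth. A minimal sketch on small label maps stored as nested lists (the data layout and function name are illustrative):

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index between predicted and ground-truth pixels of one category."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for p, t in zip(pred_row, true_row):
            in_pred = p == category
            in_true = t == category
            intersection += in_pred and in_true
            union += in_pred or in_true
    return intersection / union if union else 0.0


predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
print(pixel_iu(predicted, ground_truth, category=1))  # 1 shared pixel out of 3
```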


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects

Girshick, Ross, Forrest Iandola, Trevor Darrell, and Jitendra Malik. "Deformable Part Models are Convolutional Neural Networks." CVPR 2015.

Convnets (CNNs) actually learn detectors similar to the ones learned by Deformable Parts-based Models (DPMs).

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

Same as SPP [3], but single scale.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015. (Slide credit: Joost van de Weijer)

Pooling on CONV5: a region of size h x w is divided into H' x W' bins. The size of the pooling bins is h/H' x w/W', and max pooling is applied within each bin.
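The bin arithmetic can be made concrete with a small sketch of max pooling over a region of interest on a single-channel feature map. It assumes the region is given as `(x0, y0, x1, y1)` in feature-map coordinates with exclusive end indices, and rounds bin borders outwards when h or w is not divisible by the grid size; those conventions and the function name are assumptions for illustration.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1) into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        y_start = y0 + (i * h) // out_h
        y_end = y0 + -(-((i + 1) * h) // out_h)      # ceiling division
        row = []
        for j in range(out_w):
            x_start = x0 + (j * w) // out_w
            x_end = x0 + -(-((j + 1) * w) // out_w)  # ceiling division
            row.append(max(feature_map[y][x]
                           for y in range(y_start, y_end)
                           for x in range(x_start, x_end)))
        pooled.append(row)
    return pooled


feature_map = [[0, 1, 2, 3],
               [4, 5, 6, 7],
               [8, 9, 10, 11],
               [12, 13, 14, 15]]
print(roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the region, the output grid has a fixed size, which is what lets the fully-connected layers that follow accept proposals of any shape.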

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
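The transcript keeps only the title of this slide. For reference, the multi-task loss as formulated in the Fast R-CNN paper cited above combines a classification term and a box regression term. In the notation below, p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the regression target, and lambda a balancing weight; the bracket [u >= 1] equals 1 for object classes and 0 for background, so background regions contribute no localization loss.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v)

L_{\mathrm{cls}}(p, u) = -\log p_{u}

L_{\mathrm{loc}}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t^{u}_{i} - v_{i}\right)

\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```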

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN).

Architecture: Conv layers → Conv Layer 5 → RPN → RPN proposals; Conv Layer 5 + RPN proposals → RoI pooling layer → FC layers → class scores (class probabilities).

RPN outputs: objectness scores (object / no object) and bounding box regression, for k anchors at each position.

In practice, k = 9 (3 different scales and 3 aspect ratios).
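The count k = 9 comes from pairing every scale with every aspect ratio. A minimal sketch that generates the 9 reference boxes around one position follows; the specific scales (128, 256, 512 pixels) and ratios (0.5, 1, 2) are assumptions chosen for illustration, as the slide only states that there are 3 of each.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:            # ratio = height / width
            width = scale * (1.0 / ratio) ** 0.5
            height = scale * ratio ** 0.5   # keeps the area close to scale ** 2
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))  # 9
```

For every such box the RPN predicts an objectness score and a box refinement, so each position yields 9 candidate proposals.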

Fast R-CNN

4-step training to share features for RPN and Fast R-CNN.

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 1 Train RPN initialized with an ImageNet pre-trained model

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 1)

Class probabilities

Step 2 Train Fast R-CNN with learned RPN proposals

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks

Detection Faces DDFD Train

Sub-windows (blocks) are randomly sampled. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise

Total samples: 200K positive and 20M negative
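
The labelling rule above can be sketched as follows. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, annotated_faces, threshold=0.5):
    """Return 1 (positive) if the window overlaps any face with IoU > threshold."""
    return int(any(iou(window, face) > threshold for face in annotated_faces))
```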

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
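
One way to read "3 times per octave" is that consecutive scales differ by a factor of 2^(1/3). The sketch below lists such scale factors; the function name and the octave range are hypothetical.

```python
def pyramid_scales(min_octave, max_octave, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) covering the given octave range."""
    first = min_octave * steps_per_octave
    last = max_octave * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]


# From half size (-1 octave) to double size (+1 octave)
scales = pyramid_scales(-1, 1)
```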

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
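
The conversion can be illustrated with NumPy: the weights of a fully-connected layer trained on a fixed-size input are reshaped into convolution kernels and slid over a larger feature map. This is a didactic sketch with hypothetical names and explicit loops, not the implementation used in the cited work.

```python
import numpy as np


def fc_as_conv(feature_map, fc_weights, kernel_hw):
    """Apply FC weights of shape (num_out, C*kh*kw) as a valid convolution.

    feature_map has shape (C, H, W); the result has shape
    (num_out, H - kh + 1, W - kw + 1), i.e. one FC output per window position.
    """
    c, h, w = feature_map.shape
    kh, kw = kernel_hw
    kernels = fc_weights.reshape(-1, c, kh, kw)
    out = np.empty((kernels.shape[0], h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = feature_map[:, i:i + kh, j:j + kw]
            # Sum over channel and both spatial axes of the kernel
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out
```

On an input of exactly the training size the output is a single 1x1 position that matches the original FC layer; on a larger input each position is the FC response for one window.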

Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
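
A minimal sketch of greedy NMS follows. It is a generic version, not the code from the cited source; the IoU threshold of 0.5 and the `(x1, y1, x2, y2)` box format are assumptions.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop boxes that overlap it too much, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]
    return keep
```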

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
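
Learning from triplets is usually expressed as a triplet loss: the anchor should be closer to the positive (same identity) than to the negative (different identity) by some margin. The sketch below assumes squared Euclidean distances and a margin of 0.2; both the margin value and the function name are assumptions, not read from the slides.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), computed per triplet."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(pos_dist - neg_dist + margin, 0.0)


a = np.array([1.0, 0.0])
same = np.array([1.0, 0.0])
other = np.array([0.0, 1.0])
easy = triplet_loss(a, same, other)   # already well separated
hard = triplet_loss(a, other, same)   # violates the margin
```

Only triplets with a non-zero loss contribute a gradient, which is why the choice of triplets matters.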

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision, ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (2015). [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

Visual query against an image database. Expected outcome: "A dog"

Visual query against an image database. Expected outcome: "This dog"

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE (columns: word, image ID): 1 1 12 2 1 30 1023 10 124 23 6 10
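
An inverted file maps each visual word to the images that contain it, so a query only touches images that share at least one word with it. The sketch below uses hypothetical word and image IDs and a simple vote count as the ranking score.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word IDs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return [image_id for image_id, _ in votes.most_common()]


# Hypothetical toy database: three images described by visual word IDs
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2, 4]})
ranking = search(index, [1, 2])
```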

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

Image resolution: (336x256)

Local features: conv5_1 from VGG16 [1], giving a (42x32) feature map

25K centroids, 25K-D vector
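
The encoding can be sketched as assigning every spatial location of the conv feature map to its nearest centroid and counting the assignments. The function name is hypothetical, and the brute-force distance computation below is only practical for toy sizes; with 25K centroids an approximate nearest-neighbour search would be needed.

```python
import numpy as np


def bow_encode(conv_features, centroids):
    """conv_features: (C, H, W); centroids: (K, C).

    Returns an (H, W) map of visual word IDs and a K-bin histogram.
    """
    c, h, w = conv_features.shape
    local = conv_features.reshape(c, h * w).T  # one C-dim feature per location
    # Squared distance from every local feature to every centroid
    d2 = ((local[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    histogram = np.bincount(assignments, minlength=len(centroids))
    return assignments.reshape(h, w), histogram
```

Keeping the (H, W) assignment map, and not only the histogram, makes it possible to build a descriptor from just a sub-region of the image.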

Objects Recognition Retrieval

Query Representation:

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Pixel-wise soft-max classifier
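
A pixel-wise soft-max turns a stack of per-class score maps into a probability distribution at every pixel. The sketch below assumes scores of shape (K, H, W) for K classes; the function names are hypothetical.

```python
import numpy as np


def pixelwise_softmax(scores):
    """scores: (K, H, W) class scores -> (K, H, W) per-pixel probabilities."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def label_map(scores):
    """Most likely class at every pixel."""
    return pixelwise_softmax(scores).argmax(axis=0)
```

Each pixel is classified independently of its neighbours here.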

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
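
The selection rule can be sketched on a small tree: for every leaf, walk up to the root and keep the node with the lowest cost. The tree, the node names and the cost values below are hypothetical, chosen so that leaf C2 ends up covered by C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the minimal-cost node on its path to the root.

    parent: dict node -> parent node (the root maps to None)
    cost:   dict node -> cost S of that component
    """
    cover = {}
    for leaf in leaves:
        best = leaf
        node = parent[leaf]
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        cover[leaf] = best
    return cover


# Hypothetical segmentation tree with root C7
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.5, "C2": 0.6, "C3": 0.2, "C5": 0.3, "C6": 0.4, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```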

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

[Diagram: BBOX CNN produces feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the concatenation [1 2] is vector A]

Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Diagram: BBOX CNN produces feature vector 1, REGION CNN produces feature vector 2; the concatenation [1 2] is vector B]

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer, followed by an SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
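
Pixel IU can be sketched as the per-class Jaccard index between predicted and ground-truth label maps, averaged over classes. Averaging only over classes that appear in either map is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Mean per-class intersection-over-union between two label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))


pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
score = pixel_iu(pred, gt, num_classes=2)
```

In this toy case class 0 scores 1/2 and class 1 scores 2/3.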

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 18: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects R-CNN

Girshick, R., Donahue, J., Darrell, T., & Malik, J. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

Detection Objects R-CNN

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Same as SPP [3] but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

[Diagram: a region of interest of size h x w on CONV5 is divided into an H' x W' grid of pooling bins]

Size of pooling bins: h/H' x w/W'

Max pooling inside each bin
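
RoI max pooling can be sketched in NumPy: the region is cut out of the feature map, split into a fixed grid of bins, and the maximum is taken inside each bin. The function name, the `(y0, x0, y1, x1)` RoI format and the rounding of bin borders are assumptions of this sketch.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_h, out_w):
    """feature_map: (C, H, W); roi: (y0, x0, y1, x1) with exclusive ends.

    Returns a (C, out_h, out_w) array regardless of the RoI size.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_h + 1)  # bin borders along the height
    xs = np.linspace(0, w, out_w + 1)  # bin borders along the width
    out = np.empty((c, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)  # at least one row per bin
        for j in range(out_w):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Because the output size is fixed, regions of any shape can feed the same fully-connected layers.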

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
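
The multi-task loss is commonly written as a classification term plus a box-regression term that is only active for non-background classes. The sketch below assumes the form used in the Fast R-CNN paper (log loss plus smooth L1, with class 0 as background and weight `lam`); the function names are hypothetical.

```python
import numpy as np


def smooth_l1(x):
    """0.5 * x^2 if |x| < 1, else |x| - 0.5 (elementwise)."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss for foreground RoIs."""
    cls_loss = -np.log(class_probs[true_class])
    if true_class >= 1:  # class 0 is background: no box to regress
        loc_loss = smooth_l1(box_pred - box_target).sum()
    else:
        loc_loss = 0.0
    return float(cls_loss + lam * loc_loss)
```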

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems

[Selective Search] Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In ICCV 2011 (pp. 1879-1886). IEEE.

[CPMC] Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In CVPR 2010 (pp. 3241-3248). IEEE.

[MCG] Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In CVPR 2014 (pp. 328-335).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: conv layers up to Conv Layer 5 feed the RPN, which outputs RPN proposals; the proposals go through the RoI pooling layer and FC layers to produce class scores and class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
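
The k = 9 reference boxes at one position can be sketched as every combination of 3 scales and 3 aspect ratios. The concrete scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are assumed defaults, and the function name is hypothetical.

```python
import numpy as np


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (len(scales) * len(ratios), 4) array of (x1, y1, x2, y2) boxes.

    Each box is centred on (cx, cy), has area scale**2 and height/width = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)


anchors = make_anchors(0.0, 0.0)
```

For each of these boxes the RPN predicts an objectness score and a bounding-box regression.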

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores and class probabilities; slide label: Fast R-CNN]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

4-step training to share features for RPN and Fast R-CNN

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned)

Step 2: Train Fast R-CNN with the learned RPN proposals (learned in Step 1). ImageNet weights (fine-tuned)

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Weights from Step 2 (fixed)

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

72

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces Recognition FaceNet

73

Faces -> FaceNet -> Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

74

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244

Faces Recognition FaceNet

75

... by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
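The triplets feed a triplet loss, which pulls an anchor face towards a positive (same identity) and pushes it away from a negative (different identity) by at least a margin. A minimal sketch on plain Python lists, where the function names and the margin value of 0.2 are assumptions chosen for illustration:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Easy triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0, 0], [0, 1], [3, 0])
# Harder triplet: the negative is as close as the positive.
hard = triplet_loss([0, 0], [0, 1], [1, 0])
```

Triplets with zero loss contribute no gradient, which is why the choice of triplets matters during training.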

Faces Recognition FaceNet

76

Faces Recognition FaceNet

77

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

78

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition FaceNet

79

Faces Recognition FaceNet Test

80

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

81

Software implementation: OpenFace

Faces Recognition VGG Face

82

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016

83

Objects Recognition Retrieval

Objects Recognition Retrieval

84

Image Database

Visual Query

"A dog"

Expected outcome

Objects Recognition Retrieval

85

Image Database

Visual Query

"This dog"

Expected outcome

Objects Recognition Retrieval

86

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

87

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse
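A toy sketch of the inverted file idea: each visual word points to the images that contain it, so a query only needs to visit images that share at least one word with it. The data and the function names `build_inverted_file` and `candidate_images` are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image ids that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def candidate_images(inverted, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Toy database: image id -> visual words found in that image.
image_words = {1: [1, 2, 3], 2: [2, 4], 3: [5]}
index = build_inverted_file(image_words)
ranking = candidate_images(index, [2, 3])
```

Because bag-of-visual-words vectors are highly sparse, most images are never touched for a given query, which is what makes this structure scale.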

Objects Recognition Retrieval

88

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)

Convolutional Neural Networks

Objects Recognition Retrieval

89

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision, CVPRW 2014

Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval

90

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065

Convolutional Neural Networks: sum/max pooled conv features as global representation
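As a sketch of this idea, assuming the conv features are stored as nested lists indexed as [channel][row][column] (the function name `pool_conv_features` is hypothetical), sum or max pooling over the spatial positions gives one value per channel:

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel of a conv layer into a single value."""
    reduce = sum if mode == "sum" else max
    return [reduce(value for row in channel for value in row)
            for channel in feature_maps]


# Two channels of 2x2 activations (toy values).
feature_maps = [[[1, 2], [3, 4]], [[0, 5], [1, 1]]]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
```

The resulting vector has as many dimensions as the layer has channels, regardless of the image size.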

Objects Recognition Retrieval

91

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision, CVPRW 2015

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Objects Recognition Retrieval

92

Objects Recognition Retrieval

93

Resolution: 336x256

conv5_1 from VGG16 [1] (42x32)

25K centroids -> 25K-D vector

Objects Recognition Retrieval

94

Query Representation

Global Search (GS)

Local Search (LS)

Objects Recognition Retrieval

95

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013


Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters are shared by the three convnets

theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}

Non-linearity: tanh

Pooling: max


Objects Segmentation Farabet

101

Upsampling and concatenation


Objects Segmentation Farabet

102

Pixel-wise soft-max classifier
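A minimal sketch of a pixel-wise soft-max classifier, assuming the network outputs a list of class scores at every pixel (`score_map[y][x]`). The names are hypothetical and the scores are toy values.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Assign to every pixel the class with the highest soft-max probability."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


# One row with two pixels and two classes.
labels = label_pixels([[[2.0, 1.0], [0.0, 3.0]]])
```

Each pixel is labelled independently of its neighbours, so nothing in this step enforces spatial consistency.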

Objects Segmentation Farabet

103

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1: Superpixels

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
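The selection rule can be sketched as follows, assuming the segmentation tree is given as a child-to-parent dictionary and each component has a precomputed cost S. The tree, the costs and the name `optimal_cover` are made up for illustration.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost component on its path to the root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 into C6, C5 and C6 into root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.9, "C2": 0.8, "C3": 0.2, "C5": 0.3, "C6": 0.6, "C7": 0.7}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

With these toy costs, C2 takes the label of C5, while C3 keeps its own label because no ancestor has a lower cost.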

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

Vector A: the BBOX CNN is applied twice and the two feature vectors are concatenated as [1 2]

- Feature vector 1: BBOX CNN
- Feature vector 2: BBOX CNN, background masked out with the mean image

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Vector B: feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN, concatenated as [1 2]

Training: the 2 networks are trained in isolation

Testing: results are combined

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)


Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM


Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
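A minimal sketch of pixel IU for one category, assuming predicted and ground-truth label maps stored as nested lists of category ids (the name `pixel_iu` is hypothetical):

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index: pixels in both masks divided by pixels in either mask."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred = pred == category
            in_gt = gt == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    return intersection / union if union else 0.0


# Toy 2x2 label maps with categories 0 and 1.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
iu_category_1 = pixel_iu(predicted, ground_truth, 1)
```

Here one pixel is labelled 1 in both maps and three pixels are labelled 1 in at least one of them, so the IU is 1/3.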


Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

119

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Detection Objects R-CNN

19

Slide credit Joost van de Weijer


Detection Objects R-CNN

20

Slide credit Joost van de Weijer


Detection Objects R-CNN

21

Detection Objects Fast R-CNN

22

Girshick, Ross. Fast R-CNN. ICCV 2015


Detection Objects Fast R-CNN

23

Slide credit Amaia Salvador


Detection Objects Fast R-CNN

24

Slide credit Amaia Salvador

Same as SPP[3] but single scale

Detection Objects Fast R-CNN

25

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015

Slide credit Joost van de Weijer

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

RoI pooling on CONV5: a region of size h x w is divided into a grid of H' x W' bins

Size of pooling bins: h/H' x w/W'

Max pooling is applied within each bin
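A simplified, single-channel sketch of RoI max pooling, assuming the RoI is given as (top, left, height, width) in feature-map cells and that bin boundaries are rounded outwards so that no bin is empty. The name `roi_max_pool` and the rounding rule are assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w region into a fixed out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin rows: floor for the start, ceiling for the end.
        y0 = top + (i * h) // out_h
        y1 = top + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            x0 = left + (j * w) // out_w
            x1 = left + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled


feature_map = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
whole = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)
partial = roi_max_pool(feature_map, (1, 1, 3, 3), 2, 2)
```

Both regions produce a 2x2 output even though their sizes differ, which is what lets regions of any size feed fixed-size fully-connected layers.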


Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]


Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss

Detection Objects Faster R-CNN

29

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99) [Python code] [Matlab code]

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search, CPMC, MCG

Replace the use of external object proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

Faster R-CNN architecture: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores -> Class probabilities

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

Faster R-CNN architecture: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores -> Class probabilities

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
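As an illustration, the k = 9 anchor shapes at one location can be generated as below. The scale values (128, 256 and 512 pixels) and the aspect ratios (0.5, 1 and 2) are assumed for the example, as is the function name `make_anchors`.

```python
def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs, one per scale and aspect ratio."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio is height divided by width
            # Keep the area equal to scale**2 while changing the shape.
            width = scale / ratio ** 0.5
            height = scale * ratio ** 0.5
            anchors.append((width, height))
    return anchors


anchors = make_anchors()
```

For every anchor the RPN predicts an objectness score and a bounding-box regression, so each location on the feature map yields k proposals.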

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

Faster R-CNN architecture: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores -> Class probabilities


Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

Faster R-CNN architecture: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals -> RoI pooling layer -> FC layers -> Class scores -> Class probabilities

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

Step 1: Train RPN initialized with an ImageNet pre-trained model

Conv layers (ImageNet weights, fine-tuned) -> Conv Layer 5 -> RPN -> RPN Proposals

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

Step 2: Train Fast R-CNN with learned RPN proposals

Conv layers (ImageNet weights, fine-tuned) -> Conv Layer 5, with RPN Proposals (learned in 1) -> Class probabilities

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Step 3: The model trained in 2 is used to initialize RPN and train again

Conv layers (weights from Step 2, fixed) -> Conv Layer 5 -> RPN -> RPN Proposals

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Step 4: Fine-tune the FC layers of Fast R-CNN, using the same shared convolutional layers as in 3

Conv layers (weights from Steps 2 & 3, fixed) -> Conv Layer 5, with RPN Proposals (learned in 3) -> Class probabilities


Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)


Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador


Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador


Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Detection Objects Reinforcement L

46

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015 [Slides by Miriam Bellver]

Detection Objects Reinforcement L

47

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement

48

Slide credit Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement

49

Slide credit Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement

50

Slide credit Míriam Bellver

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)

Detection Objects Reinforcement

51

Slide credit Míriam Bellver

Reward Function R

ground-truth bounding box

Detection Objects Reinforcement

52

Slide credit Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost

Trigger reward value: 3

Minimum IoU: 0.6
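A minimal sketch of the two rewards, assuming the IoU values with the ground-truth box are already computed. The movement reward is assumed here to be +1 when the IoU improves and -1 otherwise; the trigger reward uses the values 3 and 0.6 shown on the slide. Function and parameter names are hypothetical.

```python
def movement_reward(iou_before, iou_after):
    """Reward a transformation action by whether it improved the IoU."""
    return 1.0 if iou_after > iou_before else -1.0


def trigger_reward(final_iou, eta=3.0, tau=0.6):
    """Reward the trigger action: +eta if the final IoU reaches tau, else -eta."""
    return eta if final_iou >= tau else -eta


good_move = movement_reward(0.3, 0.5)
bad_move = movement_reward(0.5, 0.3)
good_stop = trigger_reward(0.7)
early_stop = trigger_reward(0.5)
```

The larger magnitude of the trigger reward encourages the agent to stop only once the box overlaps the object well enough.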

Detection Objects Reinforcement

53

Slide credit Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

54

Slide credit Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions

The algorithm incorporates a replay memory to collect experiences

Category-specific Q-network

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
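A sketch of this policy, assuming the Q-network output is available as a list of estimated action values. The optional `epsilon` parameter for random exploration is an assumption added for illustration; with the default of 0 the policy is purely greedy.

```python
import random


def select_action(q_values, epsilon=0.0, rng=random):
    """Pick the action with the highest estimated value (greedy policy)."""
    # With probability epsilon, explore by picking a random action instead.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])


# Toy action values for three actions.
best_action = select_action([0.1, 0.7, 0.2])
```

The returned index identifies one output unit of the Q-network, that is, one of the available actions.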

Detection Objects Reinforcement

55

Slide credit Míriam Bellver

Detection Objects Reinforcement

56

Slide credit Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Detection Objects Reinforcement

57

Slide credit Míriam Bellver

Best performance with few region proposals

Detection Objects Reinforcement

58

Slide credit Míriam Bellver

Detection Objects Reinforcement

59

Slide credit Míriam Bellver


Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015) [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train

63

Sub-windows (blocks) are randomly sampled. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise

Total samples: 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image Database
Visual Query: "A dog"
Expected outcome

Image Database
Visual Query: "This dog"
Expected outcome

Instance Retrieval (instance: object, building, person, place…)

Objects Recognition: Retrieval

Local hand-crafted features (e.g. SIFT) → N-dimensional feature space → Bag of Visual Words

v1 = (v11, …, v1n)
…
vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE (table in the original slide): each visual word is listed with the IDs of the images that contain it
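The inverted file idea can be sketched as follows. This is a minimal illustration, not the indexing code of any cited system: the toy image IDs and visual words are made up, and `build_inverted_file` and `candidates` are hypothetical helper names. Each visual word maps to the images that contain it, so a query only touches images sharing at least one word with it.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}

def candidates(index, query_words):
    """Count, per image, how many distinct query words it shares."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    # Rank by number of shared words (ties broken by image ID).
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy database (assumed data): image ID -> visual words found in that image.
database = {1: [1, 2], 10: [3, 4], 12: [1, 3], 30: [2]}
index = build_inverted_file(database)
ranking = candidates(index, [1, 3])
```

Because the bag-of-words vectors are highly sparse, most words are absent from most images, which keeps each posting list short.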

Objects Recognition: Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
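As a rough sketch of sum/max pooling (not the exact pipeline of any of the cited papers), each of the C convolutional feature maps is collapsed to a single number, giving a C-dimensional global descriptor that is then L2-normalised. The toy values and the helper names `pool_conv_features` and `cosine_similarity` are assumptions made for illustration.

```python
import math

def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps (each H x W) into one C-dimensional vector."""
    agg = sum if mode == "sum" else max
    vector = [agg(v for row in channel for v in row) for channel in feature_maps]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]  # L2-normalised descriptor

def cosine_similarity(a, b):
    """Dot product of two L2-normalised descriptors."""
    return sum(x * y for x, y in zip(a, b))

# Toy conv output with C=2 channels of size 2x2 (assumed values).
maps = [[[1.0, 0.0], [2.0, 0.0]],
        [[0.0, 4.0], [0.0, 0.0]]]
sum_desc = pool_conv_features(maps, "sum")   # proportional to (3, 4)
max_desc = pool_conv_features(maps, "max")   # proportional to (2, 4)
```

Images can then be ranked by the cosine similarity between the query descriptor and each database descriptor.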

Objects Recognition: Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
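A minimal sketch of VLAD encoding, assuming hard assignment and a global L2 normalisation (real systems usually add further normalisation steps and learn the centroids with k-means): each local descriptor is assigned to its nearest centroid, the residuals to that centroid are accumulated, and the per-centroid residual vectors are concatenated. The toy centroids and features are made up.

```python
import math

def vlad(descriptors, centroids):
    """Encode local descriptors as concatenated residuals to their nearest centroid."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for x in descriptors:
        # Hard-assign the descriptor to its closest centroid (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        for d in range(dim):
            residuals[k][d] += x[d] - centroids[k][d]
    flat = [v for r in residuals for v in r]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

# Toy data (assumed): 2 centroids in 2-D, 3 local features.
centroids = [[0.0, 0.0], [10.0, 10.0]]
local_feats = [[1.0, 0.0], [0.0, 1.0], [10.0, 13.0]]
code = vlad(local_feats, centroids)   # length = num_centroids * dim = 4
```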

Objects Recognition: Retrieval

Image resolution (336x256) → conv5_1 from VGG16 [1] → (42x32) local features → 25K centroids → 25K-D vector
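The assignment step can be sketched as below, under simplifying assumptions: tiny toy features instead of conv5_1 activations, three centroids instead of 25K, and hypothetical helper names. Every spatial position of the feature grid is replaced by the index of its nearest centroid, and the counts form a sparse bag of words. Note that a 42x32 grid holds 42 x 32 = 1344 local features, so a 25K-dimensional histogram has at most 1344 non-zero entries.

```python
def assign_words(feature_grid, centroids):
    """Replace each local feature in an H x W grid by the index of its nearest centroid."""
    def nearest(x):
        return min(range(len(centroids)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
    return [[nearest(x) for x in row] for row in feature_grid]

def bow_histogram(assignment_map, vocabulary_size):
    """Sparse bag of words: visual word -> count, over all grid positions."""
    hist = {}
    for row in assignment_map:
        for word in row:
            hist[word] = hist.get(word, 0) + 1
    assert all(0 <= w < vocabulary_size for w in hist)
    return hist

# Toy 2x2 grid of 2-D local features and 3 centroids (assumed values).
grid = [[[0.1, 0.0], [0.9, 1.0]],
        [[1.1, 1.0], [5.0, 5.2]]]
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
words = assign_words(grid, centroids)
hist = bow_histogram(words, vocabulary_size=3)
```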

Objects Recognition: Retrieval

Query Representation
- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets: $\theta_i = \theta_0$ (filter weights $H_l$ and biases $b_l$)
Non-linearity: tanh
Pooling: max

Upsampling and concatenation

Pixel-wise soft-max classifier
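The pipeline above (shared parameters across three scales, upsampling, concatenation, pixel-wise soft-max) can be illustrated with a deliberately tiny sketch. It is not the architecture of the paper: the "convnet" is replaced by a hypothetical one-parameter function `shared_convnet`, the pyramid is built by plain subsampling, and all weights are made-up values.

```python
import math

def downsample(img, factor):
    """Keep every `factor`-th pixel (crude stand-in for a proper pyramid)."""
    return [row[::factor] for row in img[::factor]]

def upsample(fmap, factor):
    """Nearest-neighbour upsampling back to the original resolution."""
    return [[v for v in row for _ in range(factor)] for row in fmap for _ in range(factor)]

def shared_convnet(img, weight=0.5, bias=0.1):
    """Stand-in for the convnet: the SAME parameters are used at every scale."""
    return [[math.tanh(weight * v + bias) for v in row] for row in img]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixelwise_class_probabilities(img, class_weights, scales=(1, 2, 4)):
    maps = [upsample(shared_convnet(downsample(img, s)), s) for s in scales]
    probs = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            feat = [m[y][x] for m in maps]            # concatenation across scales
            scores = [sum(w * f for w, f in zip(ws, feat)) for ws in class_weights]
            row.append(softmax(scores))               # one distribution per pixel
        probs.append(row)
    return probs

image = [[float(x + y) for x in range(4)] for y in range(4)]   # toy 4x4 image
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]   # 2 classes x 3 scales (assumed values)
probs = pixelwise_class_probabilities(image, weights)
```

The image side must be divisible by the largest scale factor for this simplified upsampling to restore the original size.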

Objects Segmentation: Farabet

Problem: no spatial consistency among labels

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels
Prediction with a 2-layer network
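One way to enforce consistency with superpixels can be sketched as follows, assuming the per-pixel class distributions are already available and that each superpixel simply takes the class with the highest average probability. The toy probabilities and the helper name `label_superpixels` are illustrative assumptions.

```python
def label_superpixels(pixel_probs, superpixel_ids):
    """Average per-pixel class distributions inside each superpixel, then take the argmax."""
    sums, counts = {}, {}
    for probs_row, ids_row in zip(pixel_probs, superpixel_ids):
        for probs, sp in zip(probs_row, ids_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for c, p in enumerate(probs):
                acc[c] += p
            counts[sp] = counts.get(sp, 0) + 1
    winner = {sp: max(range(len(acc)), key=lambda c: acc[c] / counts[sp])
              for sp, acc in sums.items()}
    return [[winner[sp] for sp in ids_row] for ids_row in superpixel_ids]

# Toy 2x2 image with 2 classes; pixel (0, 1) alone would be class 1 (assumed values).
pixel_probs = [[[0.9, 0.1], [0.4, 0.6]],
               [[0.8, 0.2], [0.1, 0.9]]]
superpixel_ids = [[0, 0],
                  [0, 1]]
labels = label_superpixels(pixel_probs, superpixel_ids)
```

In the toy example the isolated class-1 prediction at pixel (0, 1) is overruled by its superpixel, which is the spatial smoothing effect the slide refers to.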

Objects Segmentation: Farabet

Solution 2: Superpixels + CRF
Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
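The optimal-cover rule can be sketched on a toy tree. The tree shape, the costs and the class names below are invented for illustration; only the rule itself (walk from the leaf to the root and keep the component with minimal cost S) comes from the slide.

```python
def optimal_cover_labels(parent, cost, node_class, leaves):
    """For each leaf, pick the lowest-cost component on its path to the root."""
    labels = {}
    for leaf in leaves:
        best, node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        labels[leaf] = (best, node_class[best])
    return labels

# Toy tree (assumed): root C7 -> {C5, C6}; C5 -> {C1, C2}; C6 -> {C3, C4}.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C4": 0.1,
        "C5": 0.3, "C6": 0.5, "C7": 0.9}
node_class = {"C1": "sky", "C2": "tree", "C3": "road", "C4": "car",
              "C5": "sky", "C6": "road", "C7": "road"}
labels = optimal_cover_labels(parent, cost, node_class, ["C1", "C2", "C3", "C4"])
```

With these toy costs, leaf C2 is more reliably explained by its parent C5, so it inherits the class of C5, mirroring the example on the slide.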

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground with it is suboptimal.

Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2].
- Training: the 2 networks are trained in isolation.
- Testing: results are combined.

Vector C:
- Training: as a whole (using segmentation overlap).
- Testing: results are combined (using the output of the penultimate layer).

Penultimate fully connected layer → SVM

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
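For reference, pixel IU for one category is the number of pixels labelled with that category in both maps divided by the number labelled with it in at least one. A minimal sketch on toy label maps (the helper name `pixel_iu` and the data are assumptions):

```python
def pixel_iu(pred, truth, category):
    """Intersection over union (Jaccard index) of one category between two label maps."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_p, in_t = p == category, t == category
            inter += in_p and in_t
            union += in_p or in_t
    return inter / union if union else float("nan")

# Toy 2x3 label maps (assumed values): 1 = object, 0 = background.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = pixel_iu(pred, truth, category=1)   # 2 shared pixels / 4 pixels in the union
```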

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

Xavier Giró-i-Nieto
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu


Detection Objects: R-CNN

Slide credit: Joost van de Weijer

Detection Objects: Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Slide credit: Amaia Salvador

The RoI pooling layer is the same as SPP [3], but with a single scale.

Detection Objects: Fast R-CNN

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects: Fast R-CNN

Slide credit: Amaia Salvador

RoI pooling on CONV5: a region of size h x w is divided into an H' x W' grid of pooling bins, each of size h/H' x w/W', and max pooling is applied within each bin.
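The binning described above can be sketched for a single feature-map channel. This is a simplified illustration, not the reference implementation: the RoI is given directly in feature-map coordinates, the bin boundaries use floor/ceil rounding (an assumption), and `roi_max_pool` is a hypothetical helper name. In a full network the same pooling is applied independently to every channel.

```python
import math

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the window roi=(top, left, h, w) of a 2-D map into an out_h x out_w grid."""
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [floor(i*h/H'), ceil((i+1)*h/H')) of the RoI.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

fmap = [[r * 6 + c for c in range(6)] for r in range(6)]   # toy 6x6 channel
out = roi_max_pool(fmap, roi=(1, 1, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the region, the output always has H' x W' values, which is what lets the fully connected layers that follow accept proposals of any shape.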

Detection Objects: Fast R-CNN

Slide credit: Amaia Salvador

Base networks: AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
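The slide only names the multi-task loss. A sketch of its usual form in the Fast R-CNN paper is a classification log loss plus a smooth-L1 box regression term that is only counted for non-background classes, L = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v). The function and argument names below are illustrative, and class 0 is assumed to be background.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss (skipped for background, class 0)."""
    loss_cls = -math.log(class_probs[true_class])
    if true_class == 0:
        return loss_cls
    loss_loc = sum(smooth_l1(t - v) for t, v in zip(box_pred, box_target))
    return loss_cls + lam * loss_loc

# Toy values (assumed): 2 classes, 4 box regression targets.
background = multi_task_loss([0.5, 0.5], 0, [0.0] * 4, [1.0] * 4)
foreground = multi_task_loss([0.5, 0.5], 1, [0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 2.0, 0.0])
```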

Detection Objects: Faster R-CNN

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation is the bottleneck in current state-of-the-art object detection systems. External proposal methods: Selective Search, CPMC, MCG.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN).

Detection Objects: Faster R-CNN

Slide credit: Amaia Salvador

Architecture (figure): Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores → Class probabilities

For each of the k anchors at every sliding-window position, the RPN outputs:
- Objectness scores (object / no object)
- Bounding box regression

In practice k = 9 (3 different scales and 3 aspect ratios).
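The k = 9 anchors at one position can be generated as in the sketch below. The default scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are assumed here as the values usually quoted for Faster R-CNN; `make_anchors` is a hypothetical helper, not code from the released implementations.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:           # r = height / width, area kept at s * s
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 300)   # 9 reference boxes around pixel (300, 300)
```

The RPN then predicts, for each of these reference boxes, an objectness score and a set of box regression offsets.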

Detection Objects: Faster R-CNN

Slide credit: Amaia Salvador

Architecture (figure): Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores → Class probabilities. The stages that consume the RPN proposals correspond to Fast R-CNN.

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).
Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).
Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).
Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection Objects: Faster R-CNN

Slide credit: Amaia Salvador

Results (tables in the original slides): Detection Accuracy (Pascal VOC) and Timing in ms (Pascal VOC)

Detection Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Slide credit: Míriam Bellver

The object is localized based on visual features from AlexNet FC6.

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)
- o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
- h = history of taken actions (binary vector, 90 dimensions)

Reward function R (figure: current box and ground-truth bounding box)

Reward function R for the trigger action: reward magnitude 3, minimum IoU 0.6

The reward function considers the number of steps as a cost.
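A hedged sketch of how such rewards could be computed from box overlaps. The trigger reward uses the two values on the slide (magnitude 3, minimum IoU 0.6); the reward for transformation actions (+1 if the IoU with the ground truth improved, -1 otherwise) is an assumption based on the cited paper, and all function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt):
    """Assumed rule for transformation actions: reward an improvement in IoU."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """Trigger action: +eta if the final box reaches the minimum IoU, -eta otherwise."""
    return eta if iou(box, gt) >= tau else -eta

gt = (0.0, 0.0, 10.0, 10.0)            # toy ground-truth box
far_box = (5.0, 5.0, 15.0, 15.0)       # IoU = 25 / 175
close_box = (0.0, 0.0, 10.0, 8.0)      # IoU = 0.8
```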

Detection Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning:
- The action-value function is estimated using a neural network that has as many output units as actions.
- The algorithm incorporates a replay memory to collect experiences.
- The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
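The greedy policy and the Q-learning target can be sketched with a stand-in for the network. Everything concrete here is an assumption for illustration: the action names, the linear `q_values` function that replaces the Q-network, the discount factor and the replay-memory size.

```python
from collections import deque

# Hypothetical action names: 8 box transformations plus the trigger.
ACTIONS = ["right", "left", "up", "down", "bigger", "smaller", "fatter", "taller", "trigger"]

def q_values(state, weights):
    """Stand-in for the Q-network: one linear output unit per action."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

def greedy_action(state, weights):
    """Policy: pick the action with the maximum estimated action value."""
    q = q_values(state, weights)
    return max(range(len(q)), key=q.__getitem__)

def q_target(reward, next_state, weights, gamma=0.9, terminal=False):
    """One-step Q-learning target r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(q_values(next_state, weights))

replay_memory = deque(maxlen=1000)                 # stores (s, a, r, s') experiences

state = [1.0, 0.0]                                 # toy 2-D state (assumed)
weights = [[0.1 * i, 0.0] for i in range(len(ACTIONS))]
action = greedy_action(state, weights)
replay_memory.append((state, action, 1.0, state))
```

During training, experiences sampled from the replay memory would be used to regress the network output for the taken action towards `q_target`.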

Detection Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Dataset for training and testing: PASCAL VOC

Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces: DDFD Train

Dataset: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Training samples are randomly sampled sub-windows (blocks): a sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
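The labelling rule for training windows can be sketched directly from the 50% IoU threshold. Boxes are assumed to be (x1, y1, x2, y2) tuples and the helper names and toy coordinates are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_positive(window, face_boxes, threshold=0.5):
    """Positive if the window overlaps any annotated face with IoU above the threshold."""
    return any(iou(window, face) > threshold for face in face_boxes)

faces = [(10, 10, 60, 60)]                            # one annotated face (toy values)
near = is_positive((15, 15, 65, 65), faces)           # IoU = 2025 / 2975, about 0.68
far = is_positive((40, 40, 90, 90), faces)            # IoU = 400 / 4600, about 0.09
```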

Detection Faces: DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
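Three scales per octave means that consecutive rescaling factors differ by 2^(1/3), so the image size doubles every three steps. A small sketch, where the scale range 0.5 to 2.0 is an arbitrary assumption:

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scales spaced by a factor of 2 ** (1 / per_octave) between min_scale and max_scale."""
    step = 2.0 ** (1.0 / per_octave)
    scales, s = [], min_scale
    while s <= max_scale * (1 + 1e-9):   # small tolerance for floating-point drift
        scales.append(s)
        s *= step
    return scales

scales = pyramid_scales(0.5, 2.0)   # two octaves: 0.5, 0.63, 0.79, 1.0, 1.26, 1.59, 2.0
```

A fixed-size detector run on every rescaled copy then responds to faces of different sizes in the original image.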

Detection Faces: DDFD Test

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
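The conversion can be illustrated at toy scale: a fully-connected unit trained on flattened k x k inputs is the same computation as a k x k convolution filter, and sliding that filter over a larger image produces one score per window position, i.e. a heat-map. The 2x2 size and the weight values are assumptions chosen to keep the example readable.

```python
def fc_as_conv(image, kernel, bias=0.0):
    """Slide a k x k kernel (the reshaped FC weights) over the image; 'valid' positions only."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[bias + sum(kernel[i][j] * image[y + i][x + j]
                        for i in range(k) for j in range(k))
             for x in range(out_w)]
            for y in range(out_h)]

fc_weights = [1.0, 0.0, 0.0, -1.0]                 # FC unit trained on flattened 2x2 inputs
kernel = [fc_weights[0:2], fc_weights[2:4]]        # same weights reshaped to a 2x2 filter

patch = [[3.0, 1.0], [2.0, 5.0]]
fc_output = sum(w * v for w, v in zip(fc_weights, [v for row in patch for v in row]))
assert fc_as_conv(patch, kernel) == [[fc_output]]  # identical on a training-sized input

large = [[3.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 4.0, 9.0]]
heat_map = fc_as_conv(large, kernel)               # 2x2 map of scores on a larger image
```

The convolutional form shares the computation between overlapping windows instead of re-running the network once per window.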

Detection Faces: DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
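A generic greedy NMS can be sketched as follows. This is a plain-Python illustration rather than the code from the cited post; the 0.5 IoU threshold and the toy detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs: keep the best box, drop boxes overlapping it."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept

detections = [((10, 10, 60, 60), 0.9),      # two strongly overlapping boxes on one face
              ((12, 12, 62, 62), 0.8),
              ((100, 100, 150, 150), 0.7)]  # a separate face
kept = non_maximum_suppression(detections)
```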

Detection Faces: DDFD Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition: FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
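Learning the embedding from triplets can be sketched with the triplet loss: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The 2-D toy embeddings are made up, and the margin of 0.2 is assumed as the value commonly reported for FaceNet.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative squared distances."""
    return max(0.0, squared_distance(anchor, positive)
                    - squared_distance(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [0.1, 0.0]           # toy embeddings of the same person
easy = triplet_loss(anchor, positive, [1.0, 0.0])   # negative already far away: loss 0
hard = triplet_loss(anchor, positive, [0.3, 0.0])   # negative inside the margin: loss > 0
```

Only triplets with a non-zero loss produce a gradient, which is why the choice of triplets matters for training.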

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Objects Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Pyramid of three spatial scales

The same parameters are used in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
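The last two steps (upsampling and concatenation, then a pixel-wise soft-max) can be sketched as below. The random feature maps stand in for the outputs of the three weight-sharing convnets; nearest-neighbour upsampling, the linear classifier and all names are assumptions for illustration.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a (D, H, W) map by an integer factor."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def pixelwise_softmax(features, weights, bias):
    """features: (D, H, W), weights: (K, D), bias: (K,) -> class probabilities (K, H, W)."""
    logits = np.einsum("kd,dhw->khw", weights, features) + bias[:, None, None]
    logits -= logits.max(axis=0, keepdims=True)     # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical outputs of the three convnets (one per pyramid scale)
scales = [rng.random((4, 8, 8)), rng.random((4, 4, 4)), rng.random((4, 2, 2))]
features = np.concatenate(
    [upsample_nearest(f, 8 // f.shape[1]) for f in scales], axis=0)
probs = pixelwise_softmax(features, rng.normal(size=(3, 12)), np.zeros(3))
labels = probs.argmax(axis=0)   # one category label per pixel
```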

Problem: no spatial consistency among labels

Three explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels (prediction with a 2-layer network)

Solution 2: Superpixels + CRF (prediction with a 2-layer network)

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
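A minimal sketch of the optimal-cover rule: for every leaf, walk up to the root and keep the node with the lowest cost S. The tree, the cost values and the function name are invented for illustration; they are chosen so that leaf C2 ends up covered by C5, as in the example above.

```python
def optimal_cover(parent, cost, leaves):
    """parent: node -> parent (root maps to None); cost: node -> cost S.

    Returns, for each leaf, the minimal-cost node on its path to the root.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best

# Hypothetical segmentation tree: root C7, internal nodes C5 and C6, leaves C1..C4
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.1, "C4": 0.3,
        "C5": 0.4, "C6": 0.5, "C7": 0.8}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```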

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: a single BBOX CNN produces feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); the two are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: a BBOX CNN produces feature vector 1 and a REGION CNN produces feature vector 2, concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.

Vector C: Training: as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).

Penultimate fully connected layer -> SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
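For reference, the pixel IU (Jaccard index) of one class is the intersection over the union of the predicted and ground-truth pixel sets for that class. A small sketch with made-up label maps; the function name is hypothetical:

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Per-class intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            ious.append(float("nan"))   # class absent in both maps
        else:
            ious.append(np.logical_and(p, g).sum() / union)
    return ious

pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
gt = np.array([[0, 1, 1],
               [1, 1, 2]])
ious = pixel_iu(pred, gt, 3)   # per-class IU: 0.5, 0.75, 1.0
```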

One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 21: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects: R-CNN

Detection Objects: Fast R-CNN

Girshick, Ross. Fast R-CNN. ICCV 2015.

Slide credit: Amaia Salvador

RoI pooling layer: same as SPP [3] but single scale

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. PAMI 2015.

Slide credit: Joost van de Weijer

RoI pooling on CONV5: an h x w RoI window is divided into an H' x W' grid. Size of pooling bins: h/H' x w/W'. Max pooling is applied within each bin.
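A minimal sketch of RoI max pooling, assuming the RoI is already expressed in feature-map coordinates. The bin boundaries (floor/ceil of evenly spaced edges) are one reasonable choice, not necessarily the one used in the reference implementation, and the function name is hypothetical.

```python
import numpy as np

def roi_max_pool(fmap, roi, out_h, out_w):
    """fmap: (C, H, W); roi = (y0, x0, y1, x1), end-exclusive, in feature-map cells.

    Returns a (C, out_h, out_w) array whatever the RoI size.
    """
    y0, x0, y1, x1 = roi
    region = fmap[:, y0:y1, x0:x1]
    h, w = region.shape[1:]
    ys = np.linspace(0, h, out_h + 1)   # bin edges, roughly h/H' apart
    xs = np.linspace(0, w, out_w + 1)   # bin edges, roughly w/W' apart
    out = np.empty((fmap.shape[0], out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)   # at least one row per bin
        for j in range(out_w):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

toy = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = roi_max_pool(toy, (0, 0, 4, 4), 2, 2)
```

The fixed output size is what lets RoIs of any shape feed the same fully connected layers.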

Base networks: AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
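The multi-task loss combines a classification term with a box-regression term that is only active for non-background RoIs. The sketch below follows the form given in the Fast R-CNN paper (log loss plus smooth L1); treating class 0 as background, the weight `lam` and the toy numbers are assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss + lam * localization loss (skipped for background)."""
    l_cls = -np.log(class_probs[true_class])
    if true_class >= 1:                       # class 0 is assumed to be background
        l_loc = smooth_l1(box_pred - box_target).sum()
    else:
        l_loc = 0.0
    return l_cls + lam * l_loc

probs = np.array([0.1, 0.7, 0.2])
box_pred = np.array([0.5, 0.0, 2.0, 0.0])
box_target = np.zeros(4)
loss_fg = multitask_loss(probs, 1, box_pred, box_target)
loss_bg = multitask_loss(probs, 0, box_pred, box_target)
```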

Detection Objects: Faster R-CNN

Ren, S., He, K., Girshick, R. and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T. & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.
CPMC: Carreira, J. & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.
MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F. & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).

[Figure: Faster R-CNN architecture. Conv layers -> Conv Layer 5 -> RPN -> RPN proposals -> RoI pooling layer -> FC layers -> class scores / class probabilities]

RPN outputs for each of the k anchors:

- Objectness scores (object / no object)
- Bounding box regression

In practice k = 9 (3 different scales and 3 aspect ratios).
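The k = 9 anchors can be generated as every combination of 3 scales and 3 aspect ratios. In this sketch the concrete scale and ratio values, the (x0, y0, x1, y1) box convention and the function name are assumptions chosen for illustration.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) anchors (x0, y0, x1, y1) centred at the origin.

    ratio = height / width; every anchor of a given scale has area scale**2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()
areas = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
```

At each sliding-window position the RPN then predicts 2 objectness scores and 4 box offsets per anchor.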

The detection stage that follows the RPN proposals (RoI pooling layer, FC layers, class scores) is Fast R-CNN.

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train RPN initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize RPN and train again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]

Detection Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]

Slide credit: Míriam Bellver

The object is localized based on visual features from AlexNet FC6.

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Set of states S: (o, h)

- o = feature vector from pre-trained CNN (fc6, 4096 dim)
- h = history of taken actions (binary vector, dim 90)

Reward function R: computed with respect to the ground-truth bounding box

Reward function R for the trigger action: reward value 3, minimum IoU 0.6

The reward function considers the number of steps as a cost.
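A sketch of the two rewards, assuming the form described in the Caicedo & Lazebnik paper: transformation actions are rewarded by the sign of the IoU change with respect to the ground-truth box, and the trigger is rewarded with +3 when the IoU reaches 0.6 and -3 otherwise. Box format and function names are hypothetical, and giving -1 when the IoU does not change is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth by at least tau, else -eta."""
    return eta if iou(box, gt) >= tau else -eta

gt_box = (0, 0, 10, 10)
```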

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
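The pieces named above can be sketched without the network itself: a greedy policy over the Q-values, an epsilon-greedy variant for exploration during training, the standard Q-learning target, and a replay memory. The number of actions, the discount factor and the buffer size are assumptions, and the Q-values are plain arrays standing in for the network outputs.

```python
import random
from collections import deque

import numpy as np

NUM_ACTIONS = 9   # assumed: one Q-network output unit per action

def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))

def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target: r + gamma * max_a' Q(s', a')."""
    if terminal:
        return float(reward)
    return float(reward) + gamma * float(np.max(next_q_values))

# Replay memory of (state, action, reward, next_state, terminal) experiences
replay = deque(maxlen=1000)
replay.append(("s0", 2, 1.0, "s1", False))

rng = random.Random(0)
q = np.array([0.1, 0.5, 0.2])
```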

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals

Detection Faces

Detection Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]

DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative
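The labelling rule for sampled sub-windows can be sketched as follows. The (x0, y0, x1, y1) box format and the function names are assumptions; the 50% threshold is the one stated above.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_positive(window, faces, threshold=0.5):
    """A sub-window is positive if it overlaps any annotated face by more than 50% IoU."""
    return any(iou(window, face) > threshold for face in faces)

faces = [(0, 0, 10, 10)]   # hypothetical annotation
```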

DDFD: Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
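"3 times per octave" means consecutive pyramid levels differ by a factor of 2^(1/3). A sketch of the scale list and of the window positions at one level; the scale range, the stride of 32 pixels and the function names are assumptions.

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scales from min_scale to max_scale, steps_per_octave levels per doubling."""
    n = int(math.floor(math.log2(max_scale / min_scale) * steps_per_octave + 1e-9)) + 1
    return [min_scale * 2 ** (i / steps_per_octave) for i in range(n)]

def sliding_windows(height, width, size=227, stride=32):
    """Yield (x0, y0, x1, y1) windows of size x size that fit inside the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)

scales = pyramid_scales(0.5, 2.0)             # two octaves
windows = list(sliding_windows(227, 259))     # toy image, one row of windows
```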

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
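Why the conversion works can be checked numerically: a fully connected layer trained on a k x k feature map is the same computation as a convolution with k x k kernels, and sliding that kernel over a larger map yields one score per position, i.e. a heat-map. The shapes and names below are toy assumptions, and the explicit loops favour clarity over speed.

```python
import numpy as np

def fc_as_conv(fmap, fc_weights, k):
    """Apply FC weights (out, C*k*k) as k x k kernels over fmap (C, H, W).

    Returns a heat-map of shape (out, H-k+1, W-k+1).
    """
    C, H, W = fmap.shape
    kernels = fc_weights.reshape(-1, C, k, k)
    out = np.empty((kernels.shape[0], H - k + 1, W - k + 1))
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            patch = fmap[:, y:y + k, x:x + k]
            out[:, y, x] = np.tensordot(kernels, patch, axes=3)
    return out

rng = np.random.default_rng(0)
fc = rng.normal(size=(2, 3 * 4 * 4))   # FC layer trained on 3 x 4 x 4 inputs
small = rng.random((3, 4, 4))          # input of the training size
big = rng.random((3, 6, 7))            # larger test input
heat = fc_as_conv(big, fc, 4)
```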

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
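A compact sketch of greedy NMS: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. This is a generic version written for illustration, not the code from the cited post; the IoU threshold of 0.5 is an assumption.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) as (x0, y0, x1, y1) with positive area. Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x0, y0, x1, y1 = boxes.T
    areas = (x1 - x0) * (y1 - y0)
    order = scores.argsort()[::-1]          # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        iw = np.maximum(0.0, np.minimum(x1[i], x1[rest]) - np.maximum(x0[i], x0[rest]))
        ih = np.maximum(0.0, np.minimum(y1[i], y1[rest]) - np.maximum(y0[i], y0[rest]))
        inter = iw * ih
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]   # discard heavily overlapping boxes
    return keep

detections = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(detections, [0.9, 0.8, 0.7])
```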

DDFD: Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts: Proposals, Detection, Recognition, Segmentation

Faces Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces -> FaceNet -> Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

The embedding is learned by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
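The triplet objective can be sketched as follows, using the form given in the FaceNet paper: the anchor must be closer to the positive (same identity) than to the negative (different identity) by at least a margin, measured in squared Euclidean distance. The margin value and the toy 2-D embeddings are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), per triplet."""
    d_pos = ((anchor - positive) ** 2).sum(axis=-1)
    d_neg = ((anchor - negative) ** 2).sum(axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])        # anchor embedding (unit norm)
same = np.array([1.0, 0.0])     # embedding of the same identity
other = np.array([0.0, 1.0])    # embedding of a different identity

easy = triplet_loss(a, same, other)   # constraint already satisfied, loss 0
hard = triplet_loss(a, other, same)   # violated constraint, positive loss
```

Triplets with zero loss do not contribute gradients, which is why the choice of triplets matters.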

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

FaceNet: Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

FaceNet: Software

Software implementation: OpenFace

Faces Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

Image database + visual query “A dog” -> expected outcome

Image database + visual query “This dog” -> expected outcome

Instance retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT) -> N-dimensional feature space -> Bag of Visual Words (high-dimensional, highly sparse)

v1 = (v11, …, v1n)
vk = (vk1, …, vkn)

INVERTED FILE

word | image IDs
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 23, 6, 10
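An inverted file maps each visual word to the images that contain it, so a query only touches images sharing at least one word with it. A minimal sketch; the toy image-to-words data, the vote-count ranking and the function names are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: image ID -> list of visual words. Returns word -> set of image IDs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def search(index, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical database: image ID -> visual words found in that image
database = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
inverted = build_inverted_file(database)
ranking = search(inverted, [1, 3])
```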

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.


Page 22: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Fast R-CNN

Girshick, Ross. "Fast R-CNN." ICCV 2015.

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Same as SPP [3], but single scale

Detection Objects Fast R-CNN

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit: Joost van de Weijer

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

[Figure: RoI pooling over CONV5. An h x w region of the feature map (height H) is max-pooled into an H' x W' grid. Size of pooling bins: h/H' x w/W'.]
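The pooling-bin rule above can be sketched in plain Python. This is a minimal illustration, not the Fast R-CNN implementation: the function name `roi_max_pool`, the `(top, left, h, w)` RoI convention and the floor/ceil bin boundaries are assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (H x W) of numbers.
    roi: (top, left, h, w) in feature-map cells (hypothetical convention).
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i spans about h / out_h rows: floor for the start, ceil for the end.
        r0 = top + (i * h) // out_h
        r1 = top + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (r, c) is 6*r + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(fmap, roi=(1, 1, 4, 4), out_h=2, out_w=2)
print(pooled)
```

Whatever the size of the region, the output grid has a fixed size, which is what lets the FC layers that follow accept proposals of any shape.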

Detection Objects Fast R-CNN

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Detection Objects Fast R-CNN

Slide credit Amaia Salvador

Multi-task loss
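As a hedged sketch of the multi-task loss, following the form given in the Fast R-CNN paper: a log loss on the true class plus a smooth-L1 box regression term that is switched off for background RoIs. The function names, the convention that class 0 is background, and the weight `lam` are illustrative assumptions.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers than L2)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification loss plus box-regression loss for one RoI.

    class_probs: softmax output over classes, index 0 = background (assumption).
    pred_box, target_box: 4 regression offsets each.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only the class term counts.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


background = multitask_loss([0.5, 0.5], 0, [0, 0, 0, 0], [0, 0, 0, 0])
foreground = multitask_loss([0.5, 0.5], 1, [0, 0, 0, 0], [0.5, 0, 0, 2])
print(background, foreground)
```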

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Replace the use of external object proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture. Conv layers (up to Conv Layer 5) feed the RPN, which outputs RPN proposals; the proposals go through the RoI pooling layer and the FC layers, which output class scores and class probabilities.]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
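The k = 9 anchors at one sliding-window position can be sketched as follows. The specific scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper and are used here as assumptions; `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x0, y0, x1, y1) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s*s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
print(len(anchors))
```

For every anchor the RPN predicts an objectness score and a bounding-box regression, so each position produces k scores and k box refinements.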

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture diagram (repeated), with a "Fast R-CNN" label.]

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Weights: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Weights: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Weights: from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Weights: from Steps 2 & 3 (fixed).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]

Object is localized based on visual features from AlexNet FC6

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Terminates the sequence of the current search

Marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6), 4096 dimensions

h = history of taken actions, binary vector of dimension 90

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward function R (computed against the ground-truth bounding box)

Reward function R for the trigger action

The reward function considers the number of steps as a cost

[Formula residue: the value 3 and a minimum IoU of 0.6]
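The formulas on these slides did not survive extraction, so the following is only a sketch of how such rewards are commonly written, assuming the form described in the Caicedo and Lazebnik paper: a move is rewarded by the sign of the change in IoU with the ground truth, and the trigger is rewarded or penalised by a fixed amount depending on a minimum IoU. The values 3 and 0.6 are the ones visible on the slide; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Sign of the IoU change caused by a transformation action."""
    diff = iou(new_box, gt_box) - iou(old_box, gt_box)
    return 1.0 if diff > 0 else (-1.0 if diff < 0 else 0.0)


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps the ground truth enough, -eta otherwise."""
    return eta if iou(box, gt_box) >= min_iou else -eta


gt = (0, 0, 10, 10)
r_move = move_reward((5, 5, 15, 15), (2, 2, 12, 12), gt)
r_bad_stop = trigger_reward((2, 2, 12, 12), gt)
r_good_stop = trigger_reward((0, 0, 10, 9), gt)
print(r_move, r_bad_stop, r_good_stop)
```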

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
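A minimal sketch of the two ingredients named above, a replay memory and a greedy policy over the Q-network outputs. The class and function names are hypothetical, and the Q-values are passed in as a plain list rather than computed by a network.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest experience when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch, which decorrelates consecutive steps of one search."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Index of the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1)
best = greedy_action([0.1, 0.7, 0.2])
print(len(memory.buffer), best)
```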

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Sub-windows (blocks) are randomly sampled. A sample is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
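Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A small sketch, with a hypothetical helper name and an arbitrary scale range chosen for the example:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    factor = 2.0 ** (1.0 / steps_per_octave)
    scales = []
    k = 0
    # The small tolerance keeps max_scale itself despite floating-point rounding.
    while min_scale * factor ** k <= max_scale * (1.0 + 1e-9):
        scales.append(min_scale * factor ** k)
        k += 1
    return scales


scales = pyramid_scales(0.5, 2.0)
print([round(s, 3) for s in scales])
```

The fixed-size detector window is then run on every rescaled copy, so a face that is too large at the original resolution fits the window at a smaller scale.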

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
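A minimal greedy NMS sketch in plain Python. It is not the code from the cited tutorial or from DDFD; the 0.5 overlap threshold and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already-kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```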

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well chosen triplets using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
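The triplet idea can be sketched with the triplet loss from the FaceNet paper: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The margin value 0.2 is the one reported in the paper and is used here as an assumption; the toy 2-D embeddings and function names are purely illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])   # negative already far away
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])   # negative as close as positive
print(easy, hard)
```

Easy triplets contribute no gradient, which is why the choice of triplets matters for training.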

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

[Figure: visual query "A dog" against an image database, with the expected outcome]

[Figure: visual query "This dog" against an image database, with the expected outcome]

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word    Image ID
1       1, 12
2       1, 30, 102
3       10, 12
4       23, 6, 10
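Because a bag of visual words is highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. A minimal sketch with made-up word and image IDs; the function names and the vote-counting ranking are assumptions, not the indexing scheme of any specific system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by how many of the query's visual words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for determinism.
    return sorted(votes, key=lambda i: (-votes[i], i))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 10: [2, 3], 12: [1, 3]}
index = build_inverted_file(database)
ranking = search(index, query_words=[2, 3])
print(ranking)
```

Only the lists for the words present in the query are visited, so search cost depends on the query's words rather than on the vocabulary size.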

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
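Sum or max pooling collapses each channel of a convolutional feature map into a single number, giving one global descriptor per image with as many dimensions as channels. A toy sketch in plain Python; the function name and the nested-list layout (channels, then rows, then columns) are assumptions.

```python
def global_pool(conv_maps, mode="sum"):
    """Pool every H x W channel of a C x H x W feature map down to one value."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in conv_maps]


# Two channels of a 2x2 feature map.
conv = [[[1, 2], [3, 4]],
        [[0, 5], [1, 1]]]
sum_descriptor = global_pool(conv, mode="sum")
max_descriptor = global_pool(conv, mode="max")
print(sum_descriptor, max_descriptor)
```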

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

[Figure: image at (336x256) resolution; conv5_1 from VGG16 [1] gives a (42x32) feature map; 25K centroids; 25K-D vector]

Query Representation: Global Search (GS), Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
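A pixel-wise soft-max turns the class scores at each pixel into class probabilities independently of its neighbours, which is also why the raw labels lack spatial consistency. A small sketch with hypothetical names and a 1x2 toy image with two classes:

```python
import math


def pixelwise_softmax(scores):
    """scores[y][x] is a list of per-class scores; returns per-pixel probabilities."""
    probs = []
    for row in scores:
        out_row = []
        for logits in row:
            m = max(logits)                       # subtract the max for stability
            exps = [math.exp(v - m) for v in logits]
            z = sum(exps)
            out_row.append([e / z for e in exps])
        probs.append(out_row)
    return probs


def label_map(probs):
    """Most probable class index at every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]


scores = [[[2.0, 0.0], [0.0, 1.0]]]
probs = pixelwise_softmax(scores)
labels = label_map(probs)
print(labels)
```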

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
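The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost. The tree, the costs and the names below are invented for illustration; they are chosen so that leaf C2 ends up covered by C5, as in the example on the slide.

```python
def optimal_cover(parent, cost, leaves):
    """For every leaf, pick the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)   # None once we step past the root
        best[leaf] = best_node
    return best


# Hypothetical tree: C1 and C2 merge into C5; C5 and C3 merge into the root C6.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C5": 0.3, "C6": 0.5}
cover = optimal_cover(parent, cost, leaves=["C1", "C2", "C3"])
print(cover)
```

Each leaf then takes the class predicted for its selected component, so different pixels can be labelled at different observation levels.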

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the bounding box and the region (background masked out with the mean image) both go through the BBOX CNN, giving feature vectors 1 and 2, which are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: a BBOX CNN gives feature vector 1 and a REGION CNN gives feature vector 2, concatenated as [1 2].

- Training: the 2 networks are trained in isolation
- Testing: results are combined

Vector C:

- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

Penultimate fully connected layer -> SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
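Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on toy label maps; the function name and the nested-list layout are assumptions.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # Undefined when the category is absent from both maps.
    return inter / union if union else float("nan")


pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [0, 0]]
iu_object = pixel_iu(pred, truth, category=1)
iu_background = pixel_iu(pred, truth, category=0)
print(iu_object, iu_background)
```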

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 23: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

23

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

24

Slide credit Amaia Salvador

Same as SPP[3] but single scale

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

25

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

H

h

w

h

w

Size of pooling binsh Hrsquo x w Wrsquo

wWrsquo

hHrsquomax pooling

CONV5

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

29

Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Object Proposal computation is the bottleneck in current state of the art object detection systems

Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

Replace the use of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).

[Figure: Faster R-CNN architecture. Labels: conv layers, conv layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities.]


Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

For each of the k anchors at a location, the RPN outputs:

- Objectness scores (object / no object)
- Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios).
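A short sketch of how k = 9 anchors could be generated at one location. The specific scale and ratio values below are assumptions chosen for illustration (the slide only states 3 scales and 3 aspect ratios), and `make_anchors` is a hypothetical helper name.

```python
import numpy as np

def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on one location.

    scales are square-root areas in pixels and ratios are height / width;
    both value sets are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        area = float(scale) ** 2
        for ratio in ratios:
            width = np.sqrt(area / ratio)   # keeps width * height == area
            height = width * ratio
            anchors.append([center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2])
    return np.array(anchors)

anchors = make_anchors(300, 200)   # 9 boxes, all centred on (300, 200)
```

Each anchor then receives one objectness score pair and one set of four box-regression offsets from the RPN.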


Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN architecture with the Fast R-CNN part highlighted. Labels: conv layers, conv layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities.]

4-step training to share features between the RPN and Fast R-CNN.

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

[Tables not recovered: detection accuracy (Pascal VOC); timing in ms (Pascal VOC).]


Detection: Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Detection: Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

- o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
- h = history of taken actions (binary vector, 90 dimensions)
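A minimal sketch of assembling the state (o, h) as one vector. The slide gives only the sizes (4096 and 90); the split of the 90 history dimensions into 10 past actions times 9 one-hot entries is an assumption made here, and `build_state` is a hypothetical helper.

```python
import numpy as np

NUM_ACTIONS = 9    # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumption: 10 past actions, so 9 * 10 = 90 dimensions

def build_state(fc6_features, past_actions):
    """Concatenate o (4096-d CNN features) with h (90-d one-hot action history)."""
    history = np.zeros((HISTORY_LEN, NUM_ACTIONS), dtype=np.float32)
    # Keep only the most recent actions; one row per remembered action.
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        history[slot, action] = 1.0
    return np.concatenate([fc6_features.astype(np.float32), history.ravel()])

# Dummy features stand in for a real fc6 activation vector.
state = build_state(np.zeros(4096), past_actions=[0, 3, 8])
```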

Detection: Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Reward function R

[Figure labels: ground truth, bounding box.]

Reward function R for the trigger action: magnitude 3, minimum IoU 0.6.

The reward function considers the number of steps as a cost.
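The reward can be sketched in a few lines. The trigger reward uses the two values on the slide (3 and a minimum IoU of 0.6). The reward for transformation actions is an assumption here (+1 when the move improves IoU with the ground truth, -1 otherwise), since the slide shows it only as a figure. Function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def transformation_reward(old_box, new_box, ground_truth):
    """Assumed form: +1 if the move improved IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, ground_truth) > iou(old_box, ground_truth) else -1.0

def trigger_reward(box, ground_truth, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps the ground truth enough, -eta otherwise."""
    return eta if iou(box, ground_truth) >= min_iou else -eta

ground_truth = (0, 0, 10, 10)
good_stop = trigger_reward((0, 0, 10, 10), ground_truth)   # perfect overlap
bad_stop = trigger_reward((0, 0, 5, 10), ground_truth)     # IoU 0.5, below 0.6
```

Because every step that does not improve the overlap is penalized, long wandering searches accumulate negative reward, which is one way to read the "number of steps as a cost" remark.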

Detection: Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning:

- The action-value function is estimated using a neural network that has as many output units as actions.
- The algorithm incorporates a replay memory to collect experiences.
- Category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
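The pieces named above (replay memory, greedy policy, Q-learning target) can be sketched without any deep learning framework. The Q-network itself is left out; `q_values` stands for its output vector, and the discount factor is an illustrative assumption.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning regression target: r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * float(np.max(next_q_values))

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1)
```

Sampling past experiences at random, rather than training on consecutive steps, is the usual reason for keeping such a memory.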

Detection: Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals.

Detection: Faces

Detection: Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]

Detection: Faces: DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz.

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Sub-windows (blocks) are randomly sampled. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.

Total samples: 200K positive and 20M negative.

Detection: Faces: DDFD: Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
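"3 times per octave" means that three rescaling steps double (or halve) the image size, so consecutive scales differ by a factor of 2^(1/3). A small sketch, where the scale range 0.5 to 2.0 is an assumed example and `pyramid_scales` is a hypothetical helper:

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2 ** (k / steps_per_octave) that fall inside [min_scale, max_scale]."""
    k_min = math.ceil(steps_per_octave * math.log2(min_scale))
    k_max = math.floor(steps_per_octave * math.log2(max_scale))
    return [2.0 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]

# Two octaves (0.5x to 2x) at 3 steps per octave give 7 scales.
scales = pyramid_scales(0.5, 2.0)
```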

Detection: Faces: DDFD: Test

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
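The equivalence between a fully connected layer and a convolution can be checked numerically. The sketch below assumes a single-channel input and a toy 4x4 window instead of 227x227; `fc_as_convolution` is a hypothetical helper, not code from DDFD.

```python
import numpy as np

def fc_as_convolution(image, fc_weights, fc_bias, window):
    """Apply an FC layer trained on window x window inputs at every
    position of a larger single-channel image (stride 1)."""
    num_outputs = fc_weights.shape[0]
    # Each row of FC weights becomes one window x window convolution kernel.
    kernels = fc_weights.reshape(num_outputs, window, window)
    out_h = image.shape[0] - window + 1
    out_w = image.shape[1] - window + 1
    heat_map = np.empty((num_outputs, out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + window, x:x + window]
            heat_map[:, y, x] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1])) + fc_bias
    return heat_map

rng = np.random.default_rng(0)
weights, bias = rng.normal(size=(2, 16)), rng.normal(size=2)
image = rng.normal(size=(6, 7))
heat_map = fc_as_convolution(image, weights, bias, window=4)

# Same result as cropping one window and running the original FC layer on it.
direct = weights @ image[1:5, 2:6].ravel() + bias
```

Every position of the heat-map equals what the original classifier would output on the corresponding crop, but shared computation makes the dense evaluation much cheaper than cropping each window separately in a real convnet.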

Detection: Faces: DDFD: Test

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
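A compact sketch of greedy NMS in NumPy. It is a generic version of the technique, not the code from the cited post, and the overlap threshold of 0.3 is an assumed example value.

```python
import numpy as np

def non_maximum_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS. boxes: (N, 4) rows of (x1, y1, x2, y2). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Overlap of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]  # drop boxes that overlap too much
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
kept = non_maximum_suppression(boxes, scores=[0.9, 0.8, 0.7])
```

In this example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box survives.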

Detection: Faces: DDFD: Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag.]

Faces: Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning) by means of well-chosen triplets, using curriculum learning.

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41-48. ACM, 2009.
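Learning from triplets can be sketched as a loss over (anchor, positive, negative) embeddings: the anchor should be closer to a face of the same identity than to a face of a different identity by some margin. The margin value below is an assumption for illustration, and the sketch computes only the loss, not the triplet selection or the training loop.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean of max(||a - p||^2 - ||a - n||^2 + margin, 0) over a batch."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()

a = l2_normalize(np.array([[2.0, 0.0]]))
same = l2_normalize(np.array([[3.0, 0.0]]))    # same direction as the anchor
other = l2_normalize(np.array([[0.0, 5.0]]))   # orthogonal to the anchor

easy = triplet_loss(a, same, other)   # already satisfied, no loss
hard = triplet_loss(a, other, same)   # violated triplet, positive loss
```

Triplets that already satisfy the margin contribute nothing to the loss, which is why the choice of triplets matters for training.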

Faces: Recognition: FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

Faces: Recognition: FaceNet: Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces: Recognition: FaceNet: Software

Software implementation: OpenFace

Faces: Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects: Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N. and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.

[Figure: image database, visual query "A dog", expected outcome.]

[Figure: image database, visual query "This dog", expected outcome.]

Instance retrieval (instance: object, building, person, place…)

Objects: Recognition: Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse.

INVERTED FILE

[Table not recovered: columns "word" and "Image ID".]
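An inverted file exploits the sparsity of bag-of-visual-words vectors: instead of storing one long vector per image, it stores, for each visual word, the IDs of the images that contain it. A minimal sketch with made-up image IDs and word IDs (the values are not those of the slide's table):

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: {image_id: iterable of visual word ids}.
    Returns {word: sorted list of image ids containing that word}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}

def candidate_images(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))

# Hypothetical database: three images described by visual word ids.
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2]})
ranking = candidate_images(index, query_words=[1, 2])
```

Only the images listed under the query's words are ever touched, so the cost of a query depends on the length of those lists rather than on the size of the database.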

Objects: Recognition: Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

FC layers as global feature representation.

Babenko, A. & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R. & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C. & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Sum/max pooled conv features as global representation.
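Sum or max pooling of conv features collapses the spatial dimensions of a conv layer so that each channel contributes one number to the image descriptor. In the sketch below the L2 normalization is an added assumption (a common choice before comparing descriptors), and the tiny 2-channel activation tensor is made up.

```python
import numpy as np

def global_descriptor(conv_features, mode="sum"):
    """conv_features: (channels, height, width) activations of one conv layer.
    Returns one L2-normalized value per channel."""
    if mode == "sum":
        pooled = conv_features.sum(axis=(1, 2))
    elif mode == "max":
        pooled = conv_features.max(axis=(1, 2))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

features = np.array([[[1.0, 2.0], [3.0, 4.0]],
                     [[0.0, 0.0], [0.0, 3.0]]])
max_descriptor = global_descriptor(features, mode="max")   # from per-channel maxima 4 and 3
sum_descriptor = global_descriptor(features, mode="sum")   # from per-channel sums 10 and 3
```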

Objects: Recognition: Retrieval

Ng, J., Yang, F. & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation.

- Image resolution: 336x256
- Features: conv5_1 from VGG16 [1], spatial size 42x32
- 25K centroids, giving a 25K-D vector
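Encoding local conv features as a bag of words amounts to assigning the feature vector at each spatial location to its nearest centroid and counting the assignments. The sketch uses a toy 2-channel 2x2 feature map and 2 centroids instead of a 42x32 map and 25K centroids; with the real sizes the brute-force distance matrix used here would be too large, so treat it purely as an illustration. `bow_encode` is a hypothetical helper.

```python
import numpy as np

def bow_encode(conv_features, centroids):
    """conv_features: (channels, h, w); centroids: (k, channels).
    Returns the (h, w) map of visual word ids and the k-bin histogram."""
    channels, h, w = conv_features.shape
    local = conv_features.reshape(channels, h * w).T   # one descriptor per location
    dists = ((local[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)                       # nearest centroid per location
    histogram = np.bincount(words, minlength=len(centroids))
    return words.reshape(h, w), histogram

conv_features = np.array([[[0.0, 9.0], [1.0, 10.0]],
                          [[0.0, 9.0], [1.0, 10.0]]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
assignment_map, histogram = bow_encode(conv_features, centroids)
```

Keeping the assignment map, and not only the histogram, preserves where each visual word occurs in the image.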

Objects: Recognition: Retrieval

Query representation:

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects: Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects: Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

- Pyramid of three spatial scales
- The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
- Non-linearity: tanh. Pooling: max
- Upsampling and concatenation
- Pixel-wise soft-max classifier

Objects: Segmentation: Farabet

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level. BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
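The selection rule for the optimal component can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the node names and the cost values below are made up for illustration; how the cost S is computed is outside the scope of this sketch.

```python
def optimal_components(parent, cost, leaves):
    """parent: {node: parent node, or None for the root}; cost: {node: cost S}.
    Returns {leaf: node with minimal cost on the path from the leaf to the root}."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best

# Hypothetical tree: leaves "a" and "b" share the parent "c", which is the root.
parent = {"a": "c", "b": "c", "c": None}
cost = {"a": 0.5, "b": 0.1, "c": 0.3}
cover = optimal_components(parent, cost, leaves=["a", "b"])
```

Here leaf "a" is best explained by its parent "c", while leaf "b" keeps its own component, so different pixels can end up being labelled at different levels of the tree.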

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014).

Interest in obtaining segments, not just bounding boxes.

Multiscale combinatorial grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN is applied twice. Feature vector 1 comes from the bounding box and feature vector 2 from the region with the background masked out with the mean image; both are concatenated as [1 2]. The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: BBOX CNN (feature vector 1) and REGION CNN (feature vector 2), concatenated as [1 2].

- Training: the 2 networks are trained in isolation.
- Testing: results are combined.

Vector C:

- Training: as a whole (using segmentation overlap).
- Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
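Pixel IU for one category is the Jaccard index between the set of pixels predicted as that category and the set of pixels annotated as that category. A minimal sketch on two tiny label maps (the label values are made up):

```python
import numpy as np

def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in both maps."""
    pred_mask = predicted == category
    gt_mask = ground_truth == category
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return float("nan")   # category absent from both maps
    return np.logical_and(pred_mask, gt_mask).sum() / union

predicted = np.array([[1, 1], [0, 0]])
ground_truth = np.array([[1, 0], [0, 0]])
iu_object = pixel_iu(predicted, ground_truth, category=1)       # 1 shared pixel out of 2
iu_background = pixel_iu(predicted, ground_truth, category=0)   # 2 shared pixels out of 3
```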

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 24: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

24

Slide credit Amaia Salvador

Same as SPP[3] but single scale

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

25

He Kaiming Xiangyu Zhang Shaoqing Ren and Jian Sun Spatial pyramid pooling in deep convolutional networks for visual recognition PAMI 2015

Slide credit Joost van de Weijer

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

H

h

w

h

w

Size of pooling binsh Hrsquo x w Wrsquo

wWrsquo

hHrsquomax pooling

CONV5

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

29

Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Object Proposal computation is the bottleneck in current state of the art object detection systems

Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores(objectno object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

4-step training to share features for RPN and Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 1 Train RPN initialized with an ImageNet pre-trained model

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 1)

Class probabilities

Step 2 Train Fast R-CNN with learned RPN proposals

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement Learning

48

Slide credit Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement Learning

49

Slide credit Míriam Bellver

Set of actions A

Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Detection Objects Reinforcement Learning

50

Slide credit Míriam Bellver

Set of states S

Each state is a pair (o, h):

o = feature vector from the fc6 layer of a pre-trained CNN (4096 dimensions)

h = history of taken actions, encoded as a binary vector (90 dimensions)
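A minimal sketch of how such a state vector could be assembled. The split of the 90 history dimensions into 10 past actions of 9 possible actions each is an assumption made here for illustration (the slide only gives the total), and `build_state` is a hypothetical helper name.

```python
import numpy as np

NUM_ACTIONS = 9      # assumption: number of possible actions
HISTORY_LENGTH = 10  # assumption: remembered past actions, so 9 * 10 = 90


def build_state(o, past_actions):
    """Concatenate the fc6 descriptor o with a one-hot history of past actions."""
    h = np.zeros((HISTORY_LENGTH, NUM_ACTIONS), dtype=np.float32)
    # Only the most recent HISTORY_LENGTH actions are encoded.
    for slot, action in enumerate(past_actions[-HISTORY_LENGTH:]):
        h[slot, action] = 1.0
    return np.concatenate([np.asarray(o, dtype=np.float32), h.ravel()])


state = build_state(np.zeros(4096), past_actions=[0, 3, 8])
print(state.shape)  # 4096 feature dimensions followed by 90 history dimensions
```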

Detection Objects Reinforcement Learning

51

Slide credit Míriam Bellver

Reward function R

[Figure label: ground-truth bounding box]

Detection Objects Reinforcement Learning

52

Slide credit Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

Trigger reward magnitude: 3

Minimum IoU: 0.6
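The two rewards can be sketched as follows. The slide formulas were images and are not in the transcript, so the sign-based reward for transformation actions is an assumption based on the cited paper as recalled here; only the values 3 and 0.6 come from the slide. Boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


ETA = 3.0  # trigger reward magnitude (from the slide)
TAU = 0.6  # minimum IoU for a successful trigger (from the slide)


def move_reward(old_box, new_box, gt):
    """Assumed form: +1 if the transformation improved the IoU, otherwise -1."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0


def trigger_reward(box, gt):
    """+ETA when the final box overlaps the ground truth enough, otherwise -ETA."""
    return ETA if iou(box, gt) >= TAU else -ETA
```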

Detection Objects Reinforcement Learning

53

Slide credit Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

The policy is learned with reinforcement learning, using Q-learning.

Detection Objects Reinforcement Learning

54

Slide credit Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
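A minimal sketch of the greedy policy and the replay memory. The epsilon-greedy exploration option and the memory capacity are assumptions added for illustration; `q_values` stands for the outputs of the Q-network, which is not modelled here.

```python
import random
from collections import deque

import numpy as np

# Assumption: the capacity is arbitrary and chosen only for this example.
replay_memory = deque(maxlen=1000)


def select_action(q_values, epsilon=0.0, rng=random):
    """Explore with probability epsilon, otherwise take the action with maximum Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))


def remember(state, action, reward, next_state):
    """Store one experience tuple for later Q-learning updates."""
    replay_memory.append((state, action, reward, next_state))
```

With `epsilon=0.0` this reduces to the policy stated on the slide: always pick the action with the highest estimated value.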

Detection Objects Reinforcement Learning

55

Slide credit Míriam Bellver

Detection Objects Reinforcement Learning

56

Slide credit Míriam Bellver

Dataset for training and testing: PASCAL VOC

Two modes of evaluation

1) All attended regions (AAR)

2) Terminal regions (TR)

Detection Objects Reinforcement Learning

57

Slide credit Míriam Bellver

Best performance with few region proposals

Detection Objects Reinforcement Learning

58

Slide credit Míriam Bellver

Detection Objects Reinforcement Learning

59

Slide credit Míriam Bellver

Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Detection Faces DDFD Train

63

Sub-windows (blocks) are sampled at random. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
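"Three times per octave" means that consecutive scales differ by a factor of 2^(1/3). A small sketch of the resulting scale factors; the number of octaves below and above the original size is an assumption chosen for the example, and `pyramid_scales` is a hypothetical helper.

```python
def pyramid_scales(octaves_down=2, octaves_up=1, steps_per_octave=3):
    """Scale factors spaced by 2**(1/steps_per_octave), from smallest to largest."""
    start = -octaves_down * steps_per_octave
    stop = octaves_up * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(start, stop + 1)]


scales = pyramid_scales()
print([round(s, 3) for s in scales])
```

Each test image would be resized by every factor in `scales` before the detector is run, so a fixed-size detector can respond to faces of different sizes.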

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
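The conversion can be illustrated with NumPy: the weights of a fully-connected layer are reshaped into convolution kernels that cover the original input window, and the same kernels are then slid over a larger feature map. This is a didactic sketch with explicit loops, not the DDFD implementation; `fc_as_conv` and its argument layout are assumptions.

```python
import numpy as np


def fc_as_conv(feature_map, fc_weights, fc_bias, window):
    """Apply an FC layer as a convolution.

    feature_map: (C, H, W) array; fc_weights: (K, C * window * window); fc_bias: (K,).
    Returns a (K, H - window + 1, W - window + 1) map of FC outputs.
    """
    C, H, W = feature_map.shape
    K = fc_weights.shape[0]
    kernels = fc_weights.reshape(K, C, window, window)
    out = np.empty((K, H - window + 1, W - window + 1))
    for i in range(H - window + 1):
        for j in range(W - window + 1):
            patch = feature_map[:, i:i + window, j:j + window]
            # Dot product of every kernel with the current window.
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2])) + fc_bias
    return out
```

On an input of exactly the original window size the output is 1x1 and equals the FC layer's output; on a larger input every spatial position holds the FC response for the window at that position.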

Detection Faces DDFD Test

67

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat map of the face detector

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapping detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
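A common greedy form of NMS is sketched below: keep the highest-scoring box, discard the boxes that overlap it too much, and repeat. The IoU threshold of 0.5 is an assumption, and this is not necessarily the exact variant used by DDFD or by the cited tutorial. Boxes are `(x1, y1, x2, y2)` with positive area.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]
    return keep
```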

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Faces Recognition FaceNet

72

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

73

[Diagram: Faces -> FaceNet -> Euclidean space where distances correspond to face similarity]

Faces Recognition FaceNet

74

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

75

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
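Learning from triplets (anchor, positive, negative) is usually written as a hinge on squared distances: the anchor should be closer to the positive than to the negative by some margin. A NumPy sketch follows; the margin value of 0.2 is an assumption, and the triplet selection strategy itself is not shown.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean over the batch of max(0, ||a - p||^2 - ||a - n||^2 + margin).

    All inputs are (batch, dim) arrays of embeddings.
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_ap - d_an + margin)))
```

A triplet that already satisfies the margin contributes zero loss, which is why the choice of triplets matters for training.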

Faces Recognition FaceNet

76

Faces Recognition FaceNet

77

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

78

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet

79

Faces Recognition FaceNet Test

80

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

81

Software implementation: OpenFace

Faces Recognition VGG Face

82

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

83

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

84

Image Database

Visual Query: "A dog"

Expected outcome

Objects Recognition Retrieval

85

Image Database

Visual Query: "This dog"

Expected outcome

Objects Recognition Retrieval

86

Instance Retrieval (instance: object, building, person, place, ...)

Objects Recognition Retrieval

87

v1 = (v11, ..., v1n)

vk = (vk1, ..., vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse
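Because bag-of-visual-words vectors are highly sparse, they are typically stored as an inverted file that maps each visual word to the images containing it. A minimal sketch with made-up image IDs and words; the scoring by shared-word count is a simplification chosen for the example, and both function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def query(index, query_words):
    """Rank images by the number of distinct visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data only: image ID -> visual words found in that image.
index = build_inverted_file({1: [1, 2], 10: [3, 4], 12: [1, 3]})
print(query(index, [1, 3]))
```

Only the posting lists of the query's words are visited, so the cost depends on how many images share words with the query rather than on the size of the database.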

Objects Recognition Retrieval

88

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

89

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

90

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max-pooled conv features as global representation
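Sum or max pooling collapses the spatial dimensions of a conv feature map into one value per channel, giving a compact global descriptor. A NumPy sketch follows; the final L2 normalization is an assumption reflecting common practice, not something stated on the slide, and `global_descriptor` is a hypothetical name.

```python
import numpy as np


def global_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) conv feature map over space and L2-normalize the result."""
    flat = conv_features.reshape(conv_features.shape[0], -1)
    pooled = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```

Two images can then be compared with a dot product of their descriptors, regardless of their original resolution.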

Objects Recognition Retrieval

91

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Objects Recognition Retrieval

92

Objects Recognition Retrieval

93

Resolution: 336x256

conv5_1 from VGG16 [1] (42x32)

25K centroids -> 25K-D vector

Objects Recognition Retrieval

94

Query Representation

Global Search (GS)

Local Search (LS)

Objects Recognition Retrieval

95

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

98

Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters are used in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Objects Segmentation Farabet

103

Problem: no spatial consistency among labels

3 explored solutions

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
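The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the smallest cost. The tree, the node names and the cost values below are made up for illustration; in the paper the cost S comes from the class predictions for each component.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on its path to the root.

    parent maps a node to its parent (None or missing for the root);
    cost maps every node to its cost S.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)
        best[leaf] = best_node
    return best


# Illustrative tree: pixels p1, p2 belong to C2; C2 and p3 belong to the root C5.
parent = {"p1": "C2", "p2": "C2", "C2": "C5", "p3": "C5", "C5": None}
cost = {"p1": 0.9, "p2": 0.8, "C2": 0.5, "p3": 0.7, "C5": 0.2}
print(optimal_cover(parent, cost, ["p1", "p2", "p3"]))
```

In this made-up example the root C5 has the lowest cost on every path, so all pixels, including those inside C2, take the class of C5.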

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

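[Diagram: the bounding box goes through the BBOX CNN to give feature vector 1; the region, with its background masked out with the mean image, goes through the BBOX CNN to give feature vector 2; both are concatenated as [1 2] (vector A)]

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.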

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Diagram: BBOX CNN -> feature vector 1; REGION CNN -> feature vector 2; concatenated as [1 2] (vector B)]

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

[Diagram label: vector C]

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index), used to evaluate semantic segmentation

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
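Pixel IU for a class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. A small NumPy sketch, assuming integer label maps of equal shape; `pixel_iu` is a hypothetical helper, and benchmark-specific details such as ignored pixels are left out.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class pixel IU (Jaccard index) between two integer label maps."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    ius = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A class absent from both maps has no defined IU.
        ius.append(inter / union if union > 0 else float("nan"))
    return ius
```

Averaging the per-class values (ignoring undefined ones) gives the mean IU usually reported for semantic segmentation.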

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

119

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 25: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Fast R-CNN

25

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015.

Slide credit Joost van de Weijer

Detection Objects Fast R-CNN

26

Slide credit Amaia Salvador

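[Diagram: a region of interest of size h x w on the CONV5 feature map is divided into a grid of H' x W' bins, and max pooling is applied inside each bin]

Size of pooling bins: h/H' x w/W'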
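A minimal NumPy sketch of this RoI max pooling. The rounding of bin edges is an assumption chosen for the example (implementations differ in how they round), the region is assumed to be at least as large as the output grid, and `roi_max_pool` is a hypothetical name.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the RoI (y0, x0, h, w) of a (C, H, W) map into (C, out_h, out_w)."""
    y0, x0, h, w = roi
    C = feature_map.shape[0]
    out = np.empty((C, out_h, out_w), dtype=feature_map.dtype)
    # Bin edges: each bin covers roughly h/out_h rows and w/out_w columns.
    ys = np.linspace(0, h, out_h + 1)
    xs = np.linspace(0, w, out_w + 1)
    for i in range(out_h):
        r0 = y0 + int(np.floor(ys[i]))
        r1 = y0 + max(int(np.ceil(ys[i + 1])), int(np.floor(ys[i])) + 1)
        for j in range(out_w):
            c0 = x0 + int(np.floor(xs[j]))
            c1 = x0 + max(int(np.ceil(xs[j + 1])), int(np.floor(xs[j])) + 1)
            out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Whatever the size of the region, the output has a fixed size, which is what allows the fully-connected layers to follow.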

Detection Objects Fast R-CNN

27

Slide credit Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Detection Objects Fast R-CNN

28

Slide credit Amaia Salvador

Multi-task loss
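The slide only names the loss. The sketch below follows the formulation recalled from the Fast R-CNN paper, a classification log loss plus a smooth L1 box-regression loss that is applied only to non-background classes, so treat the exact form and the weight `lam` as assumptions.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """L = L_cls + lam * [true_class >= 1] * L_loc, with class 0 as background."""
    l_cls = -np.log(class_probs[true_class])
    l_loc = smooth_l1(np.asarray(box_pred, dtype=float) - np.asarray(box_target, dtype=float)).sum()
    return float(l_cls + (lam * l_loc if true_class >= 1 else 0.0))
```

For a background RoI (class 0) the box term is switched off, so only the classification loss is paid.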

Detection Objects Faster R-CNN

29

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search, CPMC, MCG

Replace the use of external object proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN proposals; Conv Layer 5 + RPN proposals -> RoI pooling layer -> FC layers -> class scores -> class probabilities]

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN proposals; Conv Layer 5 + RPN proposals -> RoI pooling layer -> FC layers -> class scores -> class probabilities]

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
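The k = 9 reference boxes (anchors) at each position can be generated by combining 3 scales with 3 aspect ratios. The concrete scale and ratio values below are assumptions recalled from the Faster R-CNN paper, and the anchors are centred at the origin for simplicity.

```python
import numpy as np


def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x1, y1, x2, y2) centred at 0."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)


anchors = make_anchors()
print(anchors.shape)
```

Each anchor keeps the area of its scale (s x s) while changing shape, and the RPN predicts one objectness score pair and one box regression per anchor.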

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN proposals; Conv Layer 5 + RPN proposals -> RoI pooling layer -> FC layers -> class scores -> class probabilities]

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN proposals; Conv Layer 5 + RPN proposals -> RoI pooling layer -> FC layers -> class scores -> class probabilities]

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN proposals]

Step 1: Train the RPN, initialized with an ImageNet pre-trained model.

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5; RPN proposals (learned in Step 1) -> class probabilities]

Step 2: Train Fast R-CNN with the learned RPN proposals.

ImageNet weights (fine-tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Detection: Faces: DDFD: Test

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
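
The equivalence behind this conversion can be checked numerically. The sketch below, written for illustration only (the function name and tensor layout are assumptions), reshapes the weights of a fully-connected layer into k x k kernels and slides them over a larger feature map, which yields one score per window position, that is, a heat-map.

```python
import numpy as np


def fc_as_conv(feature_map, fc_weights, fc_bias, k):
    """Apply an FC layer as a convolution.

    feature_map: (C, H, W) array; fc_weights: (n_out, C*k*k); fc_bias: (n_out,).
    Returns a heat-map of shape (n_out, H-k+1, W-k+1).
    """
    C, H, W = feature_map.shape
    n_out = fc_weights.shape[0]
    kernels = fc_weights.reshape(n_out, C, k, k)  # same weights, viewed as conv kernels
    out = np.empty((n_out, H - k + 1, W - k + 1))
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            patch = feature_map[:, y:y + k, x:x + k]
            out[:, y, x] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2])) + fc_bias
    return out
```

Each output location equals what the original FC layer would produce on the corresponding k x k crop, so no window has to be cropped and processed separately.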

Detection: Faces: DDFD: Test

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
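
A minimal sketch of greedy NMS, assuming boxes in `(x1, y1, x2, y2)` format and an illustrative IoU threshold of 0.3. It is not the implementation from the cited source, only the general technique.

```python
def _iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with an already kept one.
        if all(_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Calling `nms` on two heavily overlapping detections and one distant detection keeps the higher-scoring box of the overlapping pair plus the distant one.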

Detection: Faces: DDFD: Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag; me, my bag]

Faces: Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
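
The slides do not spell out the loss applied to each triplet. As a hedged sketch, the triplet loss described in the FaceNet paper asks the anchor to be closer to the positive (same identity) than to the negative (different identity) by a margin. The margin value 0.2 and the function names below are assumptions made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings given as sequences of floats."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

A triplet whose negative is already far from the anchor gives zero loss and no gradient, which is why the choice of triplets matters during training.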

Faces: Recognition: FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces: Recognition: FaceNet: Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces: Recognition: FaceNet: Software

Software implementation: OpenFace

Faces: Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects: Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image Database + Visual Query ("A dog") → Expected outcome

Image Database + Visual Query ("This dog") → Expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects: Recognition: Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
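
Because bag-of-visual-words vectors are highly sparse, they are commonly stored in an inverted file that maps each visual word to the images containing it. The sketch below illustrates that data structure. The image IDs, word IDs and the simple vote-counting score are made up for the example and are not taken from the slide.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word ids present in that image."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank images by how many distinct query words they share (a simple voting score)."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()
```

Only the posting lists of the words present in the query are visited, so images sharing no word with the query are never touched.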

Objects: Recognition: Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects: Recognition: Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
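
Sum and max pooling both collapse a (C, H, W) convolutional feature map into one C-dimensional descriptor per image. A minimal sketch, in which the function name and the final L2 normalisation are assumptions rather than details from the cited papers:

```python
import numpy as np


def global_pool(conv_features, mode="sum"):
    """Pool a (C, H, W) feature map over all spatial locations and L2-normalise."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    flat = conv_features.reshape(conv_features.shape[0], -1).astype(float)
    v = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Two images can then be compared with a dot product between their pooled descriptors.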

Objects: Recognition: Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Image resolution (336x256) → conv5_1 from VGG16 [1] → feature map (42x32) → 25K centroids → 25K-D vector
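
The pipeline above can be read as a bag-of-words encoding of local convolutional features: every spatial location of the feature map is assigned to its nearest centroid, and the image is described by the histogram of assignments. The sketch below uses toy sizes, and the function name and array layout are assumptions.

```python
import numpy as np


def bow_encode(conv_features, centroids):
    """conv_features: (C, H, W); centroids: (K, C).

    Returns the (H, W) assignment map of visual words and the K-bin histogram.
    """
    C, H, W = conv_features.shape
    local = conv_features.reshape(C, H * W).T.astype(float)  # one C-dim feature per location
    cents = np.asarray(centroids, dtype=float)
    # Squared distances via ||x||^2 - 2 x.c + ||c||^2, avoiding a (H*W, K, C) temporary.
    d2 = (local ** 2).sum(axis=1)[:, None] - 2.0 * local @ cents.T + (cents ** 2).sum(axis=1)[None, :]
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(cents))
    return words.reshape(H, W), hist
```

With a 42x32 feature map and 25K centroids, at most 1,344 of the 25,000 histogram bins can be non-zero, which is what makes the representation sparse.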

Objects: Recognition: Retrieval

Query representation: Global Search (GS) and Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects: Segmentation

Semantic segmentation: assign a category label to all pixels in an image. (Slide credit: Eduard Fontdevila)

Objects: Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

- Pyramid of three spatial scales
- The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)
- Non-linearity: tanh. Pooling: max
- Upsampling and concatenation
- Pixel-wise soft-max classifier
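
The last two steps of the list above can be sketched as follows: feature maps from the three scales are upsampled to a common resolution, concatenated along the channel axis, and classified independently at every pixel. Nearest-neighbour upsampling, the linear classifier `W`, `b` and all function names are assumptions made for this illustration.

```python
import numpy as np


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)


def pixelwise_softmax(scores):
    """Softmax over the class axis (axis 0) at every pixel."""
    z = scores - scores.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)


def multiscale_label(fmaps, factors, W, b):
    """Upsample, concatenate channels, apply a linear classifier and softmax per pixel."""
    stacked = np.concatenate(
        [upsample_nearest(f, k) for f, k in zip(fmaps, factors)], axis=0)
    scores = np.tensordot(W, stacked, axes=([1], [0])) + b[:, None, None]
    probs = pixelwise_softmax(scores)
    return probs.argmax(axis=0), probs
```

Since every pixel is classified on its own, neighbouring pixels can receive different labels even inside one object.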

Objects: Segmentation: Farabet

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level. BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
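
The optimal-cover rule can be sketched as a walk from every leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree, the node names and the cost values below are hypothetical, and the cost S is taken as given.

```python
def optimal_cover(parent, cost, leaves):
    """parent: dict node -> parent node (the root maps to None); cost: dict node -> cost S.

    Returns, for every leaf, the node with minimal cost on its leaf-to-root path.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)
        best[leaf] = best_node
    return best


# Hypothetical tree: leaves p1, p2 under C2; C2 and leaf p3 under the root C5.
parent = {"p1": "C2", "p2": "C2", "C2": "C5", "p3": "C5", "C5": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "C2": 0.5, "C5": 0.2}
cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

In this toy tree the root C5 has the lowest cost, so every pixel, including those under C2, is labelled with the class of C5.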

Objects: Segmentation: SDS (Slide credit: Eduard Fontdevila)

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale combinatorial grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects: Segmentation: SDS (Slide credit: Eduard Fontdevila)

Vector A = [1 2]: feature vector 1 and feature vector 2 both come from the BBOX CNN; for feature vector 2 the background is masked out with the mean image. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B = [1 2]: feature vector 1 comes from the BBOX CNN and feature vector 2 from the REGION CNN.
- Training: 2 networks trained in isolation
- Testing: results are combined

Vector C:
- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

Objects: Segmentation: SDS (Slide credit: Eduard Fontdevila)

Penultimate fully connected layer → SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
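
For reference, pixel IU (the Jaccard index) for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A small sketch with made-up label maps and an assumed function name:

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class Jaccard index between two integer label maps of the same shape.

    Classes absent from both maps are skipped.
    """
    result = {}
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        result[c] = np.logical_and(p, g).sum() / union
    return result


pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
scores = pixel_iu(pred, gt, num_classes=3)
```

Averaging the per-class values gives a single mean IU figure for a method.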

Objects: Segmentation: SDS (Slide credit: Eduard Fontdevila)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 26: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection: Objects: Fast R-CNN (Slide credit: Amaia Salvador)

RoI pooling on CONV5: an h x w region of interest is divided into H' x W' pooling bins.

Size of pooling bins: h/H' x w/W'

Max pooling is applied within each bin.
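
A minimal sketch of RoI max pooling, assuming the region is given in feature-map coordinates as `(y0, x0, y1, x1)` with exclusive end indices. The bin boundaries use floor and ceiling rounding, which is one common choice and not necessarily the one used in the Fast R-CNN code.

```python
import numpy as np


def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the region roi of a (C, H, W) map into a fixed (C, out_h, out_w) output."""
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0  # region size; each bin covers roughly h/out_h x w/out_w cells
    out = np.empty((fmap.shape[0], out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        ys = y0 + (i * h) // out_h            # floor of the bin start
        ye = y0 + -(-((i + 1) * h) // out_h)  # ceiling of the bin end
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + -(-((j + 1) * w) // out_w)
            out[:, i, j] = fmap[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```

Whatever the size of the region, the output has a fixed size, which is what lets the following FC layers accept proposals of any shape.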

Detection: Objects: Fast R-CNN (Slide credit: Amaia Salvador)

AlexNet [4], VGG16 [5], VGG_1024 [6]

Multi-task loss
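
The slide only names the loss. For reference, the multi-task loss as formulated in the Fast R-CNN paper combines a classification term and a box-regression term. The symbols below follow that paper and do not appear on the slide: p is the predicted class distribution, u the true class (0 for background), t^u the predicted box offsets for class u, v the regression targets and λ a balancing weight.

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \geq 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad
L_{\mathrm{cls}}(p, u) = -\log p_{u}
```

The indicator [u ≥ 1] switches off the localisation term for background regions, which have no ground-truth box.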

Detection: Objects: Faster R-CNN (Slide credit: Amaia Salvador)

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems.

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN).

Detection: Objects: Faster R-CNN (Slide credit: Amaia Salvador)

[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores / Class probabilities]

RPN outputs:
- Objectness scores (object / no object)
- Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios).
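
The k = 9 reference boxes (anchors) at one position can be enumerated as 3 scales times 3 aspect ratios. In the sketch below, the default scales (128, 256, 512 pixels) and ratios (0.5, 1, 2) are the values I recall from the Faster R-CNN paper and should be treated as assumptions, as should the function name.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box has area scale**2 and height/width equal to the ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

The RPN then predicts one objectness score pair and one set of box offsets for each of these anchors at every sliding-window position.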

Detection: Objects: Faster R-CNN (Slide credit: Amaia Salvador)

[Diagram: Conv layers → Conv Layer 5 → RPN → RPN Proposals → RoI pooling layer → FC layers → Class scores / Class probabilities]

Fast R-CNN

4-step training to share features for RPN and Fast R-CNN.

Detection: Objects: Faster R-CNN (Slide credit: Amaia Salvador)

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection: Objects: Faster R-CNN (Slide credit: Amaia Salvador)

[Tables: Detection accuracy (Pascal VOC); Timing in ms (Pascal VOC)]

Detection: Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Detection: Objects: Reinforcement Learning (Slide credit: Míriam Bellver)

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Set of states S: (o, h)
- o = feature vector from a pre-trained CNN (fc6), 4096 dim
- h = history of taken actions, binary vector, dim 90

Reward function R: computed with respect to the ground-truth bounding box.

The reward function considers the number of steps as a cost.

Reward function R for the trigger action: magnitude 3, minimum IoU 0.6.
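
The two rewards can be sketched as below. The exact formulas are not on the slides: the sign-of-IoU-change reward for transformation actions follows my reading of the Caicedo and Lazebnik paper and is an assumption here, as are the function names. The values 3 and 0.6 are the ones given above.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def transformation_reward(box, new_box, gt_box):
    """+1 if the move improved the IoU with the ground truth, -1 otherwise."""
    return 1.0 if iou(new_box, gt_box) > iou(box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps the ground truth enough, -eta otherwise."""
    return eta if iou(box, gt_box) >= min_iou else -eta
```

Because every non-improving step earns a negative reward, long search sequences are penalised, which is one way to read the remark that the number of steps acts as a cost.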

Detection: Objects: Reinforcement Learning (Slide credit: Míriam Bellver)

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
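
A minimal sketch of the two ingredients named above: greedy action selection from the Q-network outputs, and a replay memory that stores past experiences for training. The class and function names and the tuple layout are illustrative assumptions, and the Q-network itself is left out.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # the oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch used to update the Q-network."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

At test time only `greedy_action` is needed: the network scores all actions for the current state, and the agent applies the best one until the trigger action is selected.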

Detection: Objects: Reinforcement Learning (Slide credit: Míriam Bellver)

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Objects Recognition Retrieval

Image Database

Visual Query

"This dog"

Expected outcome

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features (e.g. SIFT)

Bag of Visual Words

N-dimensional feature space

High-dimensional, highly sparse
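A minimal sketch of the Bag of Visual Words pipeline with an inverted file, assuming a tiny 2-D codebook and toy descriptors instead of real SIFT features. The function names (`assign_words`, `build_inverted_file`, `candidates`) and all values are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

import numpy as np


def assign_words(descriptors, codebook):
    """Map each local descriptor to the index of its nearest centroid (visual word)."""
    # Squared Euclidean distance between every descriptor and every centroid.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def build_inverted_file(image_words):
    """image_words: dict image_id -> array of visual word indices."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for w in set(int(x) for x in words):
            inverted[w].add(image_id)
    return inverted


def candidates(query_words, inverted):
    """Images sharing at least one visual word with the query, ranked by shared words."""
    votes = defaultdict(int)
    for w in set(int(x) for x in query_words):
        for image_id in inverted.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))


# Toy codebook with 3 visual words (assumption: real codebooks hold thousands).
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
image_words = {
    1: assign_words(np.array([[0.1, 0.2], [9.5, 0.3]]), codebook),
    2: assign_words(np.array([[0.2, 9.8]]), codebook),
}
inverted = build_inverted_file(image_words)
query = assign_words(np.array([[9.9, 0.1], [0.0, 0.1]]), codebook)
ranked = candidates(query, inverted)
```

Because the representation is highly sparse, only the images listed under the query's visual words need to be visited.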

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105)

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014

Objects Recognition Retrieval

Convolutional Neural Networks

Sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
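A minimal sketch of sum and max pooling over a convolutional feature map to obtain one global descriptor per image. The (C, H, W) layout, the function name `pool_conv_features` and the final L2 normalization are assumptions made for illustration, not details taken from the slide.

```python
import numpy as np


def pool_conv_features(fmap, mode="sum"):
    """Collapse a (C, H, W) activation map into an L2-normalized C-dim vector."""
    if mode == "sum":
        v = fmap.sum(axis=(1, 2))   # sum pooling over all spatial locations
    elif mode == "max":
        v = fmap.max(axis=(1, 2))   # max pooling over all spatial locations
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


# Toy activation map: 2 channels on a 2x3 spatial grid.
fmap = np.arange(12, dtype=float).reshape(2, 2, 3)
sum_desc = pool_conv_features(fmap, "sum")
max_desc = pool_conv_features(fmap, "max")
```

Either descriptor has as many dimensions as the layer has channels, regardless of the input resolution.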

Objects Recognition Retrieval

Convolutional Neural Networks

Conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015


Objects Recognition Retrieval

Objects Recognition Retrieval

Resolution (336x256)

conv5_1 from VGG16 [1]

(42x32)

25K centroids, 25K-D vector

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)


Objects Recognition Retrieval

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters are shared by the three convnets

$\theta_i = \theta_0$: filter weights ($H_l$) and biases ($b_l$)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
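The optimal cover rule above can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the node names and the cost values below are hypothetical; they are chosen only so that C2 ends up covered by C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on the leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root maps to None, which ends the walk
        best[leaf] = best_node
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 into C6, and C5 and C6 into the root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7", "C7": None}
# Hypothetical costs S (lower means a purer class distribution).
cost = {"C1": 0.9, "C2": 0.8, "C3": 0.2, "C5": 0.3, "C6": 0.6, "C7": 0.7}

cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

Each leaf then takes the class predicted for its selected component, so no extra parameter has to be tuned to choose the observation level.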

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image; both are concatenated as [1 2]

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from the REGION CNN; both are concatenated as [1 2]

Training: the 2 networks are trained in isolation

Testing: results are combined

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer

SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
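A minimal sketch of the pixel IU (Jaccard index) between a predicted and a ground-truth label map, computed per class. The function name `pixel_iu` and the toy label maps are hypothetical; skipping classes that appear in neither map is an assumption of this sketch.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union between two integer label maps."""
    ious = {}
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious[c] = np.logical_and(p, g).sum() / union
    return ious


# Toy 2x3 label maps with two classes.
pred = np.array([[0, 0, 1], [0, 1, 1]])
gt = np.array([[0, 1, 1], [0, 1, 1]])
ious = pixel_iu(pred, gt, num_classes=2)
mean_iu = sum(ious.values()) / len(ious)
```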

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

AlexNet [4] VGG16 [5] VGG_1024 [6]

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Multi-task loss

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Object proposal computation is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Selective Search, CPMC, MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Figure: Faster R-CNN diagram. Labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]


Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
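A minimal sketch of how k = 9 anchor boxes can be generated from 3 scales and 3 aspect ratios. The concrete values of `base_size`, `scales` and `ratios`, as well as the function name `make_anchors`, are assumptions for illustration; the slide only states the counts.

```python
import numpy as np


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) anchors as (x1, y1, x2, y2) centered on the origin."""
    anchors = []
    for s in scales:
        area = float(base_size * s) ** 2   # every ratio at this scale keeps the same area
        for r in ratios:                   # r = height / width
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)


anchors = make_anchors()
k = anchors.shape[0]  # 3 scales x 3 aspect ratios
```

At every position of the sliding window over the conv feature map, the RPN would predict an objectness score and a box regression for each of these k anchors.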


Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

4-step training to share features between the RPN and Fast R-CNN

[Figure: Faster R-CNN diagram. Labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train the RPN, initialized with an ImageNet pre-trained model

[Figure labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, ImageNet weights (fine-tuned)]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 2: Train Fast R-CNN with the learned RPN proposals

[Figure labels: Conv layers, Conv Layer 5, RPN Proposals (learned in Step 1), Class probabilities, ImageNet weights (fine-tuned)]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again

[Figure labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, Weights from Step 2 (fixed)]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3

[Figure labels: Conv layers, Conv Layer 5, RPN Proposals (learned in Step 3), Class probabilities, Weights from Steps 2 & 3 (fixed)]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]

Detection Objects Reinforcement L

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with an inhibition-of-return (IoR) mark

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of states S

(o, h)

o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward Function R

[Figure labels: ground-truth, bounding box]

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward Function R for trigger action

The reward function considers the number of steps as a cost

Trigger reward magnitude: 3

Minimum IoU: 0.6
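A minimal sketch of the two reward cases, assuming boxes given as (x1, y1, x2, y2). The trigger reward uses the values 3 and 0.6 shown on the slide; the per-step reward as the sign of the IoU change is my reading of the cited paper rather than something spelled out on the slide, and treating ties as negative is an assumption. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt):
    """+1 if the transformation improved the IoU with the ground truth, -1 otherwise."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0


def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """+eta if the final box reaches the minimum IoU tau, -eta otherwise."""
    return eta if iou(box, gt) >= tau else -eta


gt = (0, 0, 10, 10)
r_good = trigger_reward((0, 0, 10, 8), gt)          # IoU = 0.8
r_bad = trigger_reward((0, 0, 5, 5), gt)            # IoU = 0.25
r_move = move_reward((0, 0, 5, 5), (0, 0, 10, 8), gt)
```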

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
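A minimal sketch of the three ingredients named above: a greedy policy over the Q-network outputs, a one-step Q-learning target, and a replay memory. The Q-values are plain arrays here instead of a real network, and the discount factor `gamma`, the class name `ReplayMemory` and the toy transitions are hypothetical.

```python
import random
from collections import deque

import numpy as np


def greedy_action(q_values):
    """Policy: pick the action with the maximum estimated action value."""
    return int(np.argmax(q_values))


def q_target(reward, next_q_values, done, gamma=0.9):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a'), or r at a terminal state."""
    return reward if done else reward + gamma * float(np.max(next_q_values))


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


action = greedy_action(np.array([0.1, 2.0, -1.0]))
target = q_target(1.0, np.array([0.5, 2.0]), done=False)

memory = ReplayMemory(capacity=2)
for t in range(3):
    memory.push(t, 0, 0.0, t + 1, False)  # the oldest experience is dropped
batch = memory.sample(2)
```

Sampling mini-batches from the memory, instead of training on consecutive steps, is the usual reason for keeping such a buffer.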

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks

Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): a sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise

Total samples: 200K positive and 20M negative
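A minimal sketch of the labeling rule above, assuming boxes given as (x1, y1, x2, y2). The function names and the toy face and windows are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """1 (positive) if the window overlaps any annotated face with IoU > threshold, else 0."""
    return int(any(iou(window, face) > threshold for face in faces))


faces = [(10, 10, 60, 60)]
positive = label_window((10, 10, 60, 50), faces)    # IoU = 0.8
negative = label_window((50, 50, 100, 100), faces)  # IoU is about 0.02
```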

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015
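A minimal sketch of why a fully-connected layer can be rewritten as a convolution: the FC weights are reshaped into one kernel per output unit that covers the whole training-size input. The tiny sizes, the random weights and the helper names (`fc_forward`, `conv_valid`) are hypothetical and unrelated to the actual network.

```python
import numpy as np


def fc_forward(x, W, b):
    """Fully-connected layer on a flattened (C, h, w) input."""
    return W @ x.ravel() + b


def conv_valid(x, kernels, b):
    """Stride-1, no-padding cross-correlation: x (C, H, W), kernels (K, C, kh, kw)."""
    K, C, kh, kw = kernels.shape
    _, H, Wd = x.shape
    out = np.zeros((K, H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = (kernels * patch).sum(axis=(1, 2, 3)) + b
    return out


rng = np.random.default_rng(0)
C, h, w, K = 2, 3, 3, 4
W_fc = rng.standard_normal((K, C * h * w))
b = rng.standard_normal(K)
kernels = W_fc.reshape(K, C, h, w)        # the same weights, viewed as conv kernels

small = rng.standard_normal((C, h, w))    # training-size input: one output position
large = rng.standard_normal((C, 5, 6))    # larger input: a map of outputs
heat = conv_valid(large, kernels, b)
```

On the larger input, each position of `heat` equals the FC output for the corresponding window, which is what turns a classifier into a detector producing a heat-map.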

Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
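A minimal sketch of greedy Non-Maximum Suppression with NumPy, assuming boxes with positive area given as (x1, y1, x2, y2) and one score per box. The IoU threshold of 0.5 and the toy detections are hypothetical; this is not the code from the cited source.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]        # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.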

Detection Faces DDFD Results

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces Recognition FaceNet

[Figure: Faces, FaceNet, Euclidean space where distances correspond to face similarity]

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244

Faces Recognition FaceNet

by means of well chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th annual international conference on machine learning, pp. 41-48. ACM, 2009
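A minimal sketch of a triplet loss on embeddings, assuming squared Euclidean distances and a margin of 0.2. The slide text only mentions triplets, so the exact form of the loss, the margin value and the function name `triplet_loss` are assumptions here, and the selection of well chosen triplets is not covered.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin, with squared L2 distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)


a = np.array([1.0, 0.0])          # embedding of the anchor face
same = np.array([1.0, 0.0])       # same identity, identical embedding
other = np.array([0.0, 1.0])      # different identity, far away

easy = triplet_loss(a, same, other)   # already satisfied: zero loss
hard = triplet_loss(a, other, same)   # violated: positive loss
```

Minimizing this quantity pulls faces of the same identity together and pushes different identities apart by at least the margin, which is what makes distances in the embedding space reflect face similarity.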


Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
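The selection rule above can be sketched as a walk from each leaf to the root of the segmentation tree. This is a minimal illustration, not the authors' implementation: the tree, the costs S and every node name other than C2 and C5 are hypothetical.

```python
def optimal_cover(leaves, parent, cost, node_class):
    """For each leaf, pick the node with minimal cost on the leaf-to-root path.

    parent: {node: parent node}; the root has no entry.
    cost: {node: cost S of the component}.
    node_class: {node: class predicted for the component}.
    Returns {leaf: (chosen component, its class)}.
    """
    result = {}
    for leaf in leaves:
        best = node = leaf
        while node in parent:  # climb until the root is reached
            node = parent[node]
            if cost[node] < cost[best]:
                best = node
        result[leaf] = (best, node_class[best])
    return result

# Hypothetical tree: p1 and p2 belong to C2, C2 and p3 belong to C5,
# C5 and p4 belong to the root C7.
parent = {"p1": "C2", "p2": "C2", "C2": "C5", "p3": "C5", "C5": "C7", "p4": "C7"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.1, "C2": 0.6, "C5": 0.2, "C7": 0.5}
node_class = {"p1": "road", "p2": "road", "p3": "car", "p4": "sky",
              "C2": "road", "C5": "car", "C7": "building"}
cover = optimal_cover(["p1", "p2", "p3", "p4"], parent, cost, node_class)
```

In this toy tree the pixels of C2 end up with the class of C5, because C5 has the lowest cost on their path to the root.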

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground, with the background masked out with the mean image. The two are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.
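A small sketch of masking the background with the mean image, assuming images are nested lists of (r, g, b) tuples and the region is given as a boolean mask. The function name and data layout are hypothetical.

```python
def mask_background(image, mask, mean_image):
    """Keep region-foreground pixels and replace the rest with the mean image.

    image[y][x] and mean_image[y][x] are (r, g, b) tuples;
    mask[y][x] is True on the region foreground.
    """
    return [
        [px if keep else mean_px
         for px, keep, mean_px in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]

# Toy 1x2 image: the first pixel is foreground, the second is background.
toy_image = [[(255, 0, 0), (0, 255, 0)]]
toy_mask = [[True, False]]
toy_mean = [[(120, 110, 100), (121, 111, 101)]]
masked = mask_background(toy_image, toy_mask, toy_mean)
```

After mean subtraction the masked pixels become zero, so the network input carries no background appearance.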

Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2. The two are concatenated as [1 2].

Training: the 2 networks are trained in isolation

Testing: results are combined

Vector C:

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

The output of the penultimate fully connected layer is classified with an SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
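For reference, a minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with the category in both the prediction and the ground truth, divided by the number labelled with it in either. The label maps below are made up.

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labelled `category`."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else float("nan")

# Toy 2x3 label maps (1 = object, 0 = background).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iu = pixel_iu(pred, truth, category=1)
```

Here two pixels agree and four pixels are covered by either map, so the IU is 0.5.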


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 28: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Fast R-CNN

Slide credit: Amaia Salvador

Multi-task loss
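As a hedged illustration of the multi-task loss, the sketch below follows the form given in the Fast R-CNN paper: a log loss on the true class plus, for non-background RoIs, a smooth L1 loss on the four box regression offsets. The function and argument names are hypothetical, and the weight lam = 1 is an assumption.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers than L2)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification loss plus box regression loss for one RoI.

    class_probs: soft-max output over classes, index 0 being background.
    pred_box, target_box: 4 regression offsets for the true class.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:  # background RoIs have no ground-truth box
        return cls_loss
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss

# Toy RoI of class 1 with offsets that differ by 0.5 and by 2.0.
loss = multi_task_loss([0.1, 0.9], 1, (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 2.0))
```

Training both terms jointly lets classification and box regression share the same network instead of being trained in separate stages.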

Detection Objects Faster R-CNN

Ren, S., He, K., Girshick, R., and Sun, J. (2015). "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems (pp. 91-99). [Python code] [Matlab code]

Slide credit: Amaia Salvador

Object proposal computation (Selective Search, CPMC, MCG) is the bottleneck in current state-of-the-art object detection systems

Selective Search: Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011, November). "Segmentation as selective search for object recognition." In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 1879-1886). IEEE.

CPMC: Carreira, J., & Sminchisescu, C. (2010, June). "Constrained parametric min-cuts for automatic object segmentation." In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 3241-3248). IEEE.

MCG: Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). "Multiscale combinatorial grouping." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 328-335).

Replace the usage of external object proposals with a Region Proposal Network (RPN)

Architecture (diagram): the conv layers, up to Conv Layer 5, feed the RPN, which outputs the RPN proposals. The proposals go through the RoI pooling layer and the FC layers to produce class scores and class probabilities.

RPN outputs: objectness scores (object / no object) and bounding box regression

In practice k = 9 (3 different scales and 3 aspect ratios)
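A sketch of how k = 9 reference boxes can be generated at one location from 3 scales and 3 aspect ratios. The concrete scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) are the ones reported in the Faster R-CNN paper and are treated here as assumptions; the function name is hypothetical.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box has area scale**2 and aspect ratio r = height / width.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0, 0)
```

The RPN predicts one objectness score and one box refinement per anchor, so each location yields k sets of outputs.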

The branch that takes the RPN proposals through the RoI pooling layer and the FC layers to the class scores is Fast R-CNN.

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. Weights from Steps 2 & 3 (fixed).

Results (tables not recovered): Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)

Detection Objects Reinforcement

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Slide credit: Míriam Bellver

The object is localized based on visual features from AlexNet FC6

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)

Reward function R: based on the overlap with the ground-truth bounding box

Reward function R for the trigger action: magnitude 3, minimum IoU 0.6

The reward function considers the number of steps as a cost
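A sketch of the two rewards, assuming boxes are (x1, y1, x2, y2) tuples. The values 3 and 0.6 come from the slide; the sign-based reward for transformation actions follows the description in the paper, and the function names and tie handling are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation improved the IoU with the ground truth, else -1."""
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1

def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth by at least tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta

gt = (0, 0, 10, 10)
r_move = move_reward((5, 5, 15, 15), (2, 2, 12, 12), gt)
r_early = trigger_reward((2, 2, 12, 12), gt)
r_good = trigger_reward((0, 0, 10, 9), gt)
```

Stopping at the second box is penalised because its IoU is below 0.6, while the tighter third box earns the positive trigger reward.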

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
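The policy and the Q-learning update target can be sketched as follows. The list `q_values` stands for the outputs of the Q-network (one per action); the exploration rate `epsilon` and the discount `gamma` are assumed hyperparameters, and the function names are hypothetical.

```python
import random

def select_action(q_values, epsilon=0.0, rng=random):
    """Pick the action with the maximum estimated value.

    With probability epsilon a random action is taken instead (exploration
    during training); epsilon = 0 gives the greedy policy used at test time.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Regression target for one experience sampled from the replay memory."""
    return reward if terminal else reward + gamma * max(next_q_values)

action = select_action([0.1, 0.7, 0.3])
target = q_target(1.0, [0.5, 2.0], gamma=0.5)
```

The network is trained so that its output for the taken action moves towards this target.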

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, 380k manually annotated facial landmarks.

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise.

Total samples: 200K positive and 20M negative
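The labelling rule for the sampled sub-windows can be sketched as below, assuming boxes are (x1, y1, x2, y2) tuples. The function names and the toy coordinates are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_positive(window, faces, threshold=0.5):
    """A sub-window is positive if it overlaps any annotated face by more than 50% IoU."""
    return any(box_iou(window, face) > threshold for face in faces)

faces = [(0, 0, 100, 100)]
label_near = is_positive((10, 10, 110, 110), faces)
label_far = is_positive((60, 60, 160, 160), faces)
```

Because most random windows miss the faces, negatives vastly outnumber positives, as the totals above show.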

Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
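Three scales per octave means that consecutive scales differ by a factor of 2^(1/3), so the size doubles every three steps. A minimal sketch, with a hypothetical function name and an assumed scale range:

```python
import math

def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale up to max_scale, spaced 2**(1/steps) apart."""
    n = math.floor(steps_per_octave * math.log2(max_scale / min_scale) + 1e-9)
    return [min_scale * 2 ** (i / steps_per_octave) for i in range(n + 1)]

# Two octaves, from half size to double size (the range is an assumption).
scales = pyramid_scales(0.5, 2.0)
```

The detector, which has a fixed input size, is then run on each rescaled copy of the image.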

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
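Running the converted network on a larger image behaves like a 227x227 sliding window evaluated on a regular grid, with one heat-map cell per window position. The sketch below computes that grid; the stride of 32 pixels is an assumption for an AlexNet-like network, padding effects are ignored, and the function names are hypothetical.

```python
def heatmap_shape(height, width, window=227, stride=32):
    """Number of window positions (rows, cols) that fit in the image."""
    if height < window or width < window:
        return (0, 0)
    return ((height - window) // stride + 1, (width - window) // stride + 1)

def window_for_cell(row, col, window=227, stride=32):
    """Image region (x1, y1, x2, y2) scored by one heat-map cell."""
    x1, y1 = col * stride, row * stride
    return (x1, y1, x1 + window, y1 + window)

shape_single = heatmap_shape(227, 227)
shape_large = heatmap_shape(451, 291)
region = window_for_cell(1, 2)
```

The convolutional version shares the computation of overlapping windows, which is where the efficiency gain comes from.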

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
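A minimal sketch of greedy NMS, assuming boxes are (x1, y1, x2, y2) tuples with one score each. The IoU threshold of 0.5 is an assumption, and this is a generic version rather than the implementation from the cited post.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first.

    A box is dropped if it overlaps an already kept, higher-scoring box
    by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

The second box overlaps the first by about 0.68 IoU and is suppressed; the third box is far away and survives.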

Detection Faces DDFD Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training


Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
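The triplet idea can be sketched as a loss on three embeddings: an anchor face, a positive (same identity) and a negative (different identity). The loss is zero once the negative is farther from the anchor than the positive by at least a margin. The margin of 0.2 is the value reported in the FaceNet paper and is treated here as an assumption; the function names and toy embeddings are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

anchor, positive = (0.0, 0.0), (0.1, 0.0)
easy_loss = triplet_loss(anchor, positive, negative=(1.0, 0.0))
hard_loss = triplet_loss(anchor, positive, negative=(0.2, 0.0))
```

Easy triplets contribute nothing to the gradient, which is why choosing informative triplets matters for training.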

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image database + visual query "A dog": expected outcome (figure)

Image database + visual query "This dog": expected outcome (figure)

Instance retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)
vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word   Image ID
1      1, 12
2      1, 30, 102
3      10, 12
4      23
6      10
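A minimal sketch of an inverted file: for each visual word it stores the images that contain it, so a query only touches images sharing at least one word with it. The ranking by number of shared words is a simplification, and the image IDs and words below are made up.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: {image id: iterable of visual word ids} -> {word: set of image ids}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def search(index, query_words):
    """Rank images by the number of distinct query words they contain."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 6]})
ranking = search(index, [1, 2])
```

Because bag-of-words vectors are highly sparse, each word's posting list is short and the lookup stays fast on large databases.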

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "Imagenet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (pp. 1097-1105).

FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). "Neural codes for image retrieval." In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). "CNN features off-the-shelf: an astounding baseline for recognition." In DeepVision CVPRW 2014.

Sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). "Aggregating local deep features for image retrieval." ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). "Particular object retrieval with integral max-pooling of CNN activations." ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). "Cross-dimensional Weighting for Aggregated Deep Convolutional Features." arXiv preprint arXiv:1512.04065.
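Sum and max pooling of conv features both collapse each channel of the feature map to a single number, giving one descriptor entry per channel regardless of the image size. A minimal sketch with a hypothetical function name and a toy feature map:

```python
def pool_conv_features(feature_map, mode="sum"):
    """feature_map[c][y][x] holds the activations of channel c.

    Returns a global descriptor with one value per channel.
    """
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in channel for v in row) for channel in feature_map]

# Toy feature map with 2 channels of size 2x2.
fmap = [[[1, 2], [3, 4]],
        [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(fmap, "sum")
max_descriptor = pool_conv_features(fmap, "max")
```

In practice such descriptors are usually normalised before comparing images, a step omitted here.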

Conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). "Exploiting local features from deep networks for image retrieval." In DeepVision CVPRW 2015.

Image resolution: (336x256)

conv5_1 from VGG16 [1]: (42x32) map of local features

25K centroids: 25K-D vector
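A sketch of how local conv features can be turned into a bag-of-words vector: each local feature (one per position of the 42x32 map, so 1344 per image) is assigned to its nearest centroid, and the histogram of assignments has one dimension per centroid. The toy data uses 2 centroids instead of 25K, and the function names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    def dist(c):
        return sum((f - x) ** 2 for f, x in zip(feature, c))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))

def bag_of_words(local_features, centroids):
    """Histogram of visual-word assignments, one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram

centroids = [(0.0, 0.0), (10.0, 10.0)]
features = [(1.0, 0.0), (9.0, 9.0), (0.0, 2.0), (8.0, 11.0)]
histogram = bag_of_words(features, centroids)
```

With 25K centroids and only about a thousand local features per image, most bins are zero, which suits an inverted file.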

Query representation:

- Global Search (GS)
- Local Search (LS)


Page 29: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

29

Ren S He K Girshick R and Sun J 2015 Faster R-CNN Towards real-time object detection with region proposal networks In Advances in Neural Information Processing Systems (pp 91-99) [Python code] [Matlab code]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Object Proposal computation is the bottleneck in current state of the art object detection systems

Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)

Detection Objects Faster R-CNN

31

Slide credit: Amaia Salvador

[Figure panels: Selective Search, CPMC, MCG]

Replace the external object proposals with a Region Proposal Network (RPN).

Detection Objects Faster R-CNN

32

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]

Detection Objects Faster R-CNN

33

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]

Detection Objects Faster R-CNN

34

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios).
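As a rough sketch of how k = 9 anchors can be enumerated from 3 scales and 3 aspect ratios. The base size of 16 pixels, the scales (8, 16, 32) and the ratios (0.5, 1, 2) are assumed values chosen for illustration, not taken from the slide, and `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (0, 0).

    All default values are illustrative assumptions.
    """
    anchors = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            area = float(base_size * scale) ** 2
            width = math.sqrt(area / ratio)   # keeps width * height == area
            height = width * ratio
            anchors.append((-width / 2, -height / 2, width / 2, height / 2))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
```

In a region proposal network, a set like this would be replicated at every position of the convolutional feature map, and each copy would receive an objectness score and a box regression.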

Detection Objects Faster R-CNN

35

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]

Detection Objects Faster R-CNN

36

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

37

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]

4-step training to share features between the RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals]

Step 1: Train the RPN, initialized with an ImageNet pre-trained model.

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

39

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN Proposals (learned in 1), Class probabilities]

Step 2: Train Fast R-CNN with the learned RPN proposals.

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

40

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals]

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again.

Weights from Step 2 (fixed)

Detection Objects Faster R-CNN

41

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN Proposals (learned in 3), Class probabilities]

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3.

Weights from Steps 2 & 3 (fixed)

Detection Objects Faster R-CNN

42

Slide credit: Amaia Salvador

[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]

Detection Objects Faster R-CNN

43

Slide credit: Amaia Salvador

Detection Objects Faster R-CNN

44

Slide credit: Amaia Salvador

Detection Objects Faster R-CNN

45

Slide credit: Amaia Salvador

Detection Objects Reinforcement L

46

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Miriam Bellver]

Detection Objects Reinforcement L

47

The object is localized based on visual features from AlexNet FC6.

Detection Objects Reinforcement

48

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement

49

Slide credit: Míriam Bellver

Set of actions A

Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark.

Detection Objects Reinforcement

50

Slide credit: Míriam Bellver

Set of states S

s = (o, h)

o = feature vector from a pre-trained CNN (fc6), 4096 dimensions

h = history of taken actions, binary vector of dimension 90
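A minimal sketch of how such a state could be assembled. The slide only gives the two dimensions (4096 and 90); the split of 90 into the last 10 actions, each one-hot encoded over 9 possible actions, is an assumption here, and `encode_history` and `make_state` are hypothetical helper names.

```python
NUM_ACTIONS = 9     # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10    # assumption: 9 * 10 = 90, matching the slide


def encode_history(past_actions):
    """One-hot encode the most recent actions into a 90-dim binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def make_state(o, past_actions):
    """Concatenate the CNN descriptor o with the action history h."""
    return list(o) + encode_history(past_actions)


state = make_state([0.0] * 4096, [3, 3, 8])
print(len(state))  # 4096 + 90 = 4186
```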

Detection Objects Reinforcement

51

Slide credit: Míriam Bellver

Reward function R

[Figure label: ground-truth bounding box]

Detection Objects Reinforcement

52

Slide credit: Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

[Figure annotations: 3; minimum IoU 0.6]
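The formulas on these two slides were images and did not survive in the transcript. The sketch below is one plausible reading, stated as an assumption: a transformation action is rewarded by whether it improves the IoU with the ground-truth box, and the trigger action earns +3 when the final IoU reaches 0.6 and -3 otherwise. The function names and the ±1 values for transformation actions are illustrative.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the move improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """Assumed form: +eta if the final box overlaps enough, -eta otherwise."""
    return eta if iou(box, gt_box) >= min_iou else -eta


gt = (0, 0, 10, 10)
print(move_reward((0, 0, 5, 10), (0, 0, 8, 10), gt))
print(trigger_reward((0, 0, 5, 10), gt))
```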

Detection Objects Reinforcement

53

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

54

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions.

The algorithm incorporates a replay memory to collect experiences.

Category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
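The pieces named above can be sketched in a few lines. This is generic Q-learning, not the authors' code: the network itself is left out, `q_values` stands for its output (one value per action), the discount `gamma = 0.9` is an assumed value, and `ReplayMemory` is a hypothetical minimal buffer.

```python
import random


def greedy_action(q_values):
    """Policy from the slide: the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size buffer of past experiences, sampled at random for training."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, experience):
        self.items.append(experience)
        if len(self.items) > self.capacity:
            self.items.pop(0)          # forget the oldest experience

    def sample(self, batch_size):
        return random.sample(self.items, min(batch_size, len(self.items)))


print(greedy_action([0.1, 0.7, 0.3]))  # index of the largest value: 1
```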

Detection Objects Reinforcement

55

Slide credit: Míriam Bellver

Detection Objects Reinforcement

56

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)

2) Terminal Regions (TR)

Detection Objects Reinforcement

57

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Objects Reinforcement

58

Slide credit: Míriam Bellver

Detection Objects Reinforcement

59

Slide credit: Míriam Bellver

Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz

- 25k annotated faces on images downloaded from Flickr

- 380k manually annotated facial landmarks

Detection Faces DDFD Train

63

Sub-windows (blocks) are sampled at random. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
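A small sketch of what "3 times per octave" means in terms of scale factors: consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. The range 0.5 to 2.0 and the helper name `pyramid_scales` are assumptions for illustration.

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors from min_scale to max_scale, per_octave steps per doubling."""
    step = 2.0 ** (1.0 / per_octave)
    scales = []
    scale = min_scale
    while scale <= max_scale * (1 + 1e-9):   # small tolerance for rounding
        scales.append(scale)
        scale *= step
    return scales


for s in pyramid_scales(0.5, 2.0):
    print(round(s, 3))
```

Each scale factor would be applied to the test image before running the detector, so that a fixed-size detection window can match faces of different sizes.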

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Detection Faces DDFD Test

67

This makes it possible to:

- Efficiently run the convnet on images of any size

- Obtain a heat-map of the face detector
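A toy, single-channel sketch of why the conversion works: the weights of a fully-connected unit that expects a k x k input can be reused as a k x k convolution kernel, and sliding that kernel over a larger image yields one response per window position, which is the heat-map. `fc` and `conv_valid` are hypothetical helper names, and real convnets have many channels and layers that this sketch ignores.

```python
def fc(patch, weights, bias):
    """Fully-connected unit on a k x k patch: dot product of inputs and weights."""
    total = bias
    for patch_row, weight_row in zip(patch, weights):
        for p, w in zip(patch_row, weight_row):
            total += p * w
    return total


def conv_valid(image, weights, bias):
    """The same weights slid over a larger image (stride 1, no padding)."""
    k_h, k_w = len(weights), len(weights[0])
    heat_map = []
    for y in range(len(image) - k_h + 1):
        row = []
        for x in range(len(image[0]) - k_w + 1):
            patch = [r[x:x + k_w] for r in image[y:y + k_h]]
            row.append(fc(patch, weights, bias))
        heat_map.append(row)
    return heat_map


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
weights = [[1, 0],
           [0, 1]]
print(conv_valid(image, weights, bias=0))  # [[7, 9, 11], [15, 17, 19]]
```

An input of exactly the kernel size gives a single value, as the original fully-connected unit would; a larger input gives a grid of values, one per window.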

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
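A minimal sketch of greedy NMS, written from the general description of the technique rather than from the cited source: detections are visited in order of decreasing score, and a detection is kept only if it does not overlap too much with one already kept. The IoU threshold of 0.5 is an assumed value.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the detections that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```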

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models.

- The OpenCV face detector is an implementation of Viola & Jones.

- IMPORTANT: DPM and Headhunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Faces Recognition FaceNet

72

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces Recognition FaceNet

73

Faces

FaceNet

Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

74

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

75

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
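To make the role of triplets concrete, here is a minimal sketch of a triplet loss on embedding vectors: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The slides do not give the formula, so the squared Euclidean distance and the margin of 0.2 are assumptions, and the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# An easy triplet (negative far away) contributes nothing to the loss.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))  # 0.0
```

Triplets whose loss is already zero carry no training signal, which is why the choice of triplets matters.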

Faces Recognition FaceNet

76

Faces Recognition FaceNet

77

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

78

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet

79

Faces Recognition FaceNet Test

80

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

81

Software implementation: OpenFace

Faces Recognition VGG Face

82

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

83

E. Mohedano, Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N. and Marqués F., “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

Objects Recognition Retrieval

84

Visual Query: “A dog”

Image Database

Expected outcome

Objects Recognition Retrieval

85

Visual Query: “This dog”

Image Database

Expected outcome

Objects Recognition Retrieval

86

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

87

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space: v_1 = (v_11, …, v_1n), …, v_k = (v_k1, …, v_kn)

Bag of Visual Words: high-dimensional, highly sparse

[Table: INVERTED FILE, with columns "word" and "Image ID"]
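A minimal sketch of an inverted file: for every visual word it stores the IDs of the images that contain it, so a query only touches images sharing at least one word with it. The toy database, the vote-count ranking and the helper names are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


database = {1: [1, 2], 12: [1, 3], 30: [2]}   # image ID -> visual words (toy data)
index = build_inverted_file(database)
print(index[1])                    # images containing word 1: [1, 12]
print(candidates(index, [1, 2]))   # [1, 12, 30]
```

Because bag-of-words vectors are highly sparse, most images are never visited for a given query.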

Objects Recognition Retrieval

88

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

89

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

90

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A. & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R. & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C. & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
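A minimal sketch of sum and max pooling of a convolutional feature map into a global descriptor: each channel is reduced over all spatial positions, giving one value per channel. The L2 normalization step and the helper names are assumptions, and the cited papers add refinements that are not shown here.

```python
import math


def pool(feature_map, mode="sum"):
    """feature_map[c][y][x] -> one value per channel c (sum or max over y, x)."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_map]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


feature_map = [[[1, 2], [3, 4]],     # channel 0, 2 x 2 positions
               [[0, 5], [0, 0]]]     # channel 1
print(pool(feature_map, "sum"))      # [10, 5]
print(pool(feature_map, "max"))      # [4, 5]
```

Two images can then be compared with a single dot product between their normalized descriptors.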

Objects Recognition Retrieval

91

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F. & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

92

Objects Recognition Retrieval

93

Image resolution: (336x256)

conv5_1 from VGG16 [1]: (42x32)

25K centroids, 25K-D vector
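A toy sketch of the encoding step suggested by these figures: every location of the 42x32 conv5_1 map gives one local feature, each feature is assigned to its nearest centroid, and the image is described by the histogram of assignments (one bin per centroid, hence 25K-D for 25K centroids). The tiny 2-D features and 2 centroids below are made up for illustration, and the helper names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])))


def bow_histogram(local_features, centroids):
    """Count how many local features fall on each centroid (visual word)."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


centroids = [[0, 0], [10, 10]]
local_features = [[1, 0], [9, 9], [0, 1]]
print(bow_histogram(local_features, centroids))  # [2, 1]
```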

Objects Recognition Retrieval

94

Query Representation

Global Search (GS)

Local Search (LS)

Objects Recognition Retrieval

95

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

97

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters in the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier
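A minimal sketch of a pixel-wise soft-max classifier: the class scores at each pixel are turned into probabilities independently of every other pixel, and the label is the most probable class. The score values and helper names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_map(score_map):
    """score_map[y][x] holds per-class scores; return the arg-max class per pixel."""
    labels = []
    for row in score_map:
        labels.append([max(range(len(px)), key=lambda c: softmax(px)[c]) for px in row])
    return labels


scores = [[[0.0, 1.0], [2.0, 0.0]]]   # 1 x 2 image, 2 classes
print(label_map(scores))              # [[1, 0]]
```

Since every pixel is classified on its own, nothing in this step forces neighbouring pixels to agree.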

Objects Segmentation Farabet

103

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

105

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
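The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree shape and the cost values below are made up so that C2 ends up covered by C5 as in the slide's example; how the cost S is computed is not shown in the transcript and is left out, and `optimal_cover` is a hypothetical helper name.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)      # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.1, "C2": 0.6, "C3": 0.2, "C4": 0.3,
        "C5": 0.2, "C6": 0.5, "C7": 0.4}
print(optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"]))
```

With these costs C2 is assigned to C5, so C2 would take the class predicted for C5, while the other leaves keep their own component.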

Objects Segmentation SDS

109

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014).

Objects Segmentation SDS

110

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

- Cuts algorithm

- Hierarchical segmenter

- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit: Eduard Fontdevila

[Diagram labels: BBOX CNN, feature vector 1; BBOX CNN (background masked out with the mean image), feature vector 2; [1 2] = vector A]

The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Objects Segmentation SDS

112

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Diagram labels: BBOX CNN, feature vector 1; REGION CNN, feature vector 2; [1 2] = vector B]

Objects Segmentation SDS

113

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

[Diagram label: vector C]

Objects Segmentation SDS

114

Slide credit: Eduard Fontdevila

[Diagram labels: penultimate fully connected layer, SVM]

Objects Segmentation SDS

115

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index), used to evaluate semantic segmentation

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
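For reference, a minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The tiny label maps and the helper name `pixel_iu` are illustrative assumptions.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    return intersection / union if union else float("nan")


pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [0, 0]]
print(pixel_iu(pred, truth, category=1))  # 0.5
```

A dataset-level score would typically average this value over categories.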

Objects Segmentation SDS

117

Slide credit: Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Thank you

119

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 30: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

30

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Object Proposal computation is the bottleneck in current state of the art object detection systems

Selective Search Van de Sande K E Uijlings J R Gevers T amp Smeulders A W (2011 November) Segmentation as selective search for object recognition InComputer Vision (ICCV) 2011 IEEE International Conference on (pp 1879-1886) IEEECPMC Carreira J amp Sminchisescu C (2010 June) Constrained parametric min-cuts for automatic object segmentation In Computer Vision and Pattern Recognition (CVPR) 2010 IEEE Conference on (pp 3241-3248) IEEEMCG Arbelaacuteez P Pont-Tuset J Barron J Marques F amp Malik J (2014) Multiscale combinatorial grouping In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp 328-335)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

31

Slide credit Amaia Salvador

Selective Search CPMC

MCG

Replace the usage of external Object Proposals with a Region Proposal Network (RPN)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

32

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

33

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores(objectno object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

4-step training to share features for RPN and Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 1 Train RPN initialized with an ImageNet pre-trained model

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 1)

Class probabilities

Step 2 Train Fast R-CNN with learned RPN proposals

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and a category-specific Q-network is learnt.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
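A toy sketch of the greedy policy and the replay memory. The linear "network", its sizes and the epsilon-greedy exploration parameter are assumptions for illustration only; the slide only states that the Q-network has one output per action and that experiences are stored.

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS = 4186, 9   # 4096 + 90 state dims; 9 actions is assumed

# Toy linear Q-network: one output unit per action
W = rng.normal(scale=0.01, size=(NUM_ACTIONS, STATE_DIM))
b = np.zeros(NUM_ACTIONS)

def q_values(state):
    return W @ state + b

def select_action(state, epsilon=0.0):
    """Random action with probability epsilon, otherwise argmax of Q."""
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    return int(np.argmax(q_values(state)))

# Replay memory holding (state, action, reward, next_state) tuples
replay_memory = deque(maxlen=1000)

state = rng.normal(size=STATE_DIM)
action = select_action(state)            # greedy, since epsilon defaults to 0
replay_memory.append((state, action, 1.0, state))
```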

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)

2) Terminal Regions (TR)

Detection Objects Reinforcement

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Detection Faces DDFD Train

Sub-windows (blocks) are randomly sampled. A sample is positive if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative otherwise.

Total samples: 200K positive and 20M negative
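The labelling rule can be sketched as below, under the assumption that boxes are `(x1, y1, x2, y2)` tuples and that windows are square; `sample_window` and `label_window` are hypothetical helper names.

```python
import random

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_window(img_w, img_h, size, rng):
    """Draw a random square sub-window fully inside the image."""
    x = rng.randint(0, img_w - size)
    y = rng.randint(0, img_h - size)
    return (x, y, x + size, y + size)

def label_window(window, faces, threshold=0.5):
    """Return 1 (positive) if the IoU with any annotated face exceeds the threshold."""
    return int(any(iou(window, f) > threshold for f in faces))

faces = [(10, 10, 60, 60)]
window = sample_window(200, 200, 50, random.Random(0))
label = label_window(window, faces)
```

Random sampling of this kind yields far more negatives than positives, which matches the 200K versus 20M imbalance reported on the slide.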

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
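One way to read "3 times per octave" is that consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. A short sketch under that assumption, where the scale limits are arbitrary illustrative values:

```python
def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scales spaced by a factor of 2**(1/per_octave) between the two limits."""
    step = 2.0 ** (1.0 / per_octave)
    scales, s = [], min_scale
    while s <= max_scale * (1 + 1e-9):   # small tolerance for rounding error
        scales.append(s)
        s *= step
    return scales

# Two octaves (0.5x to 2x) give 7 scales in total
scales = pyramid_scales(0.5, 2.0)
```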

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size

- Obtain a heat-map of the face detector
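The conversion can be checked numerically on toy sizes: reshaping the weights of a fully-connected layer into convolution filters and sliding them over a larger input gives, at every position, the same output as applying the original layer to that window. The sizes below are arbitrary and `conv_valid` is a deliberately naive helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, OUT = 4, 3, 2                          # toy sizes: channels, window size, FC outputs
W_fc = rng.normal(size=(OUT, C * K * K))     # FC layer defined on C x K x K inputs
W_conv = W_fc.reshape(OUT, C, K, K)          # same weights viewed as OUT conv filters

def conv_valid(x, w):
    """Naive 'valid' cross-correlation of x (C, H, W) with filters w (OUT, C, K, K)."""
    out_ch, _, k, _ = w.shape
    H, Wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((out_ch, H, Wd))
    for i in range(H):
        for j in range(Wd):
            patch = x[:, i:i + k, j:j + k]
            y[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y

image = rng.normal(size=(C, 5, 6))           # larger than the K x K window
heat_map = conv_valid(image, W_conv)         # one score vector per window position

# Each heat-map cell equals the FC layer applied to the corresponding window
window = image[:, 1:1 + K, 2:2 + K]
fc_out = W_fc @ window.ravel()
```

The heat-map is computed in a single pass, with shared computation between overlapping windows, instead of one forward pass per sliding-window position.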

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
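A compact sketch of greedy NMS, written for this transcript rather than copied from the cited source. The 0.5 overlap threshold and the `(x1, y1, x2, y2)` box format are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]                  # indices by decreasing score
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)
        order = rest[overlap <= iou_threshold]        # survivors for the next round
    return keep

kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
```

In the usage line, the second box overlaps the first with IoU of about 0.68 and is suppressed, while the distant third box survives.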

Detection Faces DDFD Results

Precision vs. recall curves

- DPM corresponds to Deformable Part-based Models.

- The OpenCV face detector is an implementation of Viola & Jones.

- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

Figure: FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
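A minimal sketch of a triplet loss on embeddings, as a way to make the "triplets" idea concrete. The hinge form on squared Euclidean distances and the margin of 0.2 are assumptions based on our reading of the FaceNet paper; they are not stated on these slides.

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """The anchor-positive distance should be smaller than anchor-negative by a margin."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

a = l2_normalize(np.array([1.0, 0.0]))
same = l2_normalize(np.array([2.0, 0.0]))      # same identity, same direction
other = l2_normalize(np.array([0.0, 3.0]))     # different identity

easy = triplet_loss(a, same, other)            # already satisfied, loss is 0
hard = triplet_loss(a, other, same)            # violated, loss is positive
```

Triplets with zero loss contribute no gradient, which is why the choice of informative triplets matters during training.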

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

Figure: a visual query ("A dog") is matched against an image database to produce the expected outcome.

Objects Recognition Retrieval

Figure: a visual query ("This dog") is matched against an image database to produce the expected outcome.

Objects Recognition Retrieval

Instance retrieval (instance: object, building, person, place…)

Objects Recognition Retrieval

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features (e.g. SIFT): N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse
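Because bag-of-visual-words vectors are highly sparse, they are usually indexed with an inverted file that maps each visual word to the images containing it. A small sketch with made-up image IDs and words (not the values shown in the slide's table):

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return {w: sorted(ids) for w, ids in index.items()}

def candidates(index, query_words):
    """Images sharing at least one word with the query, ranked by shared words."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

# Hypothetical toy database: image ID -> visual words found in that image
toy = {1: [1, 2], 10: [3, 4], 12: [1, 3]}
index = build_inverted_file(toy)
ranked = candidates(index, [1, 3])
```

Only the postings of the query's own words are visited, so images that share no word with the query are never touched.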

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max pooled conv features as global representation
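Sum and max pooling of a convolutional activation volume can be sketched in a few lines. The `(C, H, W)` layout and the final L2 normalization are assumptions made for this example; the cited papers add further steps that are omitted here.

```python
import numpy as np

def global_descriptor(conv_maps, mode="sum"):
    """Pool a (C, H, W) activation volume over space into an L2-normalized C-dim vector."""
    if mode == "sum":
        v = conv_maps.sum(axis=(1, 2))
    elif mode == "max":
        v = conv_maps.max(axis=(1, 2))
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy volume with 2 channels of size 3x4
maps = np.arange(24, dtype=float).reshape(2, 3, 4)
d_sum = global_descriptor(maps, "sum")
d_max = global_descriptor(maps, "max")
```

Either way the descriptor length equals the number of channels, regardless of the spatial size of the input image.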

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Objects Recognition Retrieval

Figure labels: input resolution 336x256; conv5_1 from VGG16 [1]; 42x32; 25K centroids; 25K-D vector
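A sketch of how local convolutional features could be turned into a bag of words: each spatial location is assigned to its nearest centroid, and the assignments are counted in a histogram. The toy sizes stand in for the 25K centroids on the slide, and the function names are illustrative.

```python
import numpy as np

def assign_words(conv_maps, centroids):
    """Assign the C-dim feature at each spatial location to its nearest centroid."""
    C, H, W = conv_maps.shape
    feats = conv_maps.reshape(C, H * W).T                          # (H*W, C)
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1).reshape(H, W)                         # assignment map

def bow_histogram(assignment_map, num_words):
    """Count how often each visual word occurs in the assignment map."""
    return np.bincount(assignment_map.ravel(), minlength=num_words)

# Toy example: 3 centroids in a 3-D feature space, a 1x2 feature map
centroids = np.eye(3)
conv_maps = np.array([[[1.0, 0.0]], [[0.0, 0.0]], [[0.0, 1.0]]])   # shape (3, 1, 2)
amap = assign_words(conv_maps, centroids)
hist = bow_histogram(amap, num_words=3)
```

Keeping the assignment map, not only the histogram, allows a histogram to be built later for any sub-window of the image.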

Objects Recognition Retrieval

Query representation: Global Search (GS) and Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters are used in the three convnets:

θ_i = θ_0 = (filter weights H_l and biases b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier
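A minimal sketch of a pixel-wise soft-max classifier applied to a concatenated feature volume. The linear classifier, the `(D, H, W)` layout and all sizes are assumptions for illustration.

```python
import numpy as np

def pixelwise_softmax(scores):
    """scores: (num_classes, H, W) -> per-pixel class probabilities."""
    shifted = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def classify_pixels(features, weights, bias):
    """Linear classifier applied independently at each pixel of (D, H, W) features."""
    scores = np.tensordot(weights, features, axes=([1], [0])) + bias[:, None, None]
    probs = pixelwise_softmax(scores)
    return probs, probs.argmax(axis=0)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3, 4))     # D=5 features on a 3x4 grid
weights = rng.normal(size=(3, 5))         # 3 classes
probs, labels = classify_pixels(features, weights, np.zeros(3))
```

Each pixel is labelled on its own here, which is exactly why the resulting label map can lack spatial consistency.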

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

In the figure, C2 will be labelled with the class of C5.
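The optimal-cover rule can be sketched on a small tree. The tree, the node names and the costs below are hypothetical and do not reproduce the slide's figure; they are only chosen so that one leaf ends up covered by its parent.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node on the leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)          # None once the root has been visited
        best[leaf] = best_node
    return best

# Hypothetical tree: root C7 -> {C5, C6}; C5 -> {C1, C2}; C6 -> {C3, C4}
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.1, "C4": 0.3,
        "C5": 0.4, "C6": 0.5, "C7": 0.8}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

With these made-up costs, leaf C2 is covered by its parent C5 and therefore takes the class predicted for C5, while the other leaves keep their own prediction.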

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm

- Hierarchical segmenter

- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN computes feature vector 1 and feature vector 2 (the latter with the background masked out with the mean image), which are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.
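Masking the background with the mean image can be sketched as follows, assuming an `(H, W, 3)` crop, a binary region mask and a per-channel mean; the numeric values are placeholders.

```python
import numpy as np

def mask_background(crop, region_mask, mean_image):
    """Keep foreground pixels of the crop; replace the background with the mean image."""
    mask = region_mask.astype(bool)[..., None]     # (H, W, 1), broadcast over channels
    return np.where(mask, crop, mean_image)

crop = np.full((2, 2, 3), 200.0)
region_mask = np.array([[1, 0], [0, 1]])
mean_image = np.full(3, 120.0)                     # placeholder per-channel mean
masked = mask_background(crop, region_mask, mean_image)
```

After mean subtraction, the masked-out background becomes zero, so only the region foreground drives the network's activations.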

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: the BBOX CNN computes feature vector 1 and a REGION CNN computes feature vector 2, which are concatenated as [1 2].

Training: 2 networks trained in isolation

Testing: results are combined

Objects Segmentation SDS

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Figure labels: penultimate fully connected layer; SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
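Pixel IU can be computed per class from two label maps. This is a generic sketch of the Jaccard index, not the exact evaluation script used in the paper; skipping classes absent from both maps is a choice made for this example.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Per-class intersection-over-union between two label maps, and their mean."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent in both maps
        ious.append(np.logical_and(p, g).sum() / union)
    return ious, float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
per_class, mean_iu = pixel_iu(pred, gt, num_classes=2)
```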

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 31: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Replace the usage of external object proposals (Selective Search, CPMC, MCG) with a Region Proposal Network (RPN).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Figure labels: conv layers; Conv Layer 5; RPN; RPN proposals; RoI pooling layer; FC layers; class scores; class probabilities

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding box regression

In practice, k = 9 (3 different scales and 3 aspect ratios)
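The k = 9 reference boxes can be generated by combining 3 scales with 3 aspect ratios. The specific scales and ratios below are assumed values for illustration, and the convention that each anchor keeps an area of scale squared is a choice made for this sketch.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x1, y1, x2, y2) centred at the origin."""
    anchors = []
    for s in scales:
        for r in ratios:                 # r = height / width, area kept at s * s
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
```

At every position of the feature map, the RPN would then predict one objectness score pair and one box regression per anchor.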

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Tables: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

[Figure: the visual query "A dog" is run against an image database to produce the expected outcome]

Objects Recognition Retrieval

[Figure: the visual query "This dog" is run against an image database to produce the expected outcome]

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse
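The inverted file mentioned above can be sketched as a mapping from each visual word to the images that contain it. This is a minimal illustration with a hypothetical toy database and a plain vote count as the ranking score; a real system would typically use a weighting such as tf-idf:

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word ID to the sorted list of images containing it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def query(inverted, query_words):
    """Rank images by how many query words they share (a simple vote count)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words observed in that image.
database = {1: [1, 2], 10: [3, 6], 12: [1, 3], 30: [2]}
inverted_file = build_inverted_file(database)
ranking = query(inverted_file, [1, 3])
```

Because the bag-of-words vectors are highly sparse, a query only touches the images listed under its own words instead of the whole database.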

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
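A minimal sketch of sum and max pooling of convolutional activations into a global descriptor, with one value per channel. The nested-list layout, the function names and the toy activations are assumptions for illustration; the cited papers add steps such as weighting, whitening or region pooling that are not shown here:

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each channel's H x W activation map into a single number.

    feature_maps: list of channels, each a list of rows of activations.
    Returns one value per channel, i.e. a global descriptor of length C.
    """
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(value for row in channel for value in row)
            for channel in feature_maps]


def l2_normalize(vector):
    """Scale a descriptor to unit length so dot products act as cosine similarity."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy activations: 2 channels over a 2x2 spatial grid.
conv = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 5.0], [1.0, 0.0]]]
sum_descriptor = pool_conv_features(conv, "sum")
max_descriptor = pool_conv_features(conv, "max")
```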

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
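The pipeline above can be sketched as assigning every local convolutional feature to its nearest centroid and counting the assignments. This is a toy illustration with 3 centroids in a 2-D space instead of 25K centroids, and 4 local features instead of a 42x32 grid; all names and values are hypothetical:

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to a local feature (squared Euclidean)."""
    distances = [sum((f - c) ** 2 for f, c in zip(feature, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def bag_of_words(local_features, centroids):
    """Sparse histogram {visual word: count} over all spatial locations."""
    histogram = {}
    for feature in local_features:
        word = nearest_centroid(feature, centroids)
        histogram[word] = histogram.get(word, 0) + 1
    return histogram


# Toy codebook and toy local features (one per spatial location).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
bow = bag_of_words(features, centroids)
```

Keeping the per-location word assignments, rather than only the histogram, is what makes it possible to build a representation from a sub-region of the image.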

Objects Recognition Retrieval

Query Representation

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

$\theta_i = \theta_0$: filter weights ($H_l$) and biases ($b_l$)

Non-linearity: tanh. Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation
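A minimal sketch of the upsampling and concatenation step, assuming the three scales differ by a factor of 2 and using nearest-neighbour upsampling on single-channel toy maps. Both the factors and the interpolation method are assumptions made for illustration:

```python
def upsample_nearest(feature_map, factor):
    """Nearest-neighbour upsampling of a 2-D map by an integer factor."""
    upsampled = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(factor)]
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def concatenate_scales(maps_per_scale, factors):
    """Bring every scale to the finest resolution and stack them as channels."""
    return [upsample_nearest(m, f) for m, f in zip(maps_per_scale, factors)]


# Toy pyramid with one channel per scale: 4x4, 2x2 and 1x1 maps.
fine = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
mid = [[1, 2], [3, 4]]
coarse = [[7]]
stacked = concatenate_scales([fine, mid, coarse], [1, 2, 4])
```

After this step every pixel of the finest grid carries one feature per scale, so a per-pixel classifier sees both local detail and wider context.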

Objects Segmentation Farabet

Pixel-wise soft-max classifier
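As a rough illustration of a pixel-wise soft-max classifier, the sketch below turns the class scores of each pixel into probabilities and keeps the most likely class. The score values and the `[row][column][class]` layout are hypothetical:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_maps):
    """score_maps[y][x] holds the class scores of one pixel; return argmax labels."""
    labels = []
    for row in score_maps:
        row_labels = []
        for pixel_scores in row:
            probabilities = softmax(pixel_scores)
            row_labels.append(probabilities.index(max(probabilities)))
        labels.append(row_labels)
    return labels


# Toy 1x2 image with 3 classes per pixel.
scores = [[[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]]
labels = label_pixels(scores)
```

Each pixel is labelled independently here, which is exactly why the resulting label map can lack spatial consistency.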

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
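The selection rule above can be sketched as a walk from each leaf to the root that keeps the node with the lowest cost. The tree, the node names and the cost values below are hypothetical, chosen only so that the walk is easy to follow:

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node on its leaf-to-root path with minimal cost.

    parent: dict node -> parent node (the root maps to None).
    cost:   dict node -> cost S of that component (lower is better).
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best


# Hypothetical tree: leaves C1..C3, internal node C4 = {C1, C2}, root C5.
parent = {"C1": "C4", "C2": "C4", "C3": "C5", "C4": "C5", "C5": None}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.1, "C4": 0.5, "C5": 0.3}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

With these toy costs the leaf C2 is covered by C5, so it would take the class predicted for C5, while C1 and C3 keep their own predictions.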

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region (background masked out with the mean image); the two are concatenated as [1 2]

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
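A minimal sketch of masking out the background with the mean image before feature extraction, on a toy grayscale crop. The function name, the list-of-rows layout and the constant mean image are assumptions for illustration:

```python
def mask_background(box_pixels, region_mask, mean_pixels):
    """Keep foreground pixels and replace background pixels with the mean image.

    All three arguments are H x W grids (lists of rows); region_mask holds
    True for pixels inside the region and False for background.
    """
    return [[pixel if inside else mean
             for pixel, inside, mean in zip(p_row, m_row, mean_row)]
            for p_row, m_row, mean_row in zip(box_pixels, region_mask, mean_pixels)]


# Toy 2x3 grayscale crop, its region mask, and a constant mean image.
crop = [[10, 20, 30], [40, 50, 60]]
mask = [[True, True, False], [False, True, False]]
mean_image = [[128, 128, 128], [128, 128, 128]]
masked = mask_background(crop, mask, mean_image)
```

Since networks of this kind usually subtract the mean image from their input, background pixels set to the mean become zeros after that subtraction.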

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; the two are concatenated as [1 2]

Training: the 2 networks are trained in isolation

Testing: results are combined

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer → SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
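The pixel IU (Jaccard index) for one category can be sketched as the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. The label maps below are toy values:

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled with one category."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred, in_gt = pred == category, gt == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    return intersection / union if union else 0.0


# Toy 2x3 label maps with categories 0 (background) and 1.
prediction = [[1, 1, 0], [0, 1, 0]]
annotation = [[1, 0, 0], [0, 1, 1]]
iu = pixel_iu(prediction, annotation, category=1)
```

Averaging this value over all categories gives the mean IU that is commonly reported for semantic segmentation benchmarks.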


One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram labels: Conv layers, Conv Layer 5, RPN, RPN Proposals, RoI pooling layer, FC layers, Class scores, Class probabilities]

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Objectness scores (object / no object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)
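A minimal sketch of how k = 9 anchor boxes could be generated at one location from 3 scales and 3 aspect ratios. The default scale and ratio values are assumptions for illustration, as is the convention that each anchor keeps an area of `scale ** 2`:

```python
def make_anchors(center_x, center_y, scales=(128, 256, 512),
                 aspect_ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(aspect_ratios) boxes (x1, y1, x2, y2).

    Each anchor has area scale**2 and height/width equal to the aspect ratio.
    The default scales and ratios are assumed values for illustration.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale / ratio ** 0.5
            height = scale * ratio ** 0.5
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors


anchors = make_anchors(300, 200)
```

The region proposal network then predicts, for each of these k boxes, an objectness score and a set of bounding box regression offsets.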

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

4-step training to share features between RPN and Fast R-CNN

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)


Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Detection Objects Reinforcement L

Object is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R [figure labels: ground-truth, bounding box]

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost

Trigger reward magnitude: 3

Minimum IoU: 0.6
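A minimal sketch of an IoU-based reward, assuming the form used in the cited Caicedo and Lazebnik paper: a transformation is rewarded with +1 when it increases the overlap with the ground truth and with -1 otherwise, and the trigger is rewarded with +3 when the final IoU reaches 0.6 and with -3 otherwise. The slide text only preserves the values 3 and 0.6, so the exact functional form here is an assumption, as are the function names:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def transformation_reward(old_box, new_box, ground_truth):
    """+1 if the move improved the overlap with the ground truth, else -1."""
    return 1.0 if iou(new_box, ground_truth) > iou(old_box, ground_truth) else -1.0


def trigger_reward(box, ground_truth, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps enough with the ground truth, else -eta."""
    return eta if iou(box, ground_truth) >= min_iou else -eta


# Toy boxes: shrinking the window brings it closer to the ground truth.
ground_truth = (0, 0, 10, 10)
old_box = (0, 0, 20, 20)
new_box = (0, 0, 12, 12)
step_reward = transformation_reward(old_box, new_box, ground_truth)
final_reward = trigger_reward(new_box, ground_truth)
```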

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

- The action-value function is estimated using a neural network that has as many output units as actions
- The algorithm incorporates a replay memory to collect experiences
- Category-specific Q-network

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
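The greedy policy described above can be sketched as an argmax over the outputs of the Q-network. The Q-values below are hypothetical placeholders for a network output, and `build_state` simply concatenates the 4096-D feature vector with the 90-D action history:

```python
def build_state(features, history):
    """State s = (o, h): region features followed by the action history."""
    return list(features) + list(history)


def greedy_action(q_values):
    """Index of the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda action: q_values[action])


# Toy state with the dimensions given in the slides (4096 + 90).
state = build_state([0.0] * 4096, [0] * 90)
# Hypothetical Q-network output, one value per action.
q_values = [0.1, -0.3, 0.7, 0.2]
action = greedy_action(q_values)
```

During training, Q-learning agents usually mix this greedy choice with some exploration, for example by picking a random action with a small probability.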

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): a positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
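Rescaling 3 times per octave means that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch, where the scale range from 0.5 to 2.0 is an assumed example:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale to max_scale, steps_per_octave per doubling."""
    step = 2.0 ** (1.0 / steps_per_octave)
    scales, i = [], 0
    # The small tolerance absorbs floating-point error at the upper bound.
    while min_scale * step ** i <= max_scale * (1 + 1e-9):
        scales.append(min_scale * step ** i)
        i += 1
    return scales


# One octave down and one octave up around the original resolution.
scales = pyramid_scales(0.5, 2.0)
```

Running a fixed-size detector on each rescaled copy lets it find faces that are smaller or larger than its input window.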

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
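A minimal sketch of why a fully-connected layer can be treated as a convolution, using a single-channel toy image and a 2x2 weight grid. On an input of exactly the window size the result is the usual dot product of a fully-connected unit; on a larger input the same weights produce one score per window position. Names and values are illustrative:

```python
def fc_as_convolution(image, weights, bias=0.0):
    """Slide a k x k weight grid over a single-channel image.

    On a k x k input this is exactly the fully-connected output (a dot
    product); on a larger input it yields one score per window position.
    """
    k = len(weights)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    heat_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            score = bias
            for dy in range(k):
                for dx in range(k):
                    score += weights[dy][dx] * image[top + dy][left + dx]
            row.append(score)
        heat_map.append(row)
    return heat_map


# Toy 2x2 "fully-connected" weights applied to a 2x2 patch and to a 3x4 image.
weights = [[1.0, 0.0], [0.0, 1.0]]
single = fc_as_convolution([[1.0, 2.0], [3.0, 4.0]], weights)
heat_map = fc_as_convolution([[1.0, 2.0, 0.0, 1.0],
                              [3.0, 4.0, 1.0, 0.0],
                              [0.0, 1.0, 2.0, 5.0]], weights)
```

The output grid on the larger image plays the role of a heat-map: each cell is the classifier response for one window, computed without cropping every window separately.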

Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
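A minimal sketch of greedy non-maximum suppression: detections are visited from the highest to the lowest score, and a detection is kept only if it does not overlap too much with any detection already kept. The IoU threshold of 0.5 and the toy boxes are assumed values, and this is a generic version rather than the exact procedure of the cited source:

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for index in order:
        if all(box_iou(boxes[index], boxes[k]) <= iou_threshold for k in kept):
            kept.append(index)
    return kept


# Two heavily overlapping detections and one isolated detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
```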

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation


vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 33: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Faster R-CNN

Slide credit Amaia Salvador

[Figure: Faster R-CNN architecture. The conv layers (up to Conv Layer 5) feed the RPN, which outputs RPN proposals; the RoI pooling layer and FC layers then produce class scores and class probabilities.]

RPN outputs: objectness scores (object / no object) and bounding box regression.

In practice k = 9 (3 different scales and 3 aspect ratios)
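The k = 9 figure above can be illustrated with a minimal sketch that enumerates one anchor box per (scale, aspect ratio) pair. The function name `make_anchors` and the numeric defaults (base size 16, scales 8/16/32, ratios 0.5/1/2) are illustrative assumptions, not values given on the slide.

```python
import math


def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x1, y1, x2, y2).

    Anchors are centred on the origin. The default values are assumptions
    chosen for illustration only.
    """
    anchors = []
    for scale in scales:
        area = float(base_size * scale) ** 2  # target area for this scale
        for ratio in ratios:  # ratio = height / width
            w = math.sqrt(area / ratio)
            h = w * ratio  # keeps w * h equal to the target area
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 3 scales x 3 aspect ratios
```

In a sliding-window setting these boxes would be shifted to every position of the feature map; that step is omitted here.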

Detection Objects Faster R-CNN

Slide credit Amaia Salvador

The stage that follows the RPN (RoI pooling layer, FC layers, class scores and class probabilities) is Fast R-CNN.

4-step training to share features for RPN and Fast R-CNN

Detection Objects Faster R-CNN

Slide credit Amaia Salvador

Step 1: Train the RPN, initialized with an ImageNet pre-trained model. Conv layers: ImageNet weights (fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. Conv layers: ImageNet weights (fine-tuned).

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again. Conv layers: weights from Step 2 (fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN with the RPN proposals learned in Step 3, using the same shared convolutional layers as in Step 3. Conv layers: weights from Steps 2 & 3 (fixed).

Detection Objects Faster R-CNN

Slide credit Amaia Salvador

[Tables: Detection accuracy (Pascal VOC); Timing in ms (Pascal VOC)]

Detection Objects Reinforcement

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6.

Slide credit Míriam Bellver

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark

Set of states S: (o, h)
- o = feature vector from a pre-trained CNN (fc6, 4096 dimensions)
- h = history of taken actions (binary vector, 90 dimensions)

Detection Objects Reinforcement

Slide credit Míriam Bellver

Reward function R [equation not preserved in the transcript; the slide figure shows the ground-truth bounding box]

Reward function R for the trigger action [equation not preserved in the transcript; slide annotations: 3, minimum IoU 0.6]

The reward function considers the number of steps as a cost.
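The reward equations themselves did not survive in this transcript. The sketch below is one plausible reading, assuming the formulation commonly attributed to the cited Caicedo and Lazebnik paper: a movement is rewarded by the sign of the change in IoU with the ground-truth box, and the trigger is rewarded with +3 when the final IoU reaches the 0.6 minimum and with -3 otherwise. The function names and the box format (x1, y1, x2, y2) are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the transformation improved the IoU, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """Assumed form: +eta if the final box reaches the minimum IoU, else -eta."""
    return eta if iou(box, gt_box) >= min_iou else -eta


gt = (0, 0, 10, 10)
print(move_reward((0, 0, 5, 5), (0, 0, 8, 8), gt))
print(trigger_reward((0, 0, 5, 5), gt))
```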

Detection Objects Reinforcement

Slide credit Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
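A minimal sketch of the two ingredients named above: the greedy policy over the network outputs, and the standard Q-learning target used to train the action-value estimate. The Q-values are plain Python lists here rather than network outputs, and the discount factor `gamma=0.9` is an assumed placeholder, not a value from the slides.

```python
def greedy_action(q_values):
    """Index of the action with the highest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a').

    For a terminal transition (for example after the trigger action) the
    target is the reward alone.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


q = [0.1, 0.7, 0.2]  # one output unit per action
print(greedy_action(q))
print(q_learning_target(1.0, [0.5, 2.0], gamma=0.5))
```

In a replay-memory setup, transitions (state, action, reward, next state) would be stored and sampled at random to compute these targets; that bookkeeping is left out of the sketch.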

Detection Objects Reinforcement

Slide credit Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Training samples are randomly sampled sub-windows (blocks). A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
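As a rough illustration of "3 times per octave": consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. The function name and the number of octaves covered in each direction are assumptions for this sketch.

```python
def pyramid_scales(octaves_down=1, octaves_up=1, steps_per_octave=3):
    """Scale factors of an image pyramid with a fixed number of steps per octave.

    The number of octaves in each direction is an assumed parameter.
    """
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2.0 ** (i / steps_per_octave) for i in range(first, last + 1)]


for s in pyramid_scales():
    print(round(s, 3))
```

Each scale factor would be applied to the test image before running the detector, so that a fixed-size window covers faces of different sizes.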

Detection Faces DDFD Test

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
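A minimal sketch of greedy non-maximum suppression, written in plain Python for readability. It is not the implementation from the cited tutorial; the box format (x1, y1, x2, y2) and the 0.5 overlap threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first.

    A box is kept only if it does not overlap any already kept box by more
    than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is dropped
```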

Detection Faces DDFD Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
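The slides mention triplets without giving the loss. A minimal sketch of the usual triplet loss on embedding vectors follows: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The margin value 0.2 and the function names are assumptions for illustration, and the selection of "well-chosen" triplets is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin); margin is an assumed value."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


# An easy triplet: the negative is already far away, so the loss is zero.
print(triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 0.0)))
# A hard triplet: the negative is as close as the positive, so the loss equals the margin.
print(triplet_loss((0.0, 0.0), (1.0, 0.0), (1.0, 0.0), margin=0.5))
```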

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

[Figure: a visual query ("A dog") is matched against an image database to produce the expected outcome.]

[Figure: the same setup with the visual query "This dog".]

Instance retrieval (instance: object, building, person, place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

[Table: inverted file with columns "word" and "Image ID", listing for each visual word the IDs of the images that contain it]
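A minimal sketch of an inverted file for bag-of-visual-words retrieval: each visual word maps to the set of image IDs that contain it, so a query only touches images sharing at least one word. The toy image IDs and word IDs are made up for illustration, and ranking by the number of shared words is a simplification (real systems typically use weighted scores).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [1, 2], 10: [3], 12: [1, 3]}
index = build_inverted_file(database)
print(search(index, [1, 3]))
```

Because bag-of-words vectors are highly sparse, most images are never visited during a query, which is what makes the structure scale.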

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
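A minimal sketch of sum or max pooling of convolutional features into a single global descriptor. The feature map is represented as a list of local descriptors, one per spatial location, each with C channels. The final L2 normalisation and the function name are assumptions; the cited papers add further steps (weighting, whitening, region pooling) that are not shown.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Aggregate local conv descriptors into one L2-normalised global vector.

    feature_map: list of local descriptors, each a list of C floats.
    mode: "sum" or "max" pooling over the spatial locations.
    """
    aggregate = sum if mode == "sum" else max
    pooled = [float(aggregate(channel)) for channel in zip(*feature_map)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


# Two spatial locations, two channels.
features = [[3.0, 0.0], [0.0, 4.0]]
print(pool_conv_features(features, mode="sum"))
print(pool_conv_features(features, mode="max"))
```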

Objects Recognition Retrieval

Convolutional Neural Networks

Conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

- Image resolution: (336x256)
- conv5_1 from VGG16 [1]: (42x32)
- 25K centroids: 25K-D vector

Query representation: Global Search (GS) and Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0, where the parameters are the filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Prediction with a 2-layer network

Solution 1: Superpixels

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

Example: C2 will be labelled with the class of C5
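The selection rule above can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree layout, node names and cost values below are made up for illustration (chosen so that C2 ends up covered by C5, as in the slide example); how the cost S is computed is not shown.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node on the leaf-to-root path with minimal cost.

    parent: dict mapping a node to its parent (the root has no entry).
    cost:   dict mapping every node to its cost S.
    """
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = choice
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 into C6, and C5, C6 into C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.5, "C2": 0.6, "C3": 0.2, "C5": 0.3, "C6": 0.4, "C7": 0.9}
print(optimal_cover(parent, cost, ["C1", "C2", "C3"]))
```

Each leaf would then take the class predicted for its selected component.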

Objects Segmentation SDS

Slide credit Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the two are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Objects Segmentation SDS

Slide credit Eduard Fontdevila

Vector B: the BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2; the two are concatenated as [1 2].
- Training: the 2 networks are trained in isolation
- Testing: results are combined

Vector C:
- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit Eduard Fontdevila

The output of the penultimate fully connected layer is fed to an SVM.

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
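A minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. Label maps are nested lists of integer category IDs here; the function name and the toy label maps are assumptions for illustration.

```python
def pixel_iu(pred, gt, category):
    """Pixel intersection over union (Jaccard index) for one category."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = (p == category), (g == category)
            inter += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return inter / union if union else float("nan")


pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [0, 0]]
print(pixel_iu(pred, gt, category=1))
print(pixel_iu(pred, gt, category=0))
```

A dataset-level score is usually obtained by averaging this value over categories; the category is undefined (NaN) when it appears in neither map.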

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 34: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

34

Slide credit Amaia Salvador

Objectness scores(objectno object)

Bounding Box Regression

In practice k = 9 (3 different scales and 3 aspect ratios)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

35

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

36

Slide credit Amaia Salvador

Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

37

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN RPN Proposals

RPN Proposals

Class probabilities

RoI pooling layerFC layersClass scores

4-step training to share features for RPN and Fast R-CNN

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 1 Train RPN initialized with an ImageNet pre-trained model

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 1)

Class probabilities

Step 2 Train Fast R-CNN with learned RPN proposals

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Detection Objects Reinforcement

48

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Detection Objects Reinforcement

49

Slide credit: Míriam Bellver

Set of actions A

Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR).

Detection Objects Reinforcement

50

Slide credit: Míriam Bellver

Set of states S

(o, h)

o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)
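A minimal sketch of how such a state could be assembled. It assumes (from the cited paper, not from the slide) that the 90-dimensional history is the one-hot encoding of the last 10 actions over 9 possible actions. The function names are hypothetical and the fc6 feature is replaced by a placeholder list.

```python
NUM_ACTIONS = 9      # assumption: 8 transformations + 1 trigger
HISTORY_LEN = 10     # assumption: 10 past actions, 9 * 10 = 90
FEATURE_DIM = 4096   # fc6 feature size given on the slide


def encode_history(past_actions):
    """One-hot encode the most recent actions into a flat binary vector."""
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    recent = list(past_actions)[-HISTORY_LEN:]
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """Concatenate the region feature o with the action history h."""
    assert len(o) == FEATURE_DIM
    return list(o) + encode_history(past_actions)


# Placeholder feature vector and two hypothetical past actions.
state = build_state([0.0] * FEATURE_DIM, [2, 8])
```

The resulting vector has 4096 + 90 entries and is what a Q-network would take as input.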

Detection Objects Reinforcement

51

Slide credit: Míriam Bellver

Reward function R

[Figure: ground-truth bounding box]

Detection Objects Reinforcement

52

Slide credit: Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

Trigger reward magnitude: 3

Minimum IoU: 0.6
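A hedged sketch of the two rewards. The values 3 and 0.6 come from the slide. The sign-of-IoU-change rule for transformation actions and the box format (x1, y1, x2, y2) are assumptions based on the cited paper, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """+1 if the transformation improved IoU with the ground truth, else -1.

    Ties are treated as -1 here, which is an assumption of this sketch.
    """
    return 1 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1


def trigger_reward(box, gt_box, eta=3.0, min_iou=0.6):
    """+eta if the final box overlaps the ground truth enough, else -eta."""
    return eta if iou(box, gt_box) >= min_iou else -eta
```

For example, with a ground truth of (0, 0, 10, 10), moving from (5, 5, 15, 15) to (2, 2, 12, 12) increases the overlap and is rewarded with +1.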

Detection Objects Reinforcement

53

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

54

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
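In standard Q-learning notation, the greedy policy and the target that the network is trained to match can be written as follows. The symbols Q, r, s' and the discount factor γ are the usual textbook ones and are assumptions here, since they do not appear on the slide.

```latex
\pi(s) = \arg\max_{a \in A} Q(s, a),
\qquad
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
```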

Detection Objects Reinforcement

55

Slide credit: Míriam Bellver

Detection Objects Reinforcement

56

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Detection Objects Reinforcement

57

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Objects Reinforcement

58

Slide credit: Míriam Bellver

Detection Objects Reinforcement

59

Slide credit: Míriam Bellver

Detection Faces

60

Detection Faces DDFD

61

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

62

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Detection Faces DDFD Train

63

Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
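A small sketch of what "3 times per octave" means for the scale factors: consecutive scales differ by a factor of 2^(1/3). The scale range (0.5 to 2.0) and the function name are hypothetical and only illustrate the spacing.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors spaced by 2**(1/steps_per_octave) between two bounds."""
    octaves = math.log2(max_scale / min_scale)
    n = int(round(octaves * steps_per_octave))
    return [min_scale * 2.0 ** (k / steps_per_octave) for k in range(n + 1)]


# Hypothetical range: from half size to double size (two octaves).
scales = pyramid_scales(0.5, 2.0)
```

Each test image would be resized by every factor in `scales` before running the detector, so that a fixed-size window can match faces of different sizes.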

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
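A toy illustration of the conversion, in pure Python: a fully-connected unit trained on a k x k input is reused as a k x k convolution kernel (cross-correlation, stride 1, no padding), so a larger input yields a map of responses instead of a single value. The weights, bias and image are hypothetical.

```python
def fc(window, weights, bias):
    """Fully-connected unit applied to a flattened k x k window."""
    flat = [v for row in window for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, k):
    """Slide the same weights over the image as a k x k convolution."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            window = [image[i + di][j:j + k] for di in range(k)]
            row.append(fc(window, weights, bias))
        out.append(row)
    return out


weights, bias = [1.0, 0.0, 0.0, -1.0], 0.5   # hypothetical 2x2 FC unit
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
heat_map = fc_as_conv(image, weights, bias, k=2)
```

Each entry of `heat_map` equals the FC output on the corresponding crop, which is why the converted network produces a heat map on images larger than its training size.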

Detection Faces DDFD Test

67

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
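A minimal greedy NMS sketch in pure Python. It is not the implementation from the cited post or from DDFD. The box format (x1, y1, x2, y2) and the 0.5 overlap threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one isolated detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so only the first and third detections survive.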

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

Faces Recognition FaceNet

72

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

73

[Figure: faces are mapped by FaceNet to a Euclidean space where distances correspond to face similarity]

Faces Recognition FaceNet

74

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

75

... by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
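A sketch of the triplet loss that such triplets (anchor, positive, negative) are typically trained with. The squared Euclidean distance and the margin value of 0.2 are assumptions taken from the FaceNet paper rather than from the slide, and the embeddings below are toy values.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: the positive must be closer to the anchor than the
    negative by at least the margin, otherwise the triplet is penalized."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Easy triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
# Hard triplet: positive and negative are equally far from the anchor.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```

Easy triplets contribute no gradient, which is why the choice of triplets matters for training.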

Faces Recognition FaceNet

76

Faces Recognition FaceNet

77

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

78

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet

79

Faces Recognition FaceNet Test

80

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

81

Software implementation: OpenFace

Faces Recognition VGG Face

82

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

83

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

84

Image Database

Visual Query: "A dog"

Expected outcome

Objects Recognition Retrieval

85

Image Database

Visual Query: "This dog"

Expected outcome

Objects Recognition Retrieval

86

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

87

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

[Table: INVERTED FILE, with columns "word" and "Image ID"]
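A minimal sketch of an inverted file for bag-of-visual-words retrieval: each visual word points to the images that contain it, so a query only touches the lists of its own words. The word and image IDs are hypothetical, and ranking by shared-word count is a simplification (real systems usually weight words, for example with tf-idf).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return index


def query(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))


# Hypothetical database: image ID -> visual words found in that image.
index = build_inverted_file({1: [1, 2], 2: [2, 3], 3: [4]})
ranking = query(index, [2, 3])
```

Because the bag-of-words vectors are highly sparse, most images are never visited for a given query.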

Objects Recognition Retrieval

88

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

89

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

90

Convolutional Neural Networks: sum/max-pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
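A toy sketch of sum and max pooling of a convolutional feature map into one global descriptor, with one value per channel. The layout `feature_map[channel][y][x]` and the numbers are assumptions for illustration. The cited methods add further steps (normalization, weighting, regions) that are not shown.

```python
def pool_conv_features(feature_map, mode="sum"):
    """Collapse each channel's spatial grid to a single value."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row)
            for channel in feature_map]


# Hypothetical feature map with 2 channels of size 2x2.
feature_map = [[[1, 2], [3, 4]],
               [[0, 5], [1, 0]]]
sum_descriptor = pool_conv_features(feature_map, mode="sum")
max_descriptor = pool_conv_features(feature_map, mode="max")
```

The descriptor length equals the number of channels, regardless of the image size.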

Objects Recognition Retrieval

91

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

92

Objects Recognition Retrieval

93

[Figure: image at resolution 336x256, conv5_1 from VGG16 [1] gives a 42x32 feature map, 25K centroids, 25K-D vector]

Objects Recognition Retrieval

94

Query Representation

Global Search (GS)

Local Search (LS)

Objects Recognition Retrieval

95

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

Objects Segmentation

97

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters are used in the three convnets:

theta_i = theta_0: filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier
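A small sketch of a pixel-wise soft-max: the class scores at every pixel are normalized independently into probabilities, and the label map takes the most probable class per pixel. The layout `scores[y][x]` as a list of class scores and the values are hypothetical.

```python
import math


def pixelwise_softmax(scores):
    """Turn per-pixel class scores into per-pixel class probabilities."""
    out = []
    for row in scores:
        out_row = []
        for s in row:
            m = max(s)                      # subtract max for stability
            e = [math.exp(v - m) for v in s]
            z = sum(e)
            out_row.append([v / z for v in e])
        out.append(out_row)
    return out


def label_map(probs):
    """Pick the most probable class index at every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row]
            for row in probs]


# Hypothetical 1x2 image with 2 classes.
probs = pixelwise_softmax([[[0.0, 0.0], [2.0, 0.0]]])
labels = label_map(probs)
```

Since every pixel is classified on its own, nothing forces neighbouring pixels to agree.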

Objects Segmentation Farabet

103

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1: Superpixels

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
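A sketch of the selection rule stated above: walk from each leaf to the root of the segmentation tree and keep the node with the smallest cost. The tree, node names and costs are hypothetical, and the cost S is taken as given (how it is computed is not shown on the slide).

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with
    minimal cost (the leaf itself is a candidate)."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)   # None once the root is passed
        best[leaf] = best_node
    return best


# Hypothetical tree: C7 is the root, C5 groups pixels p1 and p2.
parent = {"p1": "C5", "p2": "C5", "C5": "C7", "p3": "C7"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.3, "C5": 0.2, "C7": 0.5}
cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

Here p1 and p2 are covered by C5 and would take its class, while p3 is best described at its own level.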

Objects Segmentation SDS

109

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS

110

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit: Eduard Fontdevila

[Figure: the BBOX CNN gives feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the concatenation [1 2] is vector A]

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Objects Segmentation SDS

112

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Figure: the BBOX CNN gives feature vector 1 and the REGION CNN gives feature vector 2; the concatenation [1 2] is vector B]

Objects Segmentation SDS

113

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

[Figure: vector C]

Objects Segmentation SDS

114

Slide credit: Eduard Fontdevila

[Figure: penultimate fully connected layer, SVM]

Objects Segmentation SDS

115

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
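A minimal sketch of the pixel IU (Jaccard index) for one category: the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. The label maps and the function name are hypothetical.

```python
def pixel_iu(pred, gt, category):
    """Jaccard index of one category between two equal-size label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt
            union += in_pred or in_gt
    return inter / union if union else float("nan")


# Hypothetical 2x2 prediction and ground truth with categories 0 and 1.
pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [0, 0]]
iu_cat1 = pixel_iu(pred, gt, 1)
iu_cat0 = pixel_iu(pred, gt, 0)
```

A benchmark score is usually the mean of this value over all categories.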

Objects Segmentation SDS

117

Slide credit: Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels "person", "bag", "me" and "my bag"]

Thank you

119

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 35: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Faster R-CNN

35

Slide credit: Amaia Salvador

[Figure: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

Detection Objects Faster R-CNN

36

Slide credit: Amaia Salvador

Fast R-CNN

Detection Objects Faster R-CNN

37

Slide credit: Amaia Salvador

[Figure: conv layers, Conv Layer 5, RPN, RPN proposals, RoI pooling layer, FC layers, class scores, class probabilities]

4-step training to share features between the RPN and Fast R-CNN

Detection Objects Faster R-CNN

38

Slide credit: Amaia Salvador

[Figure: conv layers, Conv Layer 5, RPN, RPN proposals]

Step 1: Train the RPN, initialized with an ImageNet pre-trained model.

ImageNet weights (fine-tuned)

Detection Objects Faster R-CNN

39

Slide credit: Amaia Salvador

[Figure: conv layers, Conv Layer 5, RPN proposals (learned in Step 1), class probabilities]

Step 2: Train Fast R-CNN with the learned RPN proposals.

ImageNet weights (fine-tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Faces Recognition: FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition: FaceNet: Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition: FaceNet: Software

Software implementation: OpenFace

Faces Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]

Objects Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image Database + Visual Query ("A dog") → Expected outcome

Image Database + Visual Query ("This dog") → Expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

[Table: columns "word" and "Image ID"; each visual word lists the IDs of the images in which it appears]
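A minimal sketch of how an inverted file can serve sparse bag-of-visual-words vectors: each visual word points to the images that contain it, so a query only touches images sharing at least one word. The function names, the vote-counting ranking and the toy word assignments are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it.

    image_words: dict {image_id: list of visual-word indices}."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def candidates(index, query_words):
    """Rank database images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy database (hypothetical word assignments).
index = build_inverted_file({1: [1, 2, 3], 2: [2, 3], 10: [3]})
ranking = candidates(index, [2, 3])
```

A real system would typically weight the votes (for example with tf-idf) rather than count shared words; plain counting is used here only to keep the sketch short.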

Objects Recognition: Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max-pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
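As a rough illustration of pooled convolutional features, the sketch below collapses each channel of a feature map into a single value (sum or max over all spatial positions) and L2-normalizes the result. The nested-list layout `[channel][row][column]`, the normalization step and the function name are assumptions for this example; the cited papers add further steps that are not shown.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Aggregate a conv feature map into one global descriptor.

    feature_maps: nested list indexed as [channel][row][column].
    Returns one value per channel, L2-normalized."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    descriptor = [reduce_fn(v for row in channel for v in row)
                  for channel in feature_maps]
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor


# Toy feature map with 2 channels of size 2x2 (hypothetical activations).
feature_maps = [[[1.0, 2.0], [3.0, 4.0]],
                [[0.0, 0.0], [0.0, 5.0]]]
sum_descriptor = pool_conv_features(feature_maps, "sum")
max_descriptor = pool_conv_features(feature_maps, "max")
```

The descriptor length equals the number of channels, whatever the spatial size of the input image.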

Objects Recognition: Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Image resolution: 336x256

conv5_1 from VGG16 [1]: feature map of 42x32

25K centroids → 25K-D vector
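The mapping from local convolutional features to a bag-of-words vector can be sketched as follows: every spatial position of the feature map (42x32 = 1344 positions on the slide) is assigned to its nearest centroid, and the assignments are counted into a histogram with one bin per centroid. The function names and the toy 2-D features and centroids are assumptions for illustration; the real setting uses 25K centroids.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )


def bag_of_words(local_features, centroids):
    """Histogram of visual words: one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Toy example with 3 centroids and 4 local features (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
local_features = [[0.1, 0.0], [0.9, 1.2], [1.1, 1.0], [4.0, 6.0]]
histogram = bag_of_words(local_features, centroids)
```

With 1344 local features spread over 25K bins, most bins stay empty, which is what makes the representation sparse enough for an inverted file.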

Objects Recognition: Retrieval: Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
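A pixel-wise soft-max classifier can be sketched as below: each pixel has one score per category, the soft-max turns the scores into probabilities, and the pixel receives the most probable label. The score-map layout `[row][column][category]` and the function names are assumptions for this example.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Assign to every pixel the category with the highest soft-max probability.

    score_map: nested list indexed as [row][column][category]."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probabilities = softmax(pixel_scores)
            label_row.append(probabilities.index(max(probabilities)))
        labels.append(label_row)
    return labels


# Toy 1x2 image with 3 categories (hypothetical scores).
score_map = [[[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]]]
labels = label_pixels(score_map)
```

Each pixel is labelled independently of its neighbours, so nothing in this step enforces spatially consistent labels.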

Objects Segmentation: Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Solution 1: Superpixels (prediction with a 2-layer network)

Solution 2: Superpixels + CRF (prediction with a 2-layer network)

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
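The selection rule above can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree, the component names other than C2 and C5, and all cost values are hypothetical and chosen only so that C2 ends up covered by C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the component on its leaf-to-root path with minimal cost.

    parent: dict {node: parent node}; the root has no entry.
    cost: dict {node: cost S of the component}."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C1 and C2 merge into C5, C3 into C6, C5 and C6 into the root C7.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.5, "C2": 0.6, "C3": 0.2, "C5": 0.3, "C6": 0.4, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

In this toy tree C1 and C2 are both covered by C5, while C3 keeps its own label because no ancestor has a lower cost.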

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the BBOX CNN produces feature vector 1 and feature vector 2 (region with the background masked out with the mean image), concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: the BBOX CNN produces feature vector 1 and a separate REGION CNN produces feature vector 2, concatenated as [1 2]. Training: the 2 networks are trained in isolation. Testing: results are combined.

Vector C: training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer → SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
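The pixel IU (Jaccard index) of one category can be computed as the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below assumes label maps stored as nested lists of integer category IDs; the function name and the toy maps are illustrative.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in two label maps."""
    intersection = 0
    union = 0
    for predicted_row, truth_row in zip(predicted, ground_truth):
        for p, t in zip(predicted_row, truth_row):
            in_prediction = p == category
            in_truth = t == category
            if in_prediction and in_truth:
                intersection += 1
            if in_prediction or in_truth:
                union += 1
    return intersection / union if union else 0.0


# Toy 2x2 label maps (hypothetical): category 1 is an object, 0 is background.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
iu_object = pixel_iu(predicted, ground_truth, 1)
iu_background = pixel_iu(predicted, ground_truth, 0)
```

Averaging the per-category values gives a single score for a whole labelling.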

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 36: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects: Faster R-CNN

Slide credit: Amaia Salvador

[Diagram labels: Fast R-CNN; conv layers; Conv Layer 5; RPN; RPN proposals; RoI pooling layer; FC layers; class scores; class probabilities]

4-step training to share features for RPN and Fast R-CNN:

Step 1: Train RPN initialized with an ImageNet pre-trained model. (ImageNet weights, fine-tuned)

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1. (ImageNet weights, fine-tuned)

Step 3: The model trained in Step 2 is used to initialize RPN, which is trained again. (Weights from Step 2, fixed)

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3, with the RPN proposals learned in Step 3. (Weights from Steps 2 & 3, fixed)

[Tables: detection accuracy (Pascal VOC); timing in ms (Pascal VOC)]

Detection Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

The object is localized based on visual features from AlexNet FC6

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

o = feature vector from the pre-trained CNN (fc6), 4096 dimensions

h = history of taken actions, binary vector of 90 dimensions
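A minimal sketch of how such a state could be assembled. It assumes that the 90-dimensional history vector is built from one-hot encodings of the last 10 actions over 9 possible actions; this 10 x 9 split, the constant names and the function names are assumptions for illustration and are not stated on the slide.

```python
NUM_ACTIONS = 9      # assumed number of possible actions
HISTORY_LENGTH = 10  # assumed number of remembered actions (9 * 10 = 90)


def history_vector(past_actions):
    """Binary vector with one one-hot slot per remembered action."""
    history = [0] * (NUM_ACTIONS * HISTORY_LENGTH)
    for slot, action in enumerate(past_actions[-HISTORY_LENGTH:]):
        history[slot * NUM_ACTIONS + action] = 1
    return history


def make_state(o, past_actions):
    """State (o, h): CNN feature vector followed by the action history."""
    return list(o) + history_vector(past_actions)


# Toy example: a zero feature vector of 4096 dimensions and three past actions.
state = make_state([0.0] * 4096, [0, 3, 8])
```

Under these assumptions the state has 4096 + 90 = 4186 dimensions.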

Detection Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Reward function R (defined with respect to the ground-truth bounding box)

Reward function R for the trigger action

The reward function considers the number of steps as a cost

Trigger reward magnitude: 3

Minimum IoU: 0.6
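The sketch below shows one way rewards of this kind could be computed from the IoU with the ground-truth box. It assumes boxes given as `(x1, y1, x2, y2)`, a reward of +1 or -1 for a transformation depending on whether the IoU improved, and a reward of +3 or -3 for the trigger depending on whether the IoU reaches 0.6. The function names and the exact form of the movement reward are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def move_reward(previous_box, new_box, ground_truth):
    """+1 if the transformation increased the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, ground_truth) > iou(previous_box, ground_truth) else -1.0


def trigger_reward(box, ground_truth, magnitude=3.0, min_iou=0.6):
    """Reward for ending the search: positive only if the box overlaps enough."""
    return magnitude if iou(box, ground_truth) >= min_iou else -magnitude


# Toy boxes (hypothetical coordinates).
ground_truth = (0, 0, 10, 10)
half_box = (0, 0, 10, 5)     # IoU 0.5 with the ground truth
better_box = (0, 0, 10, 8)   # IoU 0.8 with the ground truth
```

With these boxes, moving from `half_box` to `better_box` is rewarded, and triggering on `half_box` is penalized because its IoU is below the minimum.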

Detection Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

- The action-value function is estimated using a neural network (Q-network) that has as many output units as actions
- The algorithm incorporates a replay memory to collect experiences
- The Q-network is category-specific

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
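The greedy policy and the replay memory can be sketched in a few lines. The Q-network itself is not shown: `q_values` stands for its output, one value per action. The class and function names, the memory capacity and the toy numbers are assumptions for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of past experiences (state, action, reward, next state)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch of stored experiences for training the Q-network."""
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))


def greedy_action(q_values):
    """Index of the action with the maximum estimated action value."""
    return max(range(len(q_values)), key=lambda action: q_values[action])


# Toy usage (hypothetical values).
memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.add(state=step, action=0, reward=1.0, next_state=step + 1)
chosen = greedy_action([0.1, 0.7, 0.2])
```

During training an agent usually mixes this greedy choice with some exploration; only the greedy part named on the slide is shown here.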

Detection Objects: Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)

2) Terminal Regions (TR)

Best performance with few region proposals

Detection Faces

Detection Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces: DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative

Detection Faces: DDFD: Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
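Rescaling 3 times per octave means that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. The sketch below lists such scale factors between two bounds; the function name and the bounds are assumptions for illustration.

```python
import math


def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors 2^(k / per_octave) that lie between min_scale and max_scale."""
    k_min = math.ceil(per_octave * math.log2(min_scale))
    k_max = math.floor(per_octave * math.log2(max_scale))
    return [2 ** (k / per_octave) for k in range(k_min, k_max + 1)]


# One octave down and one octave up around the original size (hypothetical bounds).
scales = pyramid_scales(0.5, 2.0)
```

Each rescaled copy is then scanned with the same fixed-size detector, so a face of any size matches the detector window at some scale.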

Detection Faces: DDFD: Test

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
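A sliding window can be sketched as a generator of fixed-size boxes that steps across the image. The window size of 227 comes from the slide; the stride of 32 pixels, the function name and the toy image sizes are assumptions for illustration.

```python
def sliding_windows(width, height, size=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of a fixed size that fit inside the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)


# A 227x227 image holds exactly one window; a slightly larger one holds a few.
single = list(sliding_windows(227, 227))
several = list(sliding_windows(291, 259))
```

Evaluating a network independently on every window repeats a lot of computation on overlapping pixels.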

Detection Faces: DDFD: Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
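The conversion can be illustrated on a toy scale: a fully-connected unit that takes a k x k patch is the same computation as a k x k convolution filter, and sliding that filter over a larger input yields one score per position, that is, a heat-map. The sketch assumes a single input channel and a single output unit; the function name and the numbers are illustrative.

```python
def fc_as_conv(image, weights, bias=0.0):
    """Apply a fully-connected unit defined on a k x k patch as a convolution.

    image: 2-D list (rows of pixel values); weights: k x k list.
    Returns a 2-D heat-map with one score per valid patch position."""
    k = len(weights)
    height, width = len(image), len(image[0])
    return [
        [
            bias + sum(weights[i][j] * image[y + i][x + j]
                       for i in range(k) for j in range(k))
            for x in range(width - k + 1)
        ]
        for y in range(height - k + 1)
    ]


# Toy 2x2 "fully-connected" weights (hypothetical values).
weights = [[1.0, 0.0], [0.0, 1.0]]
patch_score = fc_as_conv([[1.0, 2.0], [4.0, 5.0]], weights)   # input of the training size
heat_map = fc_as_conv([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0]], weights)              # larger input
```

On an input of the original size the output is a single score, exactly as the fully-connected unit would give; on a larger input the same weights produce a map of scores.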

Detection Faces: DDFD: Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
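A minimal sketch of greedy non-maximum suppression: detections are visited from the highest to the lowest score, and a detection is kept only if it does not overlap too much with one already kept. The IoU threshold of 0.3, the box format `(x1, y1, x2, y2)` and the function names are assumptions for illustration; this is not the code from the cited tutorial.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.3):
    """Keep the best-scoring box of every group of overlapping detections.

    detections: list of (box, score) pairs."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Two overlapping detections of the same face and one separate detection (toy values).
detections = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
kept = non_maximum_suppression(detections)
```

The second box overlaps the first with an IoU of about 0.68, so only the higher-scoring one survives.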

Detection Faces: DDFD: Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)
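As a reminder of what the pixel IU (Jaccard index) measures, here is a minimal NumPy sketch for two label maps. The function name `pixel_iu` and the toy arrays are illustrative assumptions, not taken from the SDS evaluation code.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A class absent from both maps has no defined IU.
        ious.append(inter / union if union else float("nan"))
    return ious

pred = np.array([[0, 0, 1], [0, 1, 1]])  # hypothetical predicted labels
gt = np.array([[0, 1, 1], [0, 1, 1]])    # hypothetical ground truth
ious = pixel_iu(pred, gt, num_classes=2)
```

Averaging the per-class values gives a mean IU over categories.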

One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation.

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection: Objects (Faster R-CNN)

Slide credit: Amaia Salvador

Pipeline: conv layers (up to Conv Layer 5) -> RPN -> RPN proposals -> RoI pooling layer -> FC layers -> class scores (class probabilities).

4-step training to share features between the RPN and Fast R-CNN:

Step 1: Train the RPN, initialized with an ImageNet pre-trained model (ImageNet weights, fine-tuned).

Step 2: Train Fast R-CNN with the RPN proposals learned in Step 1 (ImageNet weights, fine-tuned).

Step 3: Use the model trained in Step 2 to initialize the RPN and train it again (weights from Step 2, fixed).

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3 (weights from Steps 2 & 3, fixed).

Results: detection accuracy (Pascal VOC) and timing in ms (Pascal VOC).

Detection: Objects (Reinforcement Learning)

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Slide credit: Míriam Bellver

The object is localized based on visual features from AlexNet FC6.

Set of actions A:

- Transformation actions.
- Trigger action: terminates the sequence of the current search and marks the region with an inhibition-of-return (IoR) mark.

Set of states S: each state is a pair (o, h)

o = feature vector from the pre-trained CNN (fc6, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)
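A state of this form can be assembled as below. This is a sketch under assumptions: the 90 dimensions are taken to be 10 past actions, each one-hot encoded over 9 possible actions (8 transformations plus the trigger), which is consistent with the slide's figure but not stated on it. The fc6 vector is a zero placeholder.

```python
import numpy as np

NUM_ACTIONS = 9    # assumption: 8 transformation actions + 1 trigger
HISTORY_LEN = 10   # assumption: 9 x 10 = 90, matching the slide

def encode_history(past_actions):
    """One-hot encode the most recent actions into a flat binary vector."""
    h = np.zeros((HISTORY_LEN, NUM_ACTIONS), dtype=np.float32)
    for slot, action in enumerate(past_actions[-HISTORY_LEN:]):
        h[slot, action] = 1.0
    return h.ravel()

def build_state(o, past_actions):
    """Concatenate the visual feature o with the action history h."""
    return np.concatenate([o, encode_history(past_actions)])

o = np.zeros(4096, dtype=np.float32)   # placeholder for an fc6 feature vector
state = build_state(o, [0, 3, 8])      # three actions taken so far
```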

Reward function R: defined with respect to the ground-truth bounding box.

Reward function R for the trigger action: the reward function considers the number of steps as a cost.

Values shown on the slide: 3, and a minimum IoU of 0.6.
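The two rewards can be sketched as follows. The form used here is an assumption based on the cited paper rather than on the slide text: a transformation is rewarded +1 if it improves the IoU with the ground truth and -1 otherwise, and the trigger is rewarded +3 if the final IoU reaches 0.6 and -3 otherwise. Boxes are (x1, y1, x2, y2) tuples and all names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def transform_reward(old_box, new_box, gt):
    """+1 if the transformation moved the box closer to the ground truth, else -1."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth enough, else -eta."""
    return eta if iou(box, gt) >= tau else -eta

gt = (0, 0, 10, 10)  # hypothetical ground-truth box
```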

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
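The two ingredients named here, a replay memory and a greedy policy over the Q-network outputs, can be sketched independently of any deep learning framework. The class and function names are hypothetical, and the Q-values are passed in as a plain array instead of being produced by a network.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random minibatch used to update the Q-network."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def greedy_action(q_values):
    """Pick the action whose estimated action value is highest."""
    return int(np.argmax(q_values))

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=0, reward=1.0, next_state=step + 1)
```

During training an exploration strategy such as epsilon-greedy is typically used on top of the greedy choice.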

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection: Faces (DDFD)

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Training

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr; 380k manually annotated facial landmarks.

Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.

Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
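Assuming "3 times per octave" means three scale steps for every factor of two, the set of rescaling factors can be generated as below. The function name and the choice of one octave in each direction are illustrative.

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) covering the requested octaves."""
    lo = -octaves_down * steps_per_octave
    hi = octaves_up * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(lo, hi + 1)]

# One octave down and one octave up around the original resolution.
scales = pyramid_scales(octaves_down=1, octaves_up=1)
```

Each factor would be applied to the test image before running the detector, so that a fixed-size window matches faces of different sizes.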

A sliding window of 227x227 is run over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
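The equivalence behind this conversion can be checked numerically on a toy example. The sketch below uses NumPy only and hypothetical sizes: the weights of a fully-connected layer are reshaped into kernels covering the whole training-size feature patch, and sliding those kernels over a larger feature map yields one output vector per position, which is the heat-map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, OUT = 2, 3, 3, 4   # toy sizes (hypothetical): channels, patch size, FC outputs
fc_weights = rng.standard_normal((OUT, C * H * W))

def fc_layer(feature_patch):
    """Original fully-connected layer: works only on a C x H x W patch."""
    return fc_weights @ feature_patch.ravel()

def fc_as_conv(feature_map):
    """Same weights applied as a sliding H x W convolution over a larger map."""
    kernels = fc_weights.reshape(OUT, C, H, W)
    rows = feature_map.shape[1] - H + 1
    cols = feature_map.shape[2] - W + 1
    heat = np.zeros((OUT, rows, cols))
    for y in range(rows):
        for x in range(cols):
            window = feature_map[:, y:y + H, x:x + W]
            heat[:, y, x] = np.tensordot(kernels, window, axes=3)
    return heat

feature_map = rng.standard_normal((C, 5, 6))   # larger than the training patch
heat = fc_as_conv(feature_map)
```

Each spatial position of `heat` equals the fully-connected output for the corresponding window, without recomputing the earlier layers per window.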

Non-Maximum Suppression (NMS) is applied to avoid overlapping detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
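A common greedy form of NMS is sketched below with NumPy. It is not the code from the cited tutorial or from DDFD; the IoU threshold of 0.5 and the example boxes are assumptions chosen for illustration.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    boxes = np.asarray(boxes, dtype=float)     # rows are (x1, y1, x2, y2)
    order = np.argsort(scores)[::-1]           # indices sorted by decreasing score
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]     # survivors go to the next round
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]  # hypothetical detections
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box is kept.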

Detection: Faces (DDFD) Results

Precision vs. recall curves:

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation.

Faces: Recognition (FaceNet)

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning):

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

By means of well-chosen triplets, using curriculum learning:

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
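Training on triplets (anchor, positive, negative) is usually expressed with a triplet loss: the anchor should be closer to the positive than to the negative by some margin. The sketch below is a generic NumPy version; the margin of 0.2 and the 2-D embeddings are assumptions for illustration and do not come from the slides.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared Euclidean distances between embeddings."""
    d_pos = np.sum((anchor - positive) ** 2)   # same identity: should be small
    d_neg = np.sum((anchor - negative) ** 2)   # different identity: should be large
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
easy_negative = np.array([0.0, 1.0])   # far from the anchor: contributes no loss
hard_negative = np.array([1.0, 0.1])   # close to the anchor: violates the margin
easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Only triplets that violate the margin produce a gradient, which is why the choice of triplets matters.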

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Test: LFW 99.63% (new record); YouTube Faces DB 95.12%

Software implementation: OpenFace

Faces: Recognition (VGG Face)

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects: Recognition (Retrieval)

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Given an image database and a visual query, the expected outcome depends on what the query means: "a dog" (any dog) or "this dog" (this particular dog).

Instance retrieval (instance: object, building, person, place…)

Classic pipeline: local hand-crafted features (e.g. SIFT) in an N-dimensional feature space are quantized into a Bag of Visual Words, v1 = (v11, …, v1n), …, vk = (vk1, …, vkn). The resulting vectors are high-dimensional and highly sparse, so they are indexed with an inverted file that maps each visual word to the IDs of the images containing it.
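An inverted file can be sketched with plain dictionaries: each visual word points to the set of images that contain it, so a query only touches images sharing at least one word with it. The image IDs, word IDs and function names below are made up for illustration, and ranking by the number of shared words is a simplification of real scoring schemes.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted

def candidates(inverted, query_words):
    """Rank images by how many distinct query words they share (ties by ID)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))

# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 6], 12: [1, 3], 30: [2]}
inverted = build_inverted_file(image_words)
ranked = candidates(inverted, [1, 3])
```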

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

FC layers as global feature representation:

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Sum/max-pooled conv features as global representation:

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
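Sum or max pooling turns a convolutional feature map into one vector per image by collapsing the spatial dimensions. The sketch below is a generic NumPy illustration, not the exact procedure of any of the cited papers; the final L2 normalization is an added assumption, commonly used so that descriptors can be compared with a dot product.

```python
import numpy as np

def global_descriptor(conv_features, mode="sum"):
    """Pool a (channels, height, width) feature map into one vector per image."""
    flat = conv_features.reshape(conv_features.shape[0], -1)
    pooled = flat.sum(axis=1) if mode == "sum" else flat.max(axis=1)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled   # L2 normalization (assumption)

features = np.arange(8, dtype=float).reshape(2, 2, 2)  # toy map: 2 channels, 2x2
sum_desc = global_descriptor(features, mode="sum")
max_desc = global_descriptor(features, mode="max")
```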

Conv features encoded with VLAD as global representation:

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Bags of local convolutional features: image at 336x256 resolution -> conv5_1 from VGG16 [1] (42x32 feature map) -> 25K centroids -> 25K-D vector.

Query representation: Global Search (GS) and Local Search (LS).

One lecture organized in four parts (local analysis): Proposals, Detection, Recognition, Segmentation.

Objects: Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects: Segmentation (Farabet)

Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

- Pyramid of three spatial scales.
- The same parameters are used in the three convnets: theta_i = theta_0, where the parameters are the filter weights (H_l) and biases (b_l).
- Non-linearity: tanh. Pooling: max.
- Upsampling and concatenation.
- Pixel-wise soft-max classifier.
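A pixel-wise soft-max applies the usual soft-max independently at every spatial position, across the class dimension. A minimal NumPy sketch, with a hypothetical score tensor of 2 classes over a 1x2 image:

```python
import numpy as np

def pixelwise_softmax(scores):
    """scores: (classes, height, width) -> class probabilities of the same shape."""
    shifted = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

scores = np.array([[[2.0, 0.0]],    # class 0 scores for the two pixels
                   [[0.0, 0.0]]])   # class 1 scores for the two pixels
probs = pixelwise_softmax(scores)
labels = probs.argmax(axis=0)       # most likely class per pixel
```

Because every pixel is labelled on its own, neighbouring pixels are free to disagree.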

Problem: no spatial consistency among labels.

Page 38: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

38

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 1 Train RPN initialized with an ImageNet pre-trained model

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

39

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 1)

Class probabilities

Step 2 Train Fast R-CNN with learned RPN proposals

ImageNet weights(fine tuned)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

40

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rsRPN RPN Proposals

Step 3 The model trained in 2 is used to initialize RPN and train again

Weights from Step 2(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection: Faces

Detection: Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection: Faces: DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Detection: Faces: DDFD: Train

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
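The labeling rule above can be sketched as follows. This is an illustrative sketch, assuming boxes are given as (x1, y1, x2, y2) corner tuples; the function names are hypothetical and not taken from the DDFD software.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """1 (positive) if the window overlaps any annotated face with IoU > threshold, else 0."""
    return 1 if any(iou(window, face) > threshold for face in faces) else 0
```

With 200K positives against 20M negatives, the resulting training set is heavily imbalanced, which is typical for sliding-window detectors.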

Detection: Faces: DDFD: Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
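One way to read "3 times per octave" is that consecutive pyramid levels differ by a factor of 2^(1/3), so that three steps double (or halve) the image size. The sketch below lists the scale factors under that assumption; `pyramid_scales` and its arguments are hypothetical names.

```python
import math


def pyramid_scales(min_scale, max_scale, scales_per_octave=3):
    """Scale factors 2**(k / scales_per_octave) that lie in [min_scale, max_scale]."""
    # Small epsilons guard against floating-point error at the interval ends.
    k_min = math.ceil(math.log2(min_scale) * scales_per_octave - 1e-9)
    k_max = math.floor(math.log2(max_scale) * scales_per_octave + 1e-9)
    return [2 ** (k / scales_per_octave) for k in range(k_min, k_max + 1)]
```

For example, `pyramid_scales(0.5, 2.0)` spans two octaves and therefore returns seven scale factors, from 0.5 up to 2.0.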

Detection: Faces: DDFD: Test

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
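A sliding window can be enumerated with two nested loops. The window size of 227 comes from the slide; the stride of 32 pixels is an assumption chosen for the example, not a value stated in the source.

```python
def sliding_windows(width, height, size=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of size x size that fit inside the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)
```

Images smaller than the window yield no boxes, which is one reason the test image is also rescaled to several sizes.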

Detection: Faces: DDFD: Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

Detection: Faces: DDFD: Test

This makes it possible to:
- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
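The equivalence behind this conversion can be shown on a toy case: a fully-connected unit trained on k x k inputs is the same dot product as a k x k convolution kernel, so sliding it over a larger image yields one score per location, that is, a heat-map. The sketch below is a pure-Python illustration with a single unit, stride 1 and hypothetical function names; a real network applies the same idea layer by layer.

```python
def fc_score(window, weights, bias):
    """Fully-connected unit applied to one k x k window (lists of rows)."""
    return bias + sum(w * v
                      for w_row, v_row in zip(weights, window)
                      for w, v in zip(w_row, v_row))


def conv_heatmap(image, weights, bias):
    """The same weights used as a k x k kernel slid over the whole image (stride 1)."""
    k = len(weights)
    h, w = len(image), len(image[0])
    return [[fc_score([row[x:x + k] for row in image[y:y + k]], weights, bias)
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]
```

On an input of exactly k x k the heat-map has a single entry equal to the original fully-connected output; on larger inputs each entry is the detector response at that position.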

Detection: Faces: DDFD: Test

Non-Maximum Suppression (NMS) is applied to avoid overlapping detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
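A common greedy form of NMS keeps the highest-scoring box and discards every remaining box that overlaps it too much, then repeats. The sketch below follows that scheme; it is not the code from the cited tutorial, and the threshold of 0.3 is an assumption for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept, higher-scoring box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```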

Detection: Faces: DDFD: Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces: Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces: Recognition: FaceNet

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

Faces: Recognition: FaceNet

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces: Recognition: FaceNet

...by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
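The triplet idea can be written down as a loss over an anchor face, a positive (same identity) and a negative (different identity): the anchor should be closer to the positive than to the negative by at least a margin. The sketch below is a minimal version on plain lists; the margin value of 0.2 and the function names are assumptions for the example, and the selection of "well-chosen" triplets is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

A triplet that already satisfies the margin contributes zero loss, which is why the choice of triplets matters for training speed.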

Faces: Recognition: FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces: Recognition: FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces: Recognition: FaceNet: Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces: Recognition: FaceNet: Software

Software implementation: OpenFace

Faces: Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects: Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects: Recognition: Retrieval

Image Database

Visual Query: "A dog"

Expected outcome

Objects: Recognition: Retrieval

Image Database

Visual Query: "This dog"

Expected outcome

Objects: Recognition: Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects: Recognition: Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
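Because bag-of-visual-words vectors are highly sparse, an inverted file stores, for each visual word, only the IDs of the images that contain it. The sketch below is a minimal illustration with hypothetical function names and a simple vote-counting ranking; practical systems usually add tf-idf weighting, which is omitted here.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict mapping image ID -> iterable of visual word IDs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank image IDs by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a deterministic order.
    return sorted(votes, key=lambda i: (-votes[i], i))
```

Only the posting lists of the words present in the query are visited, so the cost depends on the query sparsity rather than on the database size.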

Objects: Recognition: Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects: Recognition: Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects: Recognition: Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
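Sum or max pooling collapses each channel of a convolutional layer over all spatial positions, giving one number per channel and therefore a compact global descriptor. The sketch below shows the bare operation on nested lists; the function names are hypothetical, and the weighting and whitening steps used in the cited papers are not included.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of rows. Returns a C-D vector."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)
```

L2 normalization is a common follow-up so that descriptors can be compared with a dot product.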

Objects: Recognition: Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects: Recognition: Retrieval

- Input resolution: 336x256
- Local features: conv5_1 from VGG16 [1], with a spatial size of 42x32
- Codebook: 25K centroids, giving a 25K-D vector
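In this pipeline each of the 42x32 spatial positions of the conv5_1 output acts as a local feature, which is assigned to its nearest centroid; counting assignments gives a bag-of-words vector with one bin per centroid. The sketch below illustrates the assignment with a brute-force nearest-neighbour search and hypothetical function names; the 25K-centroid codebook itself would be learned beforehand (for example with k-means) and is taken as given.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])))


def bag_of_words(local_features, centroids):
    """Histogram of centroid assignments: one bin per visual word."""
    hist = [0] * len(centroids)
    for feature in local_features:
        hist[nearest_centroid(feature, centroids)] += 1
    return hist
```

With 42x32 = 1344 local features and 25K bins, at most 1344 bins can be non-zero, so the resulting vector is very sparse and suits an inverted file.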

Objects: Recognition: Retrieval

Query representation:
- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects: Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects: Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects: Segmentation: Farabet

Pyramid of three spatial scales

Objects: Segmentation: Farabet

The same parameters are used in the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh
Pooling: max

Objects: Segmentation: Farabet

Upsampling and concatenation

Objects: Segmentation: Farabet

Pixel-wise soft-max classifier
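A pixel-wise soft-max turns the per-class scores at each pixel into a probability distribution, and the predicted label is the most probable class. The sketch below is a minimal pure-Python version with hypothetical function names; it operates on a score map stored as nested lists rather than on real network outputs.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of per-class scores; returns the argmax label map."""
    labels = []
    for row in score_map:
        labels.append([max(range(len(p)), key=softmax(p).__getitem__) for p in row])
    return labels
```

Each pixel is classified independently here, which is exactly why the labels lack spatial consistency.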

Objects: Segmentation: Farabet

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects: Segmentation: Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects: Segmentation: Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects: Segmentation: Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
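The rule for the optimal cover can be sketched as a walk from a leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The sketch assumes the tree is given as a child-to-parent dictionary and that a cost S has already been computed for every component; both data structures and the function name are hypothetical.

```python
def optimal_component(parent, cost, leaf):
    """Return the node on the leaf-to-root path with minimal cost."""
    best = node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        # The root has no entry in `parent`, so the walk stops there.
        node = parent.get(node)
    return best
```

For a path pixel -> C2 -> C5 -> root where C5 has the lowest cost, the pixel (and hence C2) takes the class predicted for C5.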

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector A: [feature vector 1, feature vector 2], both extracted with the BBOX CNN.

- Feature vector 1: from the bounding box.
- Feature vector 2: from the region, with the background masked out with the mean image.

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.
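Masking out the background with the mean image means that, after mean subtraction, background pixels become zero and only the region foreground contributes to the features. The sketch below illustrates this on a single-channel crop with a scalar mean; the function name and the simplified data layout are assumptions for the example.

```python
def mask_background(crop, mask, mean_value):
    """Replace every pixel outside the region mask with the mean value."""
    return [[pixel if inside else mean_value
             for pixel, inside in zip(pixel_row, mask_row)]
            for pixel_row, mask_row in zip(crop, mask)]
```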

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector B: [feature vector 1, feature vector 2], with feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN.

Training: the 2 networks are trained in isolation.

Testing: results are combined.

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap).

Testing: results are combined (using the output of the penultimate layer).

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer, followed by an SVM

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
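For one category, the pixel IU (Jaccard index) is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. The sketch below computes it on label maps stored as nested lists; the function name is hypothetical and `None` is returned when the category is absent from both maps.

```python
def pixel_iu(pred, truth, category):
    """Intersection-over-union of one category between two label maps."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += (p == category and t == category)
            union += (p == category or t == category)
    return inter / union if union else None
```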

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 39: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

Step 2: Train Fast R-CNN with the learned RPN proposals.

- Conv layers (up to Conv Layer 5): ImageNet weights (fine-tuned)
- Input: RPN proposals (learned in Step 1)
- Output: class probabilities

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again.

- Conv layers (up to Conv Layer 5): weights from Step 2 (fixed)
- Output: RPN proposals

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

Step 4: Fine-tune the FC layers of Fast R-CNN, using the same shared convolutional layers as in Step 3.

- Conv layers (up to Conv Layer 5): weights from Steps 2 & 3 (fixed)
- Input: RPN proposals (learned in Step 3)
- Output: class probabilities

Detection: Objects: Faster R-CNN

Slide credit: Amaia Salvador

[Tables: Detection Accuracy (Pascal VOC); Timing in ms (Pascal VOC)]

Detection: Objects: Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Detection: Objects: Reinforcement Learning

The object is localized based on visual features from AlexNet FC6.

Detection: Objects: Reinforcement Learning. Slide credit: Míriam Bellver

Set of actions A:
- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Detection: Objects: Reinforcement Learning. Slide credit: Míriam Bellver

Set of states S: (o, h)

o = feature vector from the pre-trained CNN (fc6), 4096 dimensions

h = history of taken actions, binary vector of dimension 90
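The state is the concatenation of the two vectors above, giving 4096 + 90 = 4186 dimensions. The sketch below builds the history part under the assumption that the 90 dimensions come from one-hot encoding the last 10 actions over 9 possible actions; that split, and the function names, are assumptions made for the example rather than facts stated on the slide.

```python
NUM_ACTIONS = 9    # assumed: 8 transformation actions plus the trigger
HISTORY_LEN = 10   # assumed: number of past actions remembered


def history_vector(past_actions):
    """Binary vector with one one-hot slot per remembered action."""
    recent = past_actions[-HISTORY_LEN:]
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def state_vector(o, past_actions):
    """State (o, h): CNN feature vector followed by the action history."""
    return list(o) + history_vector(past_actions)
```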

Detection: Objects: Reinforcement Learning. Slide credit: Míriam Bellver

Reward function R (figure: ground-truth bounding box)

Detection: Objects: Reinforcement Learning. Slide credit: Míriam Bellver

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

Trigger reward magnitude: 3

Minimum IoU: 0.6
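The trigger reward can be sketched as a thresholded function of the IoU between the final box and the ground truth. The values 3 and 0.6 come from the slide; the symmetric form (positive when the threshold is reached, negative otherwise) and the parameter names `eta` and `tau` are assumptions for the example.

```python
def trigger_reward(iou_value, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth with IoU >= tau, else -eta."""
    return eta if iou_value >= tau else -eta
```

Under this form, stopping on a poorly localized box is penalized as strongly as a good stop is rewarded.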

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Slide 92: Objects Recognition Retrieval

Slide 93: Objects Recognition Retrieval

[Diagram: image resolution (336x256) -> conv5_1 from VGG16 [1] -> feature map (42x32) -> 25K centroids -> 25K-D vector]
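The pipeline on this slide can be sketched as follows: every location of the conv feature map yields one local feature, each local feature is assigned to its nearest centroid (visual word), and the image is described by the sparse histogram of assignments. The tiny codebook and features below are hypothetical stand-ins for the 25K centroids and the 42 x 32 = 1344 local features per image.

```python
from collections import Counter

def assign(feature, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, cen)) for cen in centroids]
    return dists.index(min(dists))

def bag_of_words(local_features, centroids):
    """Sparse histogram: visual word index -> number of local features assigned."""
    return Counter(assign(f, centroids) for f in local_features)

# Toy stand-in: 4 local features and 3 centroids (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
bow = bag_of_words(features, centroids)
```

Storing only the non-zero bins keeps the 25K-D vector compact, since an image can activate at most as many words as it has local features.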

Slide 94: Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

Slide 95: Objects Recognition Retrieval

Slide 96: One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Slide 97: Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Slide 98: Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Slide 99: Objects Segmentation Farabet

Pyramid of three spatial scales

Slide 100: Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Slide 101: Objects Segmentation Farabet

Upsampling and concatenation
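The three steps above (pyramid, shared parameters, upsampling and concatenation) can be sketched in a few lines. This is only an illustration under simplifying assumptions: `downsample` is a crude subsampling instead of a proper image pyramid, and `features` is a hypothetical stand-in for the shared convnet, applied with the same parameters at every scale.

```python
def downsample(image, factor):
    """Keep every `factor`-th pixel (a crude stand-in for a pyramid level)."""
    return [row[::factor] for row in image[::factor]]

def features(image):
    """Stand-in for the shared convnet: same function, same parameters, at every scale."""
    return [[2.0 * p + 1.0 for p in row] for row in image]

def upsample(fmap, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def multiscale_features(image, factors=(1, 2, 4)):
    """Per-pixel concatenation of the features computed at each scale."""
    maps = [upsample(features(downsample(image, f)), f) for f in factors]
    h, w = len(image), len(image[0])
    return [[[m[y][x] for m in maps] for x in range(w)] for y in range(h)]

image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
stacked = multiscale_features(image)
```

Each pixel ends up with one feature per scale, so fine detail and wider context are available to the classifier at the same location.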

Slide 102: Objects Segmentation Farabet

Pixel-wise soft-max classifier
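A minimal sketch of a pixel-wise soft-max classifier, assuming each pixel already carries a list of class scores. The soft-max turns the scores into a distribution and the pixel takes the most probable class. The score values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_pixels(score_map):
    """score_map[y][x] is a list of class scores; return the arg-max class per pixel."""
    labels = []
    for row in score_map:
        out_row = []
        for scores in row:
            probs = softmax(scores)
            out_row.append(probs.index(max(probs)))
        labels.append(out_row)
    return labels

# One row with two pixels and three classes (hypothetical scores).
scores = [[[2.0, 0.5, 0.1], [0.0, 3.0, 1.0]]]
labels = label_pixels(scores)
```

Because every pixel is labelled independently, neighbouring pixels can receive different classes, which is the consistency problem raised on the next slide.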

Slide 103: Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Slide 104: Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels

Slide 105: Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Slide 106: Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Slide 107: Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Slide 108: Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
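The selection rule above can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the node names and the cost values below are hypothetical and only chosen so that leaf C2 ends up covered by C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best

# Hypothetical tree: C7 is the root; C5 and C6 are its children.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.3, "C2": 0.5, "C3": 0.1, "C4": 0.6,
        "C5": 0.2, "C6": 0.4, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then inherits the class predicted for its selected component, so different pixels can be explained at different observation levels.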

Slide 109: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Slide 110: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Slide 111: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

[Diagram, vector A: BBOX CNN -> feature vector 1; BBOX CNN on the region with the background masked out with the mean image -> feature vector 2; concatenation [1 2]]

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
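A small sketch of the masking step named on this slide: pixels inside the region keep their value and background pixels are replaced by the corresponding mean-image pixel, so that they become neutral once the mean is subtracted. The single-channel toy arrays are hypothetical.

```python
def mask_background(crop, mask, mean_image):
    """Keep foreground pixels; replace background pixels with the mean image."""
    return [
        [pix if keep else mean for pix, keep, mean in zip(crop_row, mask_row, mean_row)]
        for crop_row, mask_row, mean_row in zip(crop, mask, mean_image)
    ]

# 2x2 single-channel example (hypothetical values); mask 1 = region foreground.
crop = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]
mean_image = [[100, 100], [100, 100]]
region_input = mask_background(crop, mask, mean_image)
```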

Slide 112: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Diagram, vector B: BBOX CNN -> feature vector 1; REGION CNN -> feature vector 2; concatenation [1 2]]

Slide 113: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Slide 114: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

penultimate fully connected layer

SVM

Slide 115: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Slide 116: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
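For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below assumes label maps stored as nested lists; the example maps are hypothetical.

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labelled `category`."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_p, in_t = p == category, t == category
            inter += int(in_p and in_t)
            union += int(in_p or in_t)
    return inter / union if union else float("nan")

# Hypothetical 2x2 label maps with categories 0 and 1.
pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
iu_cat1 = pixel_iu(pred, truth, 1)
iu_cat0 = pixel_iu(pred, truth, 0)
```

Averaging the per-category values gives a single score for the whole labelling.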

Slide 117: Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Slide 118: One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Slide 119: Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 40: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Slide 40: Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5 -> RPN -> RPN Proposals]

Step 3: The model trained in Step 2 is used to initialize the RPN, which is trained again

Weights from Step 2 (fixed)

Slide 41: Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

[Diagram: Conv layers -> Conv Layer 5; RPN Proposals (learned in Step 3); Class probabilities]

Step 4: Fine-tune the FC layers of Fast R-CNN using the same shared convolutional layers as in Step 3

Weights from Steps 2 & 3 (fixed)

Slide 42: Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Slides 43-45: Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Slide 46: Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. Active object localization with deep reinforcement learning. ICCV 2015. [Slides by Míriam Bellver]

Slide 47: Detection Objects Reinforcement Learning

Object is localized based on visual features from AlexNet FC6

Slide 48: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A

Transformation actions

Slide 49: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region with inhibition-of-return (IoR)

Slide 50: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Set of states S

(o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)

Slide 51: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward Function R

[Figure label: ground-truth bounding box]

Slide 52: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Reward Function R for trigger action

The reward function considers the number of steps as a cost

[Figure values: 3; minimum IoU 0.6]
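The two reward slides can be sketched as below. The trigger reward uses the two values visible on the slide (magnitude 3 and minimum IoU 0.6). The form of the reward for transformation actions (+1 when the IoU with the ground-truth box improves, -1 otherwise) is an assumption recalled from the cited paper and is not spelled out in the slide text. Boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

ETA = 3.0  # trigger reward magnitude (the "3" on the slide)
TAU = 0.6  # minimum IoU for a successful trigger

def move_reward(old_box, new_box, gt):
    """Assumed form: +1 if the move improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt) > iou(old_box, gt) else -1.0

def trigger_reward(box, gt):
    """+ETA when the final box overlaps the ground truth enough, -ETA otherwise."""
    return ETA if iou(box, gt) >= TAU else -ETA

gt = (0, 0, 10, 10)
loose, tight = (0, 0, 10, 5), (0, 0, 10, 8)
```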

Slide 53: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Slide 54: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

The action-value function is estimated using a neural network that has as many output units as actions

The algorithm incorporates a replay memory to collect experiences

Category-specific Q-network

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
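A minimal sketch of the three ingredients named on this slide: a greedy policy over the Q-network outputs, a replay memory that stores experiences, and the standard Q-learning target. The Q-values are passed in as a plain list (one entry per action), and the discount `gamma=0.9` and the class and function names are assumptions made for the example.

```python
import random
from collections import deque

def greedy_action(q_values):
    """Pick the action whose estimated action-value is largest."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max over next-state action values."""
    return reward if terminal else reward + gamma * max(next_q_values)

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push("s%d" % step, step, 1.0, "s%d" % (step + 1))
```

Sampling past experiences at random, instead of training only on the latest transition, reduces the correlation between consecutive updates.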

Slide 55: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Slide 56: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)

2) Terminal Regions (TR)

Slide 57: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Best performance with few region proposals

Slides 58-59: Detection Objects Reinforcement Learning

Slide credit: Míriam Bellver

Slide 60: Detection Faces

Slide 61: Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. Multi-view Face Detection Using Deep Convolutional Neural Networks. ICMR (2015). [software]

Slide 62: Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz

25k annotated faces on images downloaded from Flickr

380k manually annotated facial landmarks

Slide 63: Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): a sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise

Total samples: 200K positive and 20M negative

Slide 64: Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
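With 3 scales per octave, consecutive scales differ by a factor of 2^(1/3), so that three steps double (or halve) the image size. The sketch below lists the scales between two bounds; the bounds 0.5 and 2.0 and the function name are assumptions for the example.

```python
def pyramid_scales(min_scale, max_scale, scales_per_octave=3):
    """Scales spaced by a factor of 2 ** (1 / scales_per_octave)."""
    step = 2.0 ** (1.0 / scales_per_octave)
    scales = []
    s = min_scale
    # Small tolerance so the upper bound is kept despite rounding error.
    while s <= max_scale * (1 + 1e-9):
        scales.append(s)
        s *= step
    return scales

# Two octaves (0.5x to 2x) give 2 * 3 + 1 = 7 scales.
scales = pyramid_scales(0.5, 2.0)
```

The detector with a fixed window size is then run on every rescaled copy, so a small face is found in an enlarged image and a large face in a reduced one.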

Slide 65: Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Slide 66: Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Slide 67: Detection Faces DDFD Test

This makes it possible to:

- Efficiently run the convnet on images of any size

- Obtain a heat-map of the face detector
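The conversion can be illustrated with a single fully-connected unit. On an input of exactly k x k it produces one score; reinterpreted as a k x k convolution kernel, the same weights slide over a larger input and produce one score per position, which is the heat-map. The weights and images below are hypothetical, and a real network would apply this to multi-channel feature maps.

```python
def fc_score(patch, weights, bias):
    """A fully-connected unit applied to a k x k patch (same weights as the kernel)."""
    return sum(p * w for prow, wrow in zip(patch, weights)
               for p, w in zip(prow, wrow)) + bias

def heat_map(image, weights, bias, stride=1):
    """Apply the same unit as a k x k convolution over an image of any size."""
    k = len(weights)
    rows = range(0, len(image) - k + 1, stride)
    cols = range(0, len(image[0]) - k + 1, stride)
    return [[fc_score([r[x:x + k] for r in image[y:y + k]], weights, bias)
             for x in cols] for y in rows]

weights = [[1, 0], [0, 1]]                   # hypothetical 2x2 unit
small = [[1, 2], [3, 4]]                     # training-size input: one score
large = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # larger input: a map of scores
single = heat_map(small, weights, bias=0)
scores = heat_map(large, weights, bias=0)
```

Overlapping windows share their computation in the convolutional form, which is why it is cheaper than cropping and classifying each window separately.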

Slide 68: Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
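A minimal sketch of greedy NMS, written for illustration and not taken from the cited source: detections are visited in order of decreasing score and a detection is kept only if it does not overlap an already kept one by more than a threshold. The IoU threshold of 0.5, the (x1, y1, x2, y2) box format and the example detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """detections: list of (score, box). Keep the best box, drop its overlaps, repeat."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

detections = [
    (0.8, (12, 12, 52, 52)),      # overlaps the best detection heavily
    (0.9, (10, 10, 50, 50)),
    (0.7, (100, 100, 140, 140)),  # a separate face
]
kept = nms(detections)
```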

Slide 69: Detection Faces DDFD Results

Slide 70: Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models

- OpenCV face detector is an implementation of Viola & Jones

- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

Slide 71: One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Slide 72: Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Slide 73: Faces Recognition FaceNet

[Diagram: Faces -> FaceNet -> Euclidean space where distances correspond to face similarity]

Slide 74: Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009): 207-244.

Slide 75: Faces Recognition FaceNet

by means of well chosen triplets using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
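To make the role of triplets concrete, the sketch below computes a triplet loss on toy embeddings: the anchor should be closer to a face of the same identity (positive) than to a face of another identity (negative) by at least a margin. The formula is not written out on the slides, so the margin value of 0.2, the function names and the 2-D embeddings are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

anchor = [0.0, 0.0]
positive = [0.1, 0.0]        # same identity, close to the anchor
easy_negative = [1.0, 0.0]   # already far away: contributes no loss
hard_negative = [0.3, 0.0]   # too close: contributes a positive loss
easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Easy triplets give zero loss and therefore no learning signal, which is why the choice of triplets matters.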

Slide 76: Faces Recognition FaceNet

Slide 77: Faces Recognition FaceNet

Zeiler, Matthew D., and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 1 (NN1): ZF

Slide 78: Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. CVPR 2015. (Slides by Elisa Sayrol)

Slide 79: Faces Recognition FaceNet

Slide 80: Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Slide 81: Faces Recognition FaceNet Software

Software implementation: OpenFace

Slide 82: Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Slide 83: Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

Slide 84: Objects Recognition Retrieval

Image Database

Visual Query

“A dog”

Expected outcome

Slide 85: Objects Recognition Retrieval

Image Database

Visual Query

“This dog”

Expected outcome

Slide 86: Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Slide 87: Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-Dimensional feature space

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

[Table with columns "word" and "Image ID"; flattened cell values: 1 1 12 2 1 30 1023 10 124 23 6 10]
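Because bag-of-visual-words vectors are highly sparse, an inverted file only stores, for each visual word, the IDs of the images that contain it. The sketch below builds such an index and ranks database images by the number of words they share with a query. The image IDs, word IDs and function names are hypothetical and do not reproduce the table on the slide.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return {w: sorted(ids) for w, ids in index.items()}

def candidates(index, query_words):
    """Images sharing at least one visual word with the query, with vote counts."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in index.get(w, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
index = build_inverted_file(image_words)
ranking = candidates(index, [1, 3])
```

Only the lists of the query's words are visited, so images that share no word with the query are never touched.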


Page 41: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

41

Slide credit Amaia Salvador

Conv Layer 5

Co

nv

laye

rs

RPN Proposals (learned in 3)

Class probabilities

Step 4 Fine tune FC layers of Fast R-CNN using same shared convolutional layers as in 3

Weights from Step 2amp3(fixed)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

42

Slide credit Amaia Salvador

Detection Accuracy (Pascal VOC)

Timing in ms (Pascal VOC)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

43

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Reward function R: based on the overlap between the current bounding box and the ground-truth box.

Reward function R for the trigger action:

- The reward function considers the number of steps as a cost.
- Values shown on the slide: 3 and a minimum IoU of 0.6.
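
The rewards above can be illustrated with a minimal Python sketch. The slides do not give the formulas, so the sign-based reward for transformation actions and the +3 / -3 trigger reward are assumptions based on our reading of the cited paper; the function names and the (x1, y1, x2, y2) box format are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def transformation_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the move improved the IoU with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """Assumed form: +eta if the final box reaches the minimum IoU tau, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta
```

With these definitions, an agent that stops on a box overlapping the ground truth by at least 0.6 IoU would receive +3, and -3 otherwise.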

Policy function: if the current state is S, which should be the next action A?

The policy is learnt with reinforcement learning, using Q-learning:

- The action-value function is estimated using a neural network that has as many output units as actions.
- The algorithm incorporates a replay memory to collect experiences.
- The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
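
A minimal sketch of the three ingredients named above: a replay memory, the greedy policy over the network outputs, and the standard Q-learning target. The class and function names, the buffer capacity and the discount factor `gamma=0.9` are illustrative assumptions, and the Q-network itself is left out (its outputs are passed in as plain lists).

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform random mini-batch, never larger than the buffer itself.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Index of the action with the maximum estimated value (one output unit per action)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)
```

During training, experiences would be pushed after every step and mini-batches sampled from the memory to update the network towards `q_target`.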

Datasets for training and testing: PASCAL VOC.

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals.

Detection: Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection: Faces: DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz.

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.

Detection: Faces: DDFD: Test

Test images are rescaled up and down, 3 times per octave, to find faces of different sizes.
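
As a rough illustration of "3 times per octave": consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. The sketch below lists such scale factors; the function name and the scale range are assumptions, not values from the slides.

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave) that fall inside [min_scale, max_scale]."""
    # Small epsilons guard against floating-point error at the range boundaries.
    k_min = math.ceil(steps_per_octave * math.log2(min_scale) - 1e-9)
    k_max = math.floor(steps_per_octave * math.log2(max_scale) + 1e-9)
    return [2.0 ** (k / steps_per_octave) for k in range(k_min, k_max + 1)]
```

For example, `pyramid_scales(0.5, 2.0)` covers one octave down and one octave up in seven steps, including the original size (scale 1.0).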

A sliding window of 227x227 is moved over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
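
The sliding-window scan can be sketched as a generator of window coordinates. The 227x227 window size comes from the slide; the stride of 32 pixels and the function name are assumptions made only for illustration.

```python
def sliding_windows(width, height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) for a fixed-size window moved over a width x height image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)
```

Each window would then be cropped and classified as face or non-face; images smaller than the window yield no windows at all.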

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) is applied to avoid overlapping detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
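
A minimal sketch of greedy NMS, assuming detections are given as (x1, y1, x2, y2) boxes with one score each. This is a generic IoU-based variant written for illustration, not necessarily the implementation in the cited tutorial or in DDFD; the 0.5 threshold is an assumption.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better-scored kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping detections of the same face collapse into the higher-scored one, while distant detections are left untouched.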

Detection: Faces: DDFD: Results

Precision vs. recall curves:

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation.

Faces: Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning):

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

The embedding is learnt by means of well-chosen triplets, using curriculum learning:

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
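
The triplet idea can be sketched as a loss on three embeddings: an anchor, a positive (same identity) and a negative (different identity). The form below, max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distances, is the commonly cited triplet loss; the margin value of 0.2 and the function names are assumptions, since the slides do not state them.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

Triplets whose loss is already zero contribute nothing to learning, which is why the choice of triplets matters.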

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces: Recognition: FaceNet: Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces: Recognition: FaceNet: Software

Software implementation: OpenFace

Faces: Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects: Recognition: Retrieval

E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Given an image database and a visual query, the expected outcome depends on the query:

- "A dog": images of any dog.
- "This dog": images of that particular dog.

Instance retrieval (instance: object, building, person, place…)

Bag of Visual Words:

- Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space
- Bag of Visual Words representation: high-dimensional, highly sparse

Inverted file:

word    image IDs
1       1, 12
2       1, 30, 102
3       10, 12
4       23, 6, 10
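
Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. A minimal sketch follows; the function names are hypothetical, and ranking by the number of shared words is a simplification of the weighting schemes normally used in retrieval.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """Map {image_id: [visual word ids]} to {word: set of image ids containing it}."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def query(inverted, query_words):
    """Rank images by how many distinct query words they share (simplified scoring)."""
    votes = Counter()
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()
```

Only the lists of the words present in the query are visited, so images sharing no word with the query are never touched.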

Convolutional Neural Networks:

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation.

- Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
- Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation.

- Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
- Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
- Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
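
Sum or max pooling turns an H x W x C convolutional feature map into a single C-dimensional descriptor by reducing over the spatial locations. A minimal pure-Python sketch, with the feature map given as nested lists; the function names and the final L2 normalization step are assumptions for illustration, not the exact pipeline of any of the cited papers.

```python
def pool_conv_features(feature_map, mode="sum"):
    """feature_map: H rows, each a list of W local descriptors of length C."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    locations = [vec for row in feature_map for vec in row]
    reduce_fn = sum if mode == "sum" else max
    # zip(*locations) groups the values of each channel across all locations.
    return [reduce_fn(channel) for channel in zip(*locations)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector] if norm > 0 else list(vector)
```

The resulting global descriptors can be compared with a dot product or Euclidean distance to rank database images.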

Convolutional Neural Networks: conv features encoded with VLAD as global representation.

- Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Bag of Words over local convolutional features:

- Image resolution: 336x256
- Features: conv5_1 from VGG16 [1] (42x32)
- 25K centroids, 25K-D vector

Query representation:

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation.

Objects: Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects: Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

- Pyramid of three spatial scales.
- The same parameters are used in the three convnets: theta_i = theta_0, where theta denotes the filter weights (H_l) and biases (b_l).
- Non-linearity: tanh. Pooling: max.
- Upsampling and concatenation.
- Pixel-wise soft-max classifier.
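
The last step, a pixel-wise soft-max classifier, can be sketched as follows: every pixel has one score per category, the soft-max turns the scores into probabilities, and the pixel takes the most probable category. The nested-list layout and the function names are assumptions for illustration; the scores would come from the concatenated multiscale features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map: H x W grid of per-class score lists -> H x W grid of class indices."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels
```

Since every pixel is labelled independently of its neighbours, nothing in this step enforces spatially consistent labels.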

Problem: no spatial consistency among labels.

Three explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing.

Problem with Solutions 1 & 2: the observation level.

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

Example from the figure: C2 will be labelled with the class of C5.
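
The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost. The tree layout (a `parent` dictionary), the cost values and the function name are hypothetical; they are chosen only so that the result matches the statement that C2 takes the class of C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the component on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C4": 0.4, "C5": 0.3, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

In this toy example C2 is covered by its parent C5, whose cost is lower, while C1 keeps itself as its best observation level.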

Objects: Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the bounding box and the region foreground (background masked out with the mean image) are both passed through the BBOX CNN, and the two feature vectors are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a REGION CNN; they are concatenated as [1 2].

- Training: the 2 networks are trained in isolation.
- Testing: results are combined.

Vector C:

- Training: the network is trained as a whole (using segmentation overlap).
- Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
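
The pixel IU (Jaccard index) of a category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch over label maps stored as nested lists; the function name and the convention of returning 0.0 for an absent category are assumptions.

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index of one category between two H x W label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt  # pixel has the category in both maps
            union += in_pred or in_gt   # pixel has the category in at least one map
    return inter / union if union else 0.0
```

Averaging this value over all categories gives a single score for a semantic segmentation result.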

One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation.

Thank you

https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu

Xavier Giró-i-Nieto

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065

Convolutional Neural Networks: sum/max pooled conv features as global representation
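A minimal sketch of the basic pooling step behind these global representations: each channel of a conv layer is collapsed to one value by sum or max, and the result is L2-normalized. The toy tensor and function names are hypothetical, and the cited methods add further steps (weighting, regions, whitening) that are not shown here.

```python
import math

def sum_pool(feature_maps):
    """Sum each channel's H x W activation map into a single number."""
    return [sum(sum(row) for row in channel) for channel in feature_maps]

def max_pool(feature_maps):
    """Keep the maximum activation of each channel."""
    return [max(max(row) for row in channel) for channel in feature_maps]

def l2_normalize(vector):
    """Scale the vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Hypothetical conv output: 2 channels, each of size 2 x 2.
conv = [[[1.0, 2.0], [0.0, 1.0]],
        [[0.0, 3.0], [0.0, 0.0]]]
sum_descriptor = l2_normalize(sum_pool(conv))
max_descriptor = l2_normalize(max_pool(conv))
```

The descriptor length equals the number of channels, so images of any size map to vectors that can be compared directly.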

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Objects Recognition Retrieval

Resolution: 336x256

conv5_1 from VGG16 [1]: 42x32

25K centroids: 25K-D vector
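As a sketch of how a grid of local conv features becomes a bag-of-words vector: each spatial location is assigned to its nearest centroid and the assignments are counted. The three 2-D centroids and the 2x2 grid are hypothetical stand-ins for the 25K centroids and the 42x32 feature map on the slide.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(feature, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def assignment_map(feature_grid, centroids):
    """Replace every local feature in the H x W grid by its visual word."""
    return [[nearest_centroid(f, centroids) for f in row] for row in feature_grid]

def bow_histogram(assignments, vocabulary_size):
    """Count how often each visual word occurs in the assignment map."""
    hist = [0] * vocabulary_size
    for row in assignments:
        for word in row:
            hist[word] += 1
    return hist

# Hypothetical vocabulary of 3 centroids and a 2 x 2 grid of 2-D features.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
grid = [[[0.1, 0.0], [0.9, 1.2]],
        [[3.8, 0.1], [1.1, 0.9]]]
words = assignment_map(grid, centroids)
histogram = bow_histogram(words, len(centroids))
```

The assignment map keeps the spatial layout of the words, while the histogram is the sparse vector that would be stored in an inverted file.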

Objects Recognition Retrieval: Query Representation

Global Search (GS)

Local Search (LS)

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
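A pixel-wise soft-max classifier can be sketched as follows: every pixel has one score per class, the scores are turned into probabilities independently per pixel, and the label is the most probable class. The score map below is a hypothetical 1x2 image with three classes.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixelwise_labels(score_map):
    """score_map[y][x] is a list with one score per class."""
    probs = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[p.index(max(p)) for p in row] for row in probs]
    return probs, labels

# Hypothetical 1 x 2 image with 3 class scores per pixel.
scores = [[[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]]
probabilities, labels = pixelwise_labels(scores)
```

Each pixel is classified on its own here, which is why the next slide notes the lack of spatial consistency among labels.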

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Prediction with a 2-layer network

Solution 1: Superpixels
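One common way to use superpixels for spatial consistency is to pool the per-pixel class distributions inside each superpixel and give all of its pixels the winning class. The sketch below assumes that scheme; the probabilities and the superpixel map are hypothetical.

```python
def superpixel_labels(pixel_probs, superpixel_ids):
    """Label every pixel with the class that wins inside its superpixel."""
    sums = {}
    for probs_row, ids_row in zip(pixel_probs, superpixel_ids):
        for probs, sp in zip(probs_row, ids_row):
            if sp not in sums:
                sums[sp] = [0.0] * len(probs)
            # Accumulate class probabilities; argmax of the sum equals argmax of the mean.
            sums[sp] = [s + p for s, p in zip(sums[sp], probs)]
    winner = {sp: max(range(len(s)), key=lambda c: s[c]) for sp, s in sums.items()}
    return [[winner[sp] for sp in row] for row in superpixel_ids]

# Hypothetical 1 x 3 image with 2 classes; pixels 0 and 1 share superpixel 0.
pixel_probs = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
superpixel_ids = [[0, 0, 1]]
smoothed = superpixel_labels(pixel_probs, superpixel_ids)
```

In this toy case the middle pixel would be class 1 on its own, but it takes class 0 from its superpixel.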

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
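The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the node with the smallest cost. The tree shape and the cost values below are hypothetical; they are chosen so that leaf C2 ends up covered by C5, as in the slide's example.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best

# Hypothetical segmentation tree with root C7 and made-up costs S.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.3, "C2": 0.8, "C3": 0.2, "C4": 0.4,
        "C5": 0.5, "C6": 0.6, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so different pixels can be labelled at different levels of the tree.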

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates: cuts algorithm, hierarchical segmenter, and grouping strategy to combine multiscale regions

Vector A: the BBOX CNN gives feature vector 1 from the box, and the same BBOX CNN gives feature vector 2 from the region with the background masked out with the mean image; both are concatenated as [1 2]. The network is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: the BBOX CNN gives feature vector 1 and a REGION CNN gives feature vector 2; both are concatenated as [1 2]. Training: 2 networks trained in isolation. Testing: results are combined.

Vector C: Training as a whole (using segmentation overlap). Testing: results are combined (using the output of the penultimate layer).

Penultimate fully connected layer, followed by an SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
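Pixel IU (the Jaccard index) for one category is the number of pixels labelled with that category in both maps, divided by the number labelled with it in either map. A minimal sketch on hypothetical 2x3 label maps:

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in both maps."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == category
            in_gt = g == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return intersection / union if union else 0.0

# Hypothetical predicted and ground-truth label maps (0 = background, 1 = object).
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
iu_class_1 = pixel_iu(pred, gt, 1)
```

Averaging this value over categories gives a single score for a pixel-level labeling.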


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 43: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015 [Slides by Míriam Bellver]

Slide credit: Míriam Bellver

Object is localized based on visual features from AlexNet FC6

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)
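A sketch of how the state (o, h) could be assembled. It assumes, following the cited paper as recalled here, that the 90 dimensions come from the last 10 actions, each one-hot encoded over 9 actions (8 transformations plus the trigger); the constants and function names are illustrative.

```python
from collections import deque

NUM_ACTIONS = 9       # assumption: 8 transformation actions + 1 trigger
HISTORY_LENGTH = 10   # assumption: 10 past actions, so 9 * 10 = 90 dims

def history_vector(past_actions):
    """Concatenate one one-hot block per remembered action."""
    vec = [0] * (NUM_ACTIONS * HISTORY_LENGTH)
    for slot, action in enumerate(past_actions):
        vec[slot * NUM_ACTIONS + action] = 1
    return vec

def build_state(features, past_actions):
    """State = CNN feature vector o followed by the action history h."""
    return list(features) + history_vector(past_actions)

# Keep only the most recent actions; older ones fall out automatically.
history = deque(maxlen=HISTORY_LENGTH)
for action in [0, 3, 8]:
    history.append(action)

# A zero vector stands in for the 4096-dim fc6 features.
state = build_state([0.0] * 4096, history)
```

The history lets the agent tell apart situations that look the same visually but were reached through different sequences of moves.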

Reward function R (figure labels: ground truth, bounding box)

Reward function R for trigger action

The reward function considers the number of steps as a cost

Slide annotations: 3; minimum IoU 0.6
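A sketch of the two reward functions, under the assumption (based on the cited paper as recalled here) that a transformation earns the sign of the change in IoU with the ground truth, and that the trigger earns plus or minus a fixed value depending on a minimum IoU. The values 3 and 0.6 on the slide are taken to be that trigger reward and threshold, and the zero-change case is mapped to -1 as an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def transformation_reward(old_box, new_box, gt_box):
    """+1 if the move improved the overlap with the ground truth, else -1."""
    return 1.0 if iou(new_box, gt_box) > iou(old_box, gt_box) else -1.0

def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """+eta if the final box overlaps enough with the ground truth, else -eta."""
    return eta if iou(box, gt_box) >= tau else -eta

gt = (0, 0, 10, 10)
step_reward = transformation_reward((5, 5, 15, 15), (2, 2, 12, 12), gt)
early_stop = trigger_reward((2, 2, 12, 12), gt)
good_stop = trigger_reward((0, 0, 10, 9), gt)
```

Because every step earns only plus or minus one while a successful trigger earns more, short sequences that end on a well-placed box are favoured.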

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. Category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
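A minimal sketch of the three pieces named above: a replay memory, the greedy policy over the network's output units, and the standard Q-learning target. The network itself is omitted, and the class name, capacity and discount factor are assumptions for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch without replacement."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def greedy_action(q_values):
    """Pick the action whose estimated action value is the largest."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: reward plus discounted best value of the next state."""
    return reward if terminal else reward + gamma * max(next_q_values)

memory = ReplayMemory(capacity=2)
memory.push("s0", 1, 1.0, "s1")
memory.push("s1", 0, -1.0, "s2")
memory.push("s2", 8, 3.0, None)   # the oldest experience is discarded
batch = memory.sample(2)
```

Sampling past experiences at random breaks the correlation between consecutive steps of one search sequence when training the Q-network.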

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015) [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr. 380k manually annotated facial landmarks.

Randomly samples sub-windows (blocks). Positive examples if Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative samples otherwise.

Total samples: 200K positive and 20M negative
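The labelling rule for training samples can be sketched as follows: a sub-window is positive when its best IoU with any annotated face exceeds 50%, and negative otherwise. The box format (x1, y1, x2, y2) and the example coordinates are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_window(window, face_boxes, threshold=0.5):
    """Return 1 (positive) if the window overlaps a face by more than the threshold."""
    best = max((iou(window, face) for face in face_boxes), default=0.0)
    return 1 if best > threshold else 0

# Hypothetical annotation and two sampled sub-windows.
faces = [(0, 0, 100, 100)]
near = label_window((10, 10, 110, 110), faces)
far = label_window((60, 60, 160, 160), faces)
```

With random sampling most windows fall on background, which is consistent with the large imbalance between negative and positive samples on the slide.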

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
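Rescaling 3 times per octave means consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. A small sketch, where the scale range from 0.5 to 2.0 is a hypothetical choice:

```python
def pyramid_scales(min_scale, max_scale, scales_per_octave=3):
    """Scale factors from min_scale up to max_scale, spaced by 2 ** (1 / scales_per_octave)."""
    scales = []
    k = 0
    while True:
        scale = min_scale * 2.0 ** (k / scales_per_octave)
        if scale > max_scale * (1 + 1e-9):  # small tolerance for rounding
            break
        scales.append(scale)
        k += 1
    return scales

# Two octaves (0.5 to 1.0 and 1.0 to 2.0) give 7 scales in total.
scales = pyramid_scales(0.5, 2.0)
```

Running a fixed-size detector on every rescaled copy is what lets it respond to faces that are larger or smaller than its input window.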

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015

This makes it possible to efficiently run the convnet on images of any size and to obtain a heat-map of the face detector

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
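A minimal sketch of greedy non-maximum suppression: keep the highest-scoring detection, discard the remaining ones that overlap it too much, and repeat. The IoU threshold of 0.3 and the example detections are assumptions, and this is a generic version rather than the implementation from the cited source.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(detections, iou_threshold=0.3):
    """detections is a list of (score, box); returns the detections that survive."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every lower-scoring box that overlaps the kept one too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept

# Hypothetical detections: two overlapping boxes on one face, one elsewhere.
detections = [(0.8, (1, 1, 11, 11)),
              (0.9, (0, 0, 10, 10)),
              (0.7, (50, 50, 60, 60))]
kept = non_maximum_suppression(detections)
```

A sliding-window detector fires on many neighbouring windows around each face, so this step is what reduces the heat-map to one box per face.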

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244

...by means of well chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
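Learning from triplets can be sketched with the usual triplet loss: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The margin value and the 2-D embeddings are assumptions for illustration, and triplet selection is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
easy_negative = [1.0, 0.0]
hard_negative = [0.2, 0.0]

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

Easy triplets give zero loss and therefore no learning signal, which is why the choice of triplets matters for training.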

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]

Page 44: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

44

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Objects Faster R-CNN

45

Slide credit Amaia Salvador

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 46

Detection Objects Reinforcement L

Caicedo Juan C and Svetlana Lazebnik Active object localization with deep reinforcement learning ICCV 2015 [Slides by Miriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences, and the Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
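A minimal sketch of the pieces named above: the greedy policy, an exploration variant, the Q-learning target and a replay memory. The discount factor, the memory capacity and all function names are assumptions for illustration; the Q-values would come from the network, which is not modeled here.

```python
import random

GAMMA = 0.9  # assumption: discount factor


def greedy_action(q_values):
    """Policy: pick the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def q_target(reward, next_q_values, terminal):
    """Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if terminal:
        return reward
    return reward + GAMMA * max(next_q_values)


replay_memory = []  # stores (state, action, reward, next_state, terminal)


def remember(transition, capacity=1000):
    """Append an experience and drop the oldest one when the memory is full."""
    replay_memory.append(transition)
    if len(replay_memory) > capacity:
        replay_memory.pop(0)
```

During training, mini-batches would be sampled from `replay_memory` and the network output for the taken action regressed towards `q_target`.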

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz, with 25k annotated faces on images downloaded from Flickr and 380k manually annotated facial landmarks.

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
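The labeling rule can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) tuples. The function names and the box format are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, face_boxes, threshold=0.5):
    """Return 1 (face) if the window overlaps any annotated face by more
    than the threshold, and 0 (non-face) otherwise."""
    return 1 if any(iou(window, f) > threshold for f in face_boxes) else 0


faces = [(10, 10, 60, 60)]
print(label_window((12, 12, 62, 62), faces))    # 1
print(label_window((50, 50, 100, 100), faces))  # 0
```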

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.

A sliding window of 227x227 is run over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
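A short sketch of the two test-time steps above. Three scales per octave means consecutive scale factors differ by 2^(1/3). The number of octaves, the stride of 32 pixels and the function names are assumptions for illustration.

```python
WINDOW = 227  # network input size from the slide


def pyramid_scales(octaves_down=2, octaves_up=1, per_octave=3):
    """Scale factors spaced 2**(1/per_octave) apart, from small to large."""
    lo = -octaves_down * per_octave
    hi = octaves_up * per_octave
    return [2.0 ** (k / per_octave) for k in range(lo, hi + 1)]


def sliding_windows(width, height, stride=32):
    """Yield the top-left corner of every full WINDOW x WINDOW crop."""
    for y in range(0, height - WINDOW + 1, stride):
        for x in range(0, width - WINDOW + 1, stride):
            yield x, y


scales = pyramid_scales()
print(len(scales))                          # 10
print(list(sliding_windows(227, 227)))      # [(0, 0)]
```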

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
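A toy illustration of the conversion, assuming a single fully connected unit that looks at a K x K patch. Reusing its weights as a K x K convolution kernel gives one response per position, that is, a heat-map. The function names and the tiny sizes are assumptions for illustration; real networks do this with many channels and units.

```python
def fc(patch, weights, bias):
    """Fully connected unit over a flattened K x K patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, k):
    """Apply the same weights as a K x K convolution (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    heat = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            patch = [image[y + dy][x:x + k] for dy in range(k)]
            row.append(fc(patch, weights, bias))
        heat.append(row)
    return heat


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
heat = fc_as_conv(image, weights=[1, 0, 0, 1], bias=0, k=2)
print(heat)  # [[7, 9, 11], [15, 17, 19]]
```

On an input of exactly K x K pixels the heat-map has a single value, equal to the output of the original fully connected unit.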

Non-Maximum Suppression (NMS) is applied to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
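A generic greedy NMS can be sketched as follows. This is an IoU-based variant written for illustration, not necessarily the variant used in the cited source or in DDFD; the threshold of 0.5 and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps
    an already kept box by more than the threshold. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```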

Detection Faces DDFD Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning) by means of well-chosen triplets, using curriculum learning.

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
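The triplet idea can be sketched with a hinge loss on squared Euclidean distances: the anchor should be closer to a face of the same identity (positive) than to a face of another identity (negative) by some margin. The loss form follows the cited FaceNet paper as far as I understand it; the margin of 0.2 and the function names are assumptions for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# An "easy" triplet contributes no loss, a "hard" one does.
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0]))  # 0.0
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # 0.2
```

Triplets with zero loss do not change the embedding, which is why the choice of triplets matters for training.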

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Visual query: "A dog". Expected outcome: images from the image database that contain any dog.

Visual query: "This dog". Expected outcome: images from the image database that contain that particular dog.

Instance Retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT) are quantized into a Bag of Visual Words:

- N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)
- The resulting representation is high-dimensional and highly sparse
- INVERTED FILE: for each visual word, the list of IDs of the images that contain it
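A minimal sketch of an inverted file, assuming each image has already been described by a list of visual word IDs. The ranking by number of shared words is a simplification for illustration (real systems usually weight words, for example with tf-idf), and all names and toy IDs are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted IDs of the images containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))


image_words = {1: [3, 7, 7], 2: [7, 9], 3: [1]}   # toy database
index = build_inverted_file(image_words)
print(index[7])                    # [1, 2]
print(candidates(index, [3, 7]))   # [1, 2]
```

Because the representation is sparse, only the images that share at least one word with the query are ever touched.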

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). "Neural codes for image retrieval." In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). "CNN features off-the-shelf: an astounding baseline for recognition." In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., & Lempitsky, V. (2015). "Aggregating local deep features for image retrieval." ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). "Particular object retrieval with integral max-pooling of CNN activations." ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). "Cross-dimensional Weighting for Aggregated Deep Convolutional Features." arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
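Sum and max pooling of convolutional activations can be sketched as follows: each of the C feature maps is collapsed to a single number, giving a C-dimensional global descriptor. The nested-list layout, the L2 normalization step and the function names are assumptions for illustration.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse a C x H x W activation volume into a C-dim vector."""
    reduce = sum if mode == "sum" else max
    return [reduce(v for row in fmap for v in row) for fmap in feature_maps]


def l2_normalize(vec):
    """Scale a vector to unit length so descriptors can be compared."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)


maps = [[[1, 2], [3, 4]],     # channel 0
        [[0, 5], [1, 0]]]     # channel 1
print(pool_conv_features(maps, "sum"))  # [10, 6]
print(pool_conv_features(maps, "max"))  # [4, 5]
```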

Ng, J., Yang, F., & Davis, L. (2015). "Exploiting local features from deep networks for image retrieval." In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Bag of Words pipeline: image at (336x256) resolution -> conv5_1 from VGG16 [1] -> feature maps of (42x32) -> assignment to 25K centroids -> 25K-D vector
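The assignment step can be sketched as follows: each local convolutional feature is replaced by the index of its nearest centroid (its visual word), and the histogram of words is the Bag of Words vector. The toy codebook of 3 centroids and the 2 x 2 grid stand in for the 25K centroids and the 42x32 locations on the slide; all names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def dist(c):
        return sum((f - v) ** 2 for f, v in zip(feature, c))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))


def assignment_map(feature_grid, centroids):
    """Replace every local feature in an H x W grid by its visual word."""
    return [[nearest_centroid(f, centroids) for f in row] for row in feature_grid]


def bow_vector(assignments, vocab_size):
    """Histogram of visual words, with one bin per centroid."""
    hist = [0] * vocab_size
    for row in assignments:
        for word in row:
            hist[word] += 1
    return hist


centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]   # toy codebook
grid = [[[0.1, 0.0], [0.9, 1.2]],
        [[4.8, 5.1], [1.1, 0.9]]]                  # toy 2 x 2 grid of features
words = assignment_map(grid, centroids)
print(words)                  # [[0, 1], [2, 1]]
print(bow_vector(words, 3))   # [1, 2, 1]
```

Keeping the assignment map, and not only the histogram, is what allows a descriptor to be built for any sub-region of the image.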

Query Representation

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales.

The same parameters are shared by the three convnets: $\theta_i = \theta_0$, where the parameters are the filter weights ($H_l$) and biases ($b_l$).

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation of the feature maps from the three scales.

Pixel-wise soft-max classifier.
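A pixel-wise soft-max classifier can be sketched as follows: every pixel has one score per class, the soft-max turns the scores into probabilities, and the pixel takes the most probable class. The nested-list layout and the function names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixel(scores):
    """Index of the class with the highest soft-max probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def label_map(score_map):
    """score_map[y][x] is a list of class scores; returns the H x W labels."""
    return [[classify_pixel(s) for s in row] for row in score_map]


print(softmax([0.0, 0.0]))                       # [0.5, 0.5]
print(label_map([[[2.0, 1.0], [0.0, 3.0]]]))     # [[0, 1]]
```

Each pixel is classified on its own here, so neighbouring pixels may receive inconsistent labels.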

Problem: no spatial consistency among labels.

Three explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Example: C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
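The selection rule can be sketched on a toy tree: walk from a leaf to the root and keep the node with the smallest cost. The tree, the cost values and the function name are hypothetical; they are chosen only so that C2 ends up covered by C5, as in the example above.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from leaf to root."""
    best = node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)  # None once the root has been visited
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.9, "C3": 0.4, "C4": 0.5, "C5": 0.3, "C6": 0.8, "C7": 0.7}

print(optimal_component("C2", parent, cost))  # C5
print(optimal_component("C1", parent, cost))  # C1
```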

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) is used to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A = [1 2]: the BBOX CNN produces feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image.

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Vector B = [1 2]: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN.

- Training: the 2 networks are trained in isolation
- Testing: results are combined

Vector C:

- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

The output of the penultimate fully connected layer is classified with an SVM.

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
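Pixel IU for one category can be sketched as the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The label maps below are toy data and the function name is hypothetical.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index between predicted and ground-truth pixels of a category."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_p, in_t = p == category, t == category
            inter += in_p and in_t
            union += in_p or in_t
    return inter / union if union else 0.0


pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [0, 0]]
print(pixel_iu(pred, truth, category=1))  # 0.5
```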


One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 45: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Faster R-CNN

Slide credit: Amaia Salvador

Detection Objects Reinforcement Learning

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Objects Recognition Retrieval

Image (336x256 resolution) → conv5_1 from VGG16 [1] (42x32 feature map) → 25K centroids → 25K-D vector
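
The step from local conv features to a sparse 25K-D vector can be sketched as a nearest-centroid assignment followed by a histogram. This is a toy sketch under assumptions: the helper names, the squared Euclidean distance and the three 2-D centroids are hypothetical stand-ins for the 25K centroids mentioned on the slide.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid (visual word) closest to one local feature."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda k: squared_distance(feature, centroids[k]))


def bag_of_words(local_features, centroids):
    """Sparse histogram: visual word index -> number of assigned local features."""
    return Counter(nearest_centroid(f, centroids) for f in local_features)


# Hypothetical codebook of 3 centroids and 4 local features.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
histogram = bag_of_words(toy_features, toy_centroids)
```

Only the words that actually occur are stored, which keeps the high-dimensional vector sparse.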

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max

Upsampling and concatenation

Pixel-wise soft-max classifier
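
A pixel-wise soft-max classifier turns the class scores at each pixel into probabilities and picks the most likely label. The sketch below assumes the scores are already available as a height x width x classes nested list; the function names and toy values are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over the class scores of one pixel."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Map an H x W x K score map to per-pixel probabilities and labels."""
    probs = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probs]
    return probs, labels


# Toy 2 x 2 image with 3 classes (hypothetical scores).
toy_scores = [[[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]],
              [[0.5, 0.5, 4.0], [1.0, 0.0, 0.0]]]
probs, labels = label_pixels(toy_scores)
```

Each pixel is labelled independently of its neighbours, which is why the resulting labels have no built-in spatial consistency.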

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network

Solution 2: Superpixels + CRF. Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
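
The optimal-cover rule can be sketched as a walk from each leaf up to the root that keeps the component with the lowest cost. The tree, the node names and the cost values below are hypothetical; in the method the costs S come from the classifier, which is not modelled here.

```python
def optimal_cover(parent, cost, leaves):
    """For every leaf, pick the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root maps to None, which ends the walk
        best[leaf] = best_node
    return best


# Hypothetical segmentation tree: pixels p1..p3, components C2, C3, root C5.
parent = {"p1": "C2", "p2": "C2", "p3": "C3",
          "C2": "C5", "C3": "C5", "C5": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "C2": 0.6, "C3": 0.7, "C5": 0.3}
cover = optimal_cover(parent, cost, ["p1", "p2", "p3"])
```

In this toy tree the pixels under C2 end up covered by C5, while p3 keeps its own finer observation level because it has the lowest cost on its path.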

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; both are concatenated as [1 2]

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
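
Masking the background with the mean image can be sketched as a per-pixel selection between the crop and a mean value. The function name, the single-channel toy crop and the scalar `mean_pixel` are assumptions for illustration; a real pipeline would use a per-pixel, per-channel mean image.

```python
def mask_background(crop, mask, mean_pixel):
    """Keep foreground pixels and replace background pixels with the mean value."""
    return [[pixel if inside else mean_pixel
             for pixel, inside in zip(crop_row, mask_row)]
            for crop_row, mask_row in zip(crop, mask)]


# Toy 2 x 2 crop; mask is 1 on the region foreground and 0 on the background.
toy_crop = [[10, 20], [30, 40]]
toy_mask = [[1, 0], [0, 1]]
masked = mask_background(toy_crop, toy_mask, mean_pixel=128)
```

After mean subtraction the masked pixels become zero, so the background contributes nothing to the network input.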

Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2; both are concatenated as [1 2]

Training: the 2 networks are trained in isolation

Testing: results are combined

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Penultimate fully connected layer → SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
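
Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch, with hypothetical label maps and function name:

```python
def pixel_iu(predicted, truth, category):
    """Intersection over union (Jaccard index) of one category at pixel level."""
    intersection = union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    return intersection / union if union else 0.0


# Toy 2 x 3 label maps with categories 0 (background), 1 and 2.
toy_pred = [[1, 1, 0], [0, 2, 2]]
toy_truth = [[1, 0, 0], [0, 2, 2]]
iu_per_category = {c: pixel_iu(toy_pred, toy_truth, c) for c in (1, 2)}
mean_iu = sum(iu_per_category.values()) / len(iu_per_category)
```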

One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects Reinforcement L

Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Míriam Bellver]

Object is localized based on visual features from AlexNet FC6

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions
- Trigger action: terminates the sequence of the current search and marks the region with inhibition-of-return (IoR)

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)
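
The state (o, h) can be sketched as the concatenation of the 4096-D feature vector with a one-hot encoding of recent actions. The split of the 90 dimensions into 10 past actions of 9 possible actions each is an assumption taken from the cited paper rather than from the slide, and the function names are hypothetical.

```python
NUM_ACTIONS = 9    # assumption: 8 transformation actions plus the trigger
HISTORY_LEN = 10   # assumption: 10 past actions, so 9 * 10 = 90 dimensions


def encode_history(past_actions):
    """One-hot encode the most recent actions into a binary vector of dim 90."""
    recent = past_actions[-HISTORY_LEN:]
    h = [0] * (NUM_ACTIONS * HISTORY_LEN)
    for slot, action in enumerate(recent):
        h[slot * NUM_ACTIONS + action] = 1
    return h


def build_state(o, past_actions):
    """State = CNN feature vector o followed by the action history h."""
    return list(o) + encode_history(past_actions)


# Placeholder feature vector and three past actions (hypothetical indices).
state = build_state([0.0] * 4096, [2, 5, 8])
```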

Reward function R (based on the ground-truth bounding box)

Reward function R for the trigger action

The reward function considers the number of steps as a cost

Values annotated on the slide: 3 and a minimum IoU of 0.6
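
The two rewards can be sketched as below. The formulas are an assumption based on the cited paper, since the slide shows them only as images: a transformation action earns +1 if it improves the IoU with the ground-truth box and -1 otherwise, and the trigger earns +3 if the final IoU reaches 0.6 and -3 otherwise. Boxes are (x1, y1, x2, y2) tuples and all names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, ground_truth):
    """+1 when a transformation action improves the IoU, -1 otherwise."""
    return 1.0 if iou(new_box, ground_truth) > iou(old_box, ground_truth) else -1.0


def trigger_reward(box, ground_truth, eta=3.0, tau=0.6):
    """+eta when the final box reaches the minimum IoU tau, -eta otherwise."""
    return eta if iou(box, ground_truth) >= tau else -eta


gt = (5, 0, 15, 10)
step = move_reward((0, 0, 10, 10), (5, 0, 15, 10), gt)
good_stop = trigger_reward((5, 0, 15, 10), gt)
bad_stop = trigger_reward((0, 0, 10, 10), gt)
```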

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

The action-value function is estimated using a neural network that has as many output units as actions

The algorithm incorporates a replay memory to collect experiences

Category-specific Q-network

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
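
The replay memory and the greedy policy can be sketched without any neural network. In this sketch the Q-values are a plain list standing in for the network outputs, and the class and function names are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch used to update the Q-network."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def greedy_action(q_values):
    """Index of the output unit (action) with the maximum estimated value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(f"s{step}", step, 0.0, f"s{step + 1}")
chosen = greedy_action([0.1, 0.7, 0.3])
```

Sampling random mini-batches from the memory breaks the correlation between consecutive search steps during training.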

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection Faces DDFD Train

Dataset (source: Annotated Facial Landmarks in the Wild, by TU Graz):

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Randomly sampled sub-windows (blocks): positive example if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative example otherwise

Total samples: 200K positive and 20M negative
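
The labelling rule for sampled sub-windows can be sketched with a small IoU helper. Boxes are (x1, y1, x2, y2) tuples; the function names and the toy coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_window(window, annotated_faces, threshold=0.5):
    """1 (positive) if the window overlaps some face with IoU above the threshold."""
    best = max((iou(window, face) for face in annotated_faces), default=0.0)
    return 1 if best > threshold else 0


faces = [(0, 0, 10, 10)]
positive = label_window((0, 0, 10, 8), faces)     # IoU = 0.8
negative = label_window((8, 8, 18, 18), faces)    # IoU = 4 / 196
```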

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
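
Three scales per octave means consecutive scale factors differ by a ratio of 2^(1/3), so every third step doubles the image size. A minimal sketch, where the scale range and the function name are assumptions for the example:

```python
import math


def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors 2^(k / steps_per_octave) lying in [min_scale, max_scale]."""
    scales = []
    k = math.ceil(steps_per_octave * math.log2(min_scale) - 1e-9)
    while 2 ** (k / steps_per_octave) <= max_scale * (1 + 1e-9):
        scales.append(2 ** (k / steps_per_octave))
        k += 1
    return scales


# One octave down and one octave up around the original resolution.
scales = pyramid_scales(0.5, 2.0)
```

With a fixed 227x227 detector window, each rescaled copy makes faces of a different original size match the window.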

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
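
The conversion can be illustrated with a single fully-connected unit whose weights are reshaped to the size of the training window and then slid over a larger input. This is a toy single-channel sketch with hypothetical names and values, not the DDFD network.

```python
def fc_as_conv(image, weights, bias, stride=1):
    """Apply an FC unit, reshaped as a kernel, at every window position."""
    kernel_h, kernel_w = len(weights), len(weights[0])
    heat_map = []
    for top in range(0, len(image) - kernel_h + 1, stride):
        row = []
        for left in range(0, len(image[0]) - kernel_w + 1, stride):
            response = bias + sum(weights[i][j] * image[top + i][left + j]
                                  for i in range(kernel_h)
                                  for j in range(kernel_w))
            row.append(response)
        heat_map.append(row)
    return heat_map


# FC unit trained on 2 x 2 windows (hypothetical weights).
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
single = fc_as_conv([[1, 2], [3, 4]], toy_weights, bias=0.5)
heat = fc_as_conv([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  toy_weights, bias=0.5)
```

On an input of the training size the output is a single value, exactly what the FC unit would produce; on a larger input the same weights yield one response per window position, which is the heat-map.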

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (Pyimagesearch, 2014)
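
A greedy form of NMS can be sketched as follows: detections are visited from highest to lowest score and one is kept only if it does not overlap too much with any detection already kept. The 0.3 IoU threshold and the function names are assumptions for illustration, not values from the slide or the cited post.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.3):
    """Keep the best-scoring box of every group of overlapping detections."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Two overlapping detections of one face and a separate one (toy values).
toy_detections = [(0.8, (1, 1, 11, 11)),
                  (0.9, (0, 0, 10, 10)),
                  (0.7, (20, 20, 30, 30))]
kept = non_maximum_suppression(toy_detections)
```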

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
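
Learning from triplets can be sketched with the triplet loss of the FaceNet paper: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The 2-D embeddings below are toy values, and the margin of 0.2 and the function names are assumptions for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor, positive = [0.0, 0.0], [0.1, 0.0]
easy_loss = triplet_loss(anchor, positive, negative=[1.0, 0.0])
hard_loss = triplet_loss(anchor, positive, negative=[0.2, 0.0])
```

The easy triplet gives zero loss and therefore no gradient, which is why the choice of triplets matters for training.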

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

Visual query "A dog" + image database → expected outcome

Visual query "This dog" + image database → expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
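
An inverted file maps each visual word to the images that contain it, so a query only touches images sharing at least one word with it. The sketch below uses a hypothetical toy database and ranks images by the number of shared words, a simplification of the weighted scoring a real system would use.

```python
from collections import Counter, defaultdict


def build_inverted_file(image_words):
    """Map every visual word to the set of image IDs in which it occurs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_by_shared_words(query_words, inverted):
    """Rank images by how many distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()


# Hypothetical database: image ID -> visual words found in that image.
toy_db = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
inverted_file = build_inverted_file(toy_db)
ranking = rank_by_shared_words([1, 2], inverted_file)
```

Because Bag of Visual Words vectors are highly sparse, each word's posting list stays short and lookups remain fast even for large databases.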


Page 47: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 47

Detection Objects Reinforcement LObject is localized based on visual features from AlexNet FC6

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 48

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Transformation actions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 49

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of actions A

Terminates the sequence of the current search

Marks the region inhibition-of-return (IoR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative
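The labelling rule above can be sketched in a few lines. This is only an illustration, not the DDFD code: the `(x1, y1, x2, y2)` box format and the function names `iou` and `label_window` are assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """Return 1 (positive) if the window overlaps any annotated face
    with IoU above the threshold, and 0 (negative) otherwise."""
    return int(any(iou(window, face) > threshold for face in faces))
```

With a threshold of 0.5, a window shifted by one pixel from a 10x10 face is still positive, while a window that only touches a corner of the face is negative.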

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
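Three scales per octave means that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch follows; the function name and the scale range are hypothetical, chosen only for the example.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from min_scale up to max_scale, spaced so that
    steps_per_octave consecutive steps multiply the scale by 2."""
    scales = []
    k = 0
    while True:
        s = min_scale * 2.0 ** (k / steps_per_octave)
        if s > max_scale * (1 + 1e-9):  # small tolerance for rounding
            break
        scales.append(s)
        k += 1
    return scales
```

For example, `pyramid_scales(0.5, 2.0)` covers two octaves (from half size to double size) with seven scales.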

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
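A toy sketch of why this conversion works: a fully-connected layer that reads a flattened k x k patch computes the same dot product as a k x k convolution kernel at one position, so sliding those kernels over a larger image yields one response per location, i.e. a heat-map. The single-channel input, stride 1 and the names `fc` and `fc_as_conv` are simplifying assumptions for the example, not the DDFD implementation.

```python
def fc(patch, weights, bias):
    """Fully-connected layer applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return [sum(w * v for w, v in zip(w_row, flat)) + b
            for w_row, b in zip(weights, bias)]


def fc_as_conv(image, weights, bias, k):
    """Reuse the FC weights as k x k kernels slid over a larger image
    (no padding, stride 1). Returns one output map per FC output unit."""
    h, w = len(image), len(image[0])
    maps = []
    for w_row, b in zip(weights, bias):
        out = []
        for y in range(h - k + 1):
            out_row = []
            for x in range(w - k + 1):
                acc = b
                for dy in range(k):
                    for dx in range(k):
                        acc += w_row[dy * k + dx] * image[y + dy][x + dx]
                out_row.append(acc)
            out.append(out_row)
        maps.append(out)
    return maps
```

Each entry of the output map equals the FC output on the patch at that position, but the whole map is produced in a single pass over the image.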

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
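A minimal sketch of greedy, IoU-based NMS: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap too much with one already kept. This is a generic variant written for illustration, not the code from the cited post; the 0.3 threshold and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps little with every kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```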

Detection Faces DDFD Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces → FaceNet → Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
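Learning an embedding from triplets can be sketched with the triplet loss: an anchor should be closer to a positive (same identity) than to a negative (different identity) by some margin. The margin value of 0.2 and the helper `semi_hard_negatives` are illustrative assumptions; the slides only state that triplets are "well chosen", so the selection rule shown here is one plausible reading rather than the exact procedure.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the
    positive by at least the margin, positive otherwise."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


def semi_hard_negatives(anchor, positive, candidates, margin=0.2):
    """Negatives farther than the positive but still inside the margin
    (an assumed selection rule, used here only as an example)."""
    d_ap = squared_distance(anchor, positive)
    return [n for n in candidates
            if d_ap < squared_distance(anchor, n) < d_ap + margin]
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters for training speed.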

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.

Image Database | Visual Query: “A dog” | Expected outcome

Image Database | Visual Query: “This dog” | Expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word | Image ID
1 | 1, 12
2 | 1, 30, 102
3 | 10, 12
4 | 2, 3, 6, 10
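Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. A minimal sketch, assuming the local features have already been quantized into integer word ids; the function names and the vote-counting ranking are hypothetical simplifications.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words maps image id -> iterable of visual word ids.
    Returns a dict mapping word id -> sorted list of image ids."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidate_images(index, query_words):
    """Rank images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=lambda i: (-votes[i], i))
```

Only the images listed under the query's words are touched at search time, which is what makes the structure scale to large databases.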

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
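Sum or max pooling collapses each convolutional feature map into a single number, so an image is described by one value per channel regardless of its size. A minimal sketch on nested lists; the `feature_maps[c][y][x]` layout and the final L2 normalization step are assumptions made for the example.

```python
def sum_pool(feature_maps):
    """One value per channel: the sum over all spatial positions."""
    return [sum(sum(row) for row in fmap) for fmap in feature_maps]


def max_pool(feature_maps):
    """One value per channel: the maximum over all spatial positions."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


def l2_normalize(vector):
    """Scale the vector to unit Euclidean norm (unchanged if all zeros)."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector] if norm > 0 else list(vector)
```

Two images can then be compared with a dot product of their normalized descriptors.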

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
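VLAD encoding can be sketched as follows: each local descriptor is assigned to its nearest centroid, the residuals (descriptor minus centroid) are summed per centroid, and the per-centroid sums are concatenated and L2-normalized. This is the textbook form of VLAD written for illustration; the cited paper may differ in normalization details, and the centroids are assumed to be given (for instance from k-means).

```python
def vlad(descriptors, centroids):
    """Encode a set of local descriptors as one vector of length
    len(centroids) * descriptor dimension."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Index of the centroid closest to descriptor d.
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, centroids[k])),
        )
        for j in range(dim):
            residuals[nearest][j] += d[j] - centroids[nearest][j]
    flat = [x for r in residuals for x in r]
    norm = sum(x * x for x in flat) ** 0.5
    return [x / norm for x in flat] if norm > 0 else flat
```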

Objects Recognition Retrieval

Image at (336x256) resolution → conv5_1 from VGG16 [1] → (42x32) feature map → 25K centroids → 25K-D vector

Objects Recognition Retrieval: Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh
Pooling: max

Upsampling and concatenation
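The upsampling and concatenation step can be sketched as follows: feature maps computed on coarser scales are brought back to the finest resolution and stacked along the channel axis, so every pixel gets a descriptor that mixes all scales. Nearest-neighbour upsampling, integer factors and the function names are assumptions made to keep the example short.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D map by an integer factor."""
    return [[fmap[y // factor][x // factor]
             for x in range(len(fmap[0]) * factor)]
            for y in range(len(fmap) * factor)]


def concat_scales(maps_per_scale, factors):
    """maps_per_scale[s] is the list of feature maps of scale s and
    factors[s] its upsampling factor. Returns all maps at the finest
    resolution, stacked along the channel axis."""
    stacked = []
    for maps, factor in zip(maps_per_scale, factors):
        stacked.extend(upsample_nearest(m, factor) for m in maps)
    return stacked
```

With a pyramid of three scales at full, half and quarter resolution, the factors would be 1, 2 and 4.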

Objects Segmentation Farabet

Pixel-wise soft-max classifier
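A pixel-wise soft-max classifier turns the per-class scores at each pixel into probabilities and labels the pixel with the most probable class. A minimal sketch, assuming the scores are stored as `score_maps[c][y][x]` (class, row, column); that layout and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_maps):
    """Return labels[y][x], the index of the most probable class
    at each pixel."""
    height, width = len(score_maps[0]), len(score_maps[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            probs = softmax([m[y][x] for m in score_maps])
            row.append(max(range(len(probs)), key=probs.__getitem__))
        labels.append(row)
    return labels
```

Each pixel is labelled independently of its neighbours, which is the source of the spatial inconsistency discussed next.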

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
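The optimal-cover rule can be sketched on a small tree: for every leaf, walk up to the root and keep the component with the lowest cost, then label the leaf with that component's class. The tree, costs and class names below are made up for illustration; only the selection rule comes from the slide.

```python
def optimal_cover(leaves, parent, cost, label):
    """For each leaf, pick the component with minimal cost on the path
    from the leaf to the root. `parent` maps a node to its parent (the
    root has no entry). Returns {leaf: (component, class of component)}."""
    result = {}
    for leaf in leaves:
        best = node = leaf
        while node in parent:
            node = parent[node]
            if cost[node] < cost[best]:
                best = node
        result[leaf] = (best, label[best])
    return result


# Hypothetical tree: C2 and C3 merge into C5; C1 and C5 merge into C6.
parent = {"C1": "C6", "C2": "C5", "C3": "C5", "C5": "C6"}
cost = {"C1": 0.2, "C2": 0.5, "C3": 0.3, "C5": 0.1, "C6": 0.4}
label = {"C1": "sky", "C2": "grass", "C3": "road", "C5": "road", "C6": "sky"}
cover = optimal_cover(["C1", "C2", "C3"], parent, cost, label)
```

In this toy tree C5 has the lowest cost on the path from C2 to the root, so C2 takes the class of C5, mirroring the example on the slide.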

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. “Simultaneous Detection and Segmentation” (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image. The two are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: a BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2, concatenated as [1 2].

Training: the 2 networks are trained in isolation
Testing: results are combined

Vector C:

Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

penultimate fully connected layer → SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
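Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on label maps stored as nested lists; the function name and the layout are assumptions for the example.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal
    size. Returns NaN if the category appears in neither map."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else float("nan")
```

Averaging this value over all categories gives a single score for a segmentation.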

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 48: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A:

- Transformation actions
- Terminates the sequence of the current search
- Marks the region with inhibition-of-return (IoR)

Detection Objects Reinforcement

Set of states S: (o, h)

o = feature vector from pre-trained CNN (fc6, 4096 dim)

h = history of taken actions (binary vector, dim 90)

Detection Objects Reinforcement

Reward Function R, measured against the ground-truth bounding box

Reward Function R for the trigger action

The reward function considers the number of steps as a cost

3

minimum IoU: 0.6
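One way to read these reward slides is sketched below. The formulas themselves did not survive in the slide text, so this is an assumed formulation: a movement is rewarded by the sign of the change in IoU with the ground-truth box, and the trigger is rewarded with +3 if the final IoU reaches 0.6 and with -3 otherwise. Treating the annotated "3" as the reward magnitude `eta` and "0.6" as the threshold `tau` is an interpretation, and all names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def movement_reward(box_before, box_after, target):
    """+1 if the move improved the IoU with the target, -1 if it made
    it worse, 0 if unchanged (assumed form)."""
    diff = iou(box_after, target) - iou(box_before, target)
    return (diff > 0) - (diff < 0)


def trigger_reward(box, target, eta=3.0, tau=0.6):
    """+eta if the box covers the target with IoU >= tau, else -eta
    (assumed form; eta and tau read from the slide annotations)."""
    return eta if iou(box, target) >= tau else -eta
```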

Detection Objects Reinforcement

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
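The pieces named on this slide can be sketched without any deep learning library: the greedy policy picks the action with the largest Q-value, the Q-learning target bootstraps from the best next action, and a replay memory stores past experiences for training. The Q-network itself is left out (its outputs are passed in as plain lists), and the discount factor, capacity and class and function names are illustrative assumptions.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the maximum estimated value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_target(reward, next_q_values, terminal, gamma=0.9):
    """Q-learning target: the reward alone for a terminal step,
    otherwise reward + gamma * max over next actions of Q(s', a')."""
    return reward if terminal else reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, terminal)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        """Random mini-batch, without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

During training, mini-batches sampled from the memory would be used to regress the network output for the taken action towards `q_target`.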


Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Set of actions A

The trigger action terminates the sequence of the current search and marks the region with inhibition-of-return (IoR).

Set of states S: each state is a pair (o, h)

o = feature vector from a pre-trained CNN (fc6 layer, 4096 dimensions)

h = history of taken actions (binary vector, 90 dimensions)

Reward function R: defined with respect to the ground-truth bounding box.

Reward function R for the trigger action

The reward function considers the number of steps as a cost.

Values annotated on the slide: 3; minimum IoU 0.6.
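The reward formulas on these slides were images and are not preserved in this transcript. The following is a minimal sketch under an assumed formulation: a movement action is rewarded by the sign of the change in IoU with the ground-truth box, and the trigger action earns +3 when the final IoU reaches 0.6 and -3 otherwise. Only the values 3 and 0.6 come from the slide; the function names, the sign-based form and the box format are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def move_reward(old_box, new_box, gt_box):
    """Assumed form: +1 if the move improved the IoU, -1 otherwise.

    A step that does not improve the overlap is penalized, which is one way
    of treating the number of steps as a cost.
    """
    delta = iou(new_box, gt_box) - iou(old_box, gt_box)
    return 1.0 if delta > 0 else -1.0


def trigger_reward(box, gt_box, eta=3.0, tau=0.6):
    """Assumed form: +eta if the final IoU reaches tau, -eta otherwise."""
    return eta if iou(box, gt_box) >= tau else -eta
```

With this shape, the agent is only paid the large terminal reward when it stops on a box that overlaps the object well enough.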

Policy function

If the current state is S, which should be the next action A?

Reinforcement learning using Q-learning.

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
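As a rough sketch of the three pieces named above (a Q-function with one output per action, a replay memory, and a greedy policy), assuming a single linear layer as a stand-in for the Q-network. The number of actions (9), the discount factor and all names are illustrative assumptions, not values taken from the slides.

```python
import random
from collections import deque

import numpy as np

NUM_ACTIONS = 9          # assumption: one output unit per action
STATE_DIM = 4096 + 90    # o (fc6 features) concatenated with h (action history)


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        pool = list(self.buffer)
        return random.sample(pool, min(batch_size, len(pool)))


def q_values(weights, state):
    """Stand-in for the Q-network: returns one value per action."""
    return weights @ state


def greedy_action(weights, state):
    """Policy: pick the action with the maximum estimated value."""
    return int(np.argmax(q_values(weights, state)))


def q_target(weights, reward, next_state, done, gamma=0.9):
    """Q-learning regression target for one stored experience."""
    if done:
        return reward
    return reward + gamma * float(np.max(q_values(weights, next_state)))
```

During training, experiences sampled from the memory would be used to regress the selected output unit towards `q_target`.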

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All attended regions (AAR)
2) Terminal regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Sub-windows (blocks) are randomly sampled. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
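Three scales per octave means consecutive scales differ by a factor of 2^(1/3). A small sketch of how such a list of scale factors could be generated; the function name and the scale range are hypothetical and not taken from the DDFD paper.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from max_scale down to min_scale, spaced by 2**(1/steps_per_octave)."""
    if min_scale <= 0 or max_scale < min_scale:
        raise ValueError("expected 0 < min_scale <= max_scale")
    scales = []
    k = 0
    while True:
        scale = max_scale * 2.0 ** (-k / steps_per_octave)
        if scale < min_scale - 1e-9:
            break
        scales.append(scale)
        k += 1
    return scales


# Two octaves (from 2x up-scaling to 0.5x down-scaling) give 2 * 3 + 1 = 7 scales.
scales = pyramid_scales(0.5, 2.0)
```

The detector is then run once per rescaled image, so a fixed-size window covers faces of different sizes.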

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
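The equivalence behind this conversion can be checked numerically: a fully-connected layer trained on a fixed-size window is a convolution whose kernel covers that whole window, so sliding it over a larger feature map yields one response per position, i.e. a heat-map. The sketch below uses plain NumPy with explicit loops for clarity; the shapes and names are illustrative assumptions.

```python
import numpy as np


def fc_as_conv_heatmap(feature_map, fc_weights, fc_bias, window):
    """Apply an FC layer at every window position of a (C, H, W) feature map.

    fc_weights has shape (num_outputs, C * window * window), i.e. the layer
    was defined on a flattened (C, window, window) input.
    """
    channels, height, width = feature_map.shape
    # Reinterpret each FC row as a (C, window, window) convolution kernel.
    kernels = fc_weights.reshape(-1, channels, window, window)
    out_h, out_w = height - window + 1, width - window + 1
    heatmap = np.empty((kernels.shape[0], out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = feature_map[:, y:y + window, x:x + window]
            heatmap[:, y, x] = np.tensordot(kernels, patch, axes=3) + fc_bias
    return heatmap
```

A real implementation would use a framework convolution instead of Python loops; the point here is only that the FC weights can be reused unchanged.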

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
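A minimal sketch of greedy NMS, assuming boxes in (x1, y1, x2, y2) format with positive area and an IoU threshold of 0.5. The threshold and the function name are assumptions, and this is not the code from the cited tutorial.

```python
import numpy as np


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if len(boxes) == 0:
        return []
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]  # best detection first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Overlap of the best box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best one too much.
        order = rest[iou <= iou_threshold]
    return keep
```

Each iteration keeps the highest-scoring remaining detection and discards its heavily overlapping neighbours.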

Detection Faces DDFD Results

Precision vs. recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example labels "person", "bag", "me", "my bag"]

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces -> FaceNet -> Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
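Learning from triplets is usually written as a triplet loss: the anchor should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. A minimal NumPy sketch of that loss on already-computed embeddings follows; the margin value of 0.2 and the function names are assumptions for illustration, and triplet selection is not shown.

```python
import numpy as np


def l2_normalize(embeddings):
    """Project embeddings onto the unit hypersphere."""
    return embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin), per triplet."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, pos_dist - neg_dist + margin)
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters for training.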

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Visual query "A dog" -> image database -> expected outcome

Visual query "This dog" -> image database -> expected outcome

Instance retrieval (instance: object, building, person, place...)

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space

Bag of Visual Words: v1 = (v11, ..., v1n), ..., vk = (vk1, ..., vkn)

High-dimensional, highly sparse

INVERTED FILE: table that maps each visual word to the IDs of the images that contain it
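Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each word, the images in which it occurs. A small sketch with a toy database; the image IDs, word IDs and the vote-count ranking are illustrative assumptions, not the scoring used in the paper.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict mapping image ID -> iterable of visual word IDs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_candidates(inverted, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for determinism.
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


database = {1: [1, 2], 10: [3, 4], 12: [1, 3]}
index = build_inverted_file(database)
ranking = rank_candidates(index, [1, 3])
```

Only images that share at least one word with the query are ever touched, which is what makes the search scale.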

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
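Sum or max pooling collapses a (C, H, W) convolutional feature map into one C-dimensional global descriptor by aggregating over all spatial positions. A minimal sketch; the final L2 normalization and the function name are assumptions, and the cited papers add further steps (weighting, whitening, regions) that are not shown.

```python
import numpy as np


def pooled_descriptor(conv_features, mode="sum"):
    """Pool a (C, H, W) feature map over space into an L2-normalized C-dim vector."""
    flat = conv_features.reshape(conv_features.shape[0], -1)
    if mode == "sum":
        pooled = flat.sum(axis=1)
    elif mode == "max":
        pooled = flat.max(axis=1)
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```

Two images can then be compared with a dot product between their descriptors.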

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Image resolution (336x256) -> conv5_1 from VGG16 [1] (42x32) -> 25K centroids -> 25K-D vector

Query representation: Global Search (GS) and Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example labels "person", "bag", "me", "my bag"]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters are used in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation

Pixel-wise soft-max classifier
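The last two steps can be sketched as follows: the feature maps from the three scales are upsampled to a common resolution, concatenated along the channel axis, and classified independently at each pixel with a soft-max. Nearest-neighbour upsampling, a factor of 2 between scales and a single linear classifier are assumptions made to keep the example short.

```python
import numpy as np


def upsample_nearest(features, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)


def pixelwise_softmax(scores):
    """Soft-max over the class axis of a (K, H, W) score map."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def label_pixels(pyramid_features, classifier_weights):
    """pyramid_features: list of (C, H / 2**s, W / 2**s) maps for s = 0, 1, 2.

    classifier_weights: (num_classes, 3 * C) linear classifier (assumed).
    Returns per-pixel class probabilities of shape (num_classes, H, W).
    """
    upsampled = [upsample_nearest(f, 2 ** s) for s, f in enumerate(pyramid_features)]
    stacked = np.concatenate(upsampled, axis=0)  # (3 * C, H, W)
    scores = np.tensordot(classifier_weights, stacked, axes=1)
    return pixelwise_softmax(scores)
```

Since each pixel is classified on its own, nothing in this step enforces that neighbouring pixels agree.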

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Binary Partition Tree (BPT) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Example: C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
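The selection rule can be sketched as a walk from each leaf up to the root, keeping the component with the lowest cost seen on the way. The tree, the component names and the cost values below are hypothetical and only chosen so that C2 ends up covered by C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the minimal-cost component on its path to the root.

    parent: dict mapping node -> parent node (the root maps to None).
    cost:   dict mapping node -> cost S of that component.
    """
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent[node]
        best[leaf] = choice
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.9, "C2": 0.8, "C3": 0.2, "C4": 0.7,
        "C5": 0.1, "C6": 0.5, "C7": 0.6}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component.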

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: both feature vectors are extracted with the BBOX CNN. Feature vector 1 comes from the bounding box and feature vector 2 from the region, with the background masked out with the mean image. The two are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
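Masking the background with the mean image can be written as a per-pixel selection between the crop and the mean image. A minimal sketch, assuming the crop, the mean image and the region mask have already been warped to the same spatial size; the function name is hypothetical.

```python
import numpy as np


def mask_background(crop, region_mask, mean_image):
    """Keep crop pixels inside the region and mean-image pixels outside it.

    crop, mean_image: (H, W, 3) arrays; region_mask: (H, W) boolean array.
    """
    return np.where(region_mask[..., None], crop, mean_image)
```

After mean subtraction, the masked pixels become zero, so the network input carries no background signal.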

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN. The two are concatenated as [1 2].

Training: the 2 networks are trained in isolation.

Testing: results are combined.

Vector C

Training: as a whole (using segmentation overlap).

Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
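Pixel IU for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch on integer label maps; the function name and the choice to skip classes absent from both maps are assumptions.

```python
import numpy as np


def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union between two integer label maps."""
    result = {}
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both maps
        result[c] = np.logical_and(pred_c, gt_c).sum() / union
    return result


pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
scores = pixel_iu(pred, gt, num_classes=3)
```

Averaging the per-class values gives a single mean IU figure for the labeling.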


One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example labels "person", "bag", "me", "my bag"]

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 50: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 50

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Set of states S

(oh)

o = feature vector from pre-trained CNN fc6 4096 dim

h = history of taken actions binary vector dim 90

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 51

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function Rground-truthbounding box

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, Salvador, A., McGuinness, K., Giró-i-Nieto, X., O’Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.

Objects Recognition Retrieval

Image Database
Visual Query: “A dog”
Expected outcome

Objects Recognition Retrieval

Image Database
Visual Query: “This dog”
Expected outcome

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Bag of Visual Words: high-dimensional, highly sparse
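Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images in which it occurs. The following is a minimal sketch of that data structure; the toy database and the function names `build_inverted_file` and `candidates` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Count, per image, how many distinct query words it shares with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return dict(votes)


# Hypothetical toy database: image ID -> visual words found in that image.
database = {1: [1, 2], 10: [3, 4], 12: [1, 3], 30: [2]}
index = build_inverted_file(database)
votes = candidates(index, [1, 3])
```

Only images that share at least one word with the query are ever touched, which is what makes the search scale to large databases.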

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max-pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
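Sum and max pooling of convolutional features can be sketched as follows. This is a simplified illustration on nested Python lists, not the exact pipeline of any of the cited papers (which add steps such as weighting or whitening); the toy feature map is made up.

```python
def pool_conv_features(feature_map, mode="sum"):
    """Collapse an H x W x C feature map (nested lists) into one C-dim vector."""
    locations = [vec for row in feature_map for vec in row]
    channels = list(zip(*locations))  # one tuple of activations per channel
    op = sum if mode == "sum" else max
    return [op(channel) for channel in channels]


def l2_normalize(vec):
    """Scale a vector to unit length so descriptors compare by dot product."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Toy 2 x 2 map with 3 channels (illustrative values only).
fmap = [[[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
        [[2.0, 1.0, 0.0], [1.0, 0.0, 1.0]]]
sum_desc = pool_conv_features(fmap, "sum")
max_desc = pool_conv_features(fmap, "max")
```

Either way the image ends up as a single vector with one entry per channel, regardless of the input resolution.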

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

Resolution (336x256)

conv5_1 from VGG16 [1] (42x32)

25K centroids, 25K-D vector
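The step from local conv features to a bag-of-words vector can be sketched as a nearest-centroid assignment followed by a histogram with one bin per centroid. The sketch below uses 3 centroids and 2-D features in place of the 25K centroids and conv5_1 activations mentioned above; all values are illustrative assumptions.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to `feature` (squared Euclidean distance)."""
    def dist(c):
        return sum((f - x) ** 2 for f, x in zip(feature, c))
    return min(range(len(centroids)), key=lambda k: dist(centroids[k]))


def bag_of_words(local_features, centroids):
    """Histogram with one bin per centroid, counting the features assigned to it."""
    counts = Counter(nearest_centroid(f, centroids) for f in local_features)
    return [counts.get(k, 0) for k in range(len(centroids))]


# Toy example: 3 centroids and 2-D local features.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
histogram = bag_of_words(features, centroids)
```

With 25K centroids and roughly a thousand local features per image, most bins stay empty, so the resulting vector is sparse and fits an inverted file.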

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets:

theta_i = theta_0 = (filter weights H_l and biases b_l)

Non-linearity: tanh
Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier
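A pixel-wise soft-max classifier applies the same soft-max independently at every pixel of the score map. The sketch below assumes the network has already produced one score per class and pixel; the toy scores are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Apply the soft-max at every pixel and keep the most likely class."""
    probs = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probs]
    return probs, labels


# Toy 1 x 2 image with 3 class scores per pixel (illustrative values only).
probs, labels = label_pixels([[[2.0, 0.5, 0.1], [0.0, 0.0, 3.0]]])
```

Since each pixel is classified on its own, nothing in this step forces neighbouring pixels to agree on a label.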

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S

C2 will be labelled with the class of C5
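The optimal-cover rule can be sketched as a walk from each leaf up to the root, keeping the node with the lowest cost. The tree, the node names, and the cost values below are hypothetical; they are only arranged so that C2 ends up covered by C5, as in the example above.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the minimal-cost node on the path from leaf to root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root has no parent entry
        best[leaf] = best_node
    return best


# Hypothetical tree: C1 and C2 are leaves, C5 merges them, C7 is the root.
parent = {"C1": "C5", "C2": "C5", "C5": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C5": 0.3, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2"])
```

Here C1 keeps its own label because no ancestor has a lower cost, while C2 inherits the class of C5.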

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Vector A

BBOX CNN → feature vector 1
BBOX CNN (background masked out with the mean image) → feature vector 2
Concatenation: [1 2]

Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
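Masking out the background can be sketched as replacing every pixel outside the region with the mean value before the crop is fed to the network. For brevity the sketch uses a single scalar mean on a grayscale crop, which is a simplification of the per-pixel mean image; the values and the function name `mask_background` are illustrative assumptions.

```python
def mask_background(crop, mask, mean_value):
    """Replace every background pixel (mask == 0) of a crop with the mean value."""
    return [[pix if keep else mean_value for pix, keep in zip(row, mrow)]
            for row, mrow in zip(crop, mask)]


# Toy 2 x 3 grayscale crop; 1 marks region foreground, 0 marks background.
crop = [[10, 200, 30], [40, 50, 250]]
mask = [[1, 0, 1], [1, 1, 0]]
masked = mask_background(crop, mask, mean_value=128)
```

After mean subtraction the masked pixels become zero, so only the region foreground contributes to the second feature vector.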

Objects Segmentation SDS

Vector B

BBOX CNN → feature vector 1
REGION CNN → feature vector 2
Concatenation: [1 2]

Training: 2 networks trained in isolation
Testing: results are combined

Objects Segmentation SDS

Vector C

Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

penultimate fully connected layer → SVM

Objects Segmentation SDS

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
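The pixel IU (Jaccard index) for one category is the number of pixels labelled with that category in both maps divided by the number labelled with it in either map. A minimal sketch, with made-up label maps:

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labelled `category` in two maps."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else 0.0


# Toy 2 x 3 label maps: 1 is the category of interest, 0 is background.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
iu = pixel_iu(pred, truth, category=1)
```

In this toy case two pixels agree out of four that either map marks, giving an IU of 0.5.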

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Reward function R [figure labels: ground-truth, bounding box]

Detection Objects Reinforcement

Reward function R for the trigger action [figure annotations: 3, minimum IoU 0.6]

The reward function considers the number of steps as a cost
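One way to read the trigger reward is sketched below. It assumes that the value 3 is the reward magnitude and that 0.6 is the IoU the final box must reach with the ground truth; the slide's equation did not survive extraction, so both readings, the names `eta` and `tau`, and the toy boxes are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def trigger_reward(box, ground_truth, eta=3.0, tau=0.6):
    """+eta if the final box overlaps the ground truth by at least tau, else -eta."""
    return eta if box_iou(box, ground_truth) >= tau else -eta


gt = (0, 0, 10, 10)
good = trigger_reward((0, 0, 10, 8), gt)   # IoU = 0.8, above the minimum
bad = trigger_reward((5, 5, 15, 15), gt)   # IoU = 25 / 175, below the minimum
```

Under this reading the agent is rewarded for stopping only when the box it has reached localizes the object well enough.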

Detection Objects Reinforcement

Policy function: if the current state is S, which should be the next action A?

Reinforcement learning using Q-learning

Detection Objects Reinforcement

The action-value function is estimated using a neural network that has as many output units as actions. The algorithm incorporates a replay memory to collect experiences. The Q-network is category-specific.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function
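The greedy policy, the Q-learning target, and the replay memory can be sketched in a few lines. The Q-values are given directly as a list here in place of a network's outputs; the discount factor 0.9, the buffer size, and the toy experience tuple are assumptions made for illustration.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Index of the action with the highest estimated action value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Q-learning target: reward plus discounted best value of the next state."""
    return reward if terminal else reward + gamma * max(next_q_values)


# Replay memory: a bounded buffer of (state, action, reward, next_state) tuples.
memory = deque(maxlen=1000)
memory.append(("s0", 2, 1.0, "s1"))
batch = random.sample(list(memory), k=1)

action = greedy_action([0.1, -0.4, 0.7])
target = q_target(1.0, [0.5, 0.2, 0.0])
```

Sampling training batches from the memory, instead of using only the latest experience, reduces the correlation between consecutive updates.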

Detection Objects Reinforcement

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Detection Objects Reinforcement

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. “Multi-view Face Detection Using Deep Convolutional Neural Networks.” ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW) by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative examples otherwise

Total samples: 200K positive and 20M negative
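The labeling rule for sampled sub-windows can be sketched as an IoU test against every annotated face. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the toy coordinates are made up for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """1 if the window overlaps any annotated face by more than `threshold`, else 0."""
    return int(any(iou(window, face) > threshold for face in faces))


faces = [(20, 20, 60, 60)]
pos = label_window((25, 25, 65, 65), faces)  # IoU = 1225 / 1975, positive
neg = label_window((50, 50, 90, 90), faces)  # IoU = 100 / 3100, negative
```

Most random windows barely touch a face, which is consistent with negatives outnumbering positives by a large factor.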

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
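Three scales per octave means that consecutive scale factors differ by a ratio of 2^(1/3), so the image size doubles or halves every three steps. The sketch below lists such factors; the range of one octave up and one octave down is an assumption, since the slide does not state how far the pyramid extends.

```python
def pyramid_scales(octaves_down=1, octaves_up=1, steps_per_octave=3):
    """Geometrically spaced scale factors, `steps_per_octave` per factor of two."""
    start = -octaves_down * steps_per_octave
    stop = octaves_up * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(start, stop + 1)]


scales = pyramid_scales()
```

Running a fixed-size detector on each rescaled copy lets it respond to faces that are larger or smaller than its input window in the original image.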

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015.

Detection Faces DDFD Test

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
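Non-maximum suppression can be sketched as a greedy loop: keep the highest-scoring detection, discard the detections that overlap it too much, and repeat with what is left. This is a generic greedy variant, not necessarily the exact procedure used by DDFD; the overlap threshold 0.3 and the toy detections are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, overlap=0.3):
    """Greedy NMS over (box, score) pairs, highest score first."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= overlap]
    return kept


detections = [((10, 10, 50, 50), 0.9),
              ((12, 12, 52, 52), 0.8),
              ((100, 100, 140, 140), 0.7)]
kept = non_maximum_suppression(detections)
```

The second detection is suppressed because it overlaps the first one heavily, while the distant third detection survives.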

Detection Faces DDFD Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation


Page 52: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 52

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Reward Function R for trigger action

The Reward function considers the number of steps as a cost

3

minimum IoU06

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 53

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Policy function

If the current state is S which should be the next action A

Reinforcement Learning using a Q-learning

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 54

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

The action-value function is estimated using a neural network that

has as many output units as actions the algorithm incorporates a replay-memory to collect experiences category-specific Q-network

Policy of the agent selection action A with maximum estimated value of the learnt action-value function

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection: Faces

Detection: Faces: DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR 2015. [software]

Detection: Faces: DDFD: Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz. 25k annotated faces on images downloaded from Flickr, with 380k manually annotated facial landmarks.

Sub-windows (blocks) are sampled randomly. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
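The labeling rule above can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function names and the toy boxes are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_window(window, faces, threshold=0.5):
    """Return 1 (positive) if the window has IoU > threshold with any face, else 0."""
    return int(any(iou(window, face) > threshold for face in faces))


faces = [(10, 10, 110, 110)]                            # one annotated face
positive = label_window((20, 20, 120, 120), faces)      # large overlap
negative = label_window((100, 100, 200, 200), faces)    # tiny overlap
```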

Detection: Faces: DDFD: Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
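"Three times per octave" means three rescalings per factor-of-two change in size, so consecutive scales differ by a factor of 2^(1/3). A small sketch follows; the scale range 0.5 to 2.0 is an arbitrary choice for illustration, not taken from the slides.

```python
import math


def pyramid_scales(min_scale, max_scale, per_octave=3):
    """Scale factors 2**(k / per_octave) that fall within [min_scale, max_scale]."""
    k_min = math.ceil(per_octave * math.log2(min_scale))
    k_max = math.floor(per_octave * math.log2(max_scale))
    return [2.0 ** (k / per_octave) for k in range(k_min, k_max + 1)]


scales = pyramid_scales(0.5, 2.0)   # one octave down and one octave up
```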

A sliding window of 227x227 pixels is moved over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
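A sketch of the window enumeration. The 227x227 size comes from the slide; the stride of 32 pixels is an assumption made here for illustration.

```python
def sliding_windows(width, height, size=227, stride=32):
    """Yield (x1, y1, x2, y2) for a size x size window moved over the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)


windows = list(sliding_windows(291, 259))   # small made-up image size
```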

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
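A toy illustration of why the conversion works: a fully-connected unit trained on k x k patches is the same computation as a k x k convolution kernel. Applying that kernel at every position of a larger image yields one score per location, which is the heat-map. The sizes and weights below are made up, and `fc` and `fc_as_conv` are hypothetical helper names.

```python
def fc(patch, weights, bias):
    """Fully-connected unit applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def fc_as_conv(image, weights, bias, k):
    """Same unit reshaped into a k x k kernel and slid over the image (stride 1)."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(kernel[r][c] * image[y + r][x + c]
                 for r in range(k) for c in range(k)) + bias
             for x in range(out_w)]
            for y in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
weights, bias = [1, 2, 0, -1], 0.5

heat_map = fc_as_conv(image, weights, bias, k=2)      # 2 x 2 map of scores
top_left = fc([[1, 2], [4, 5]], weights, bias)        # same unit on one patch
```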

Non-Maximum Suppression (NMS) is applied to avoid overlapping detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
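A minimal greedy NMS sketch based on IoU. It is a generic variant, not necessarily the implementation in the cited post, and the 0.3 overlap threshold is an assumption.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms(detections, iou_threshold=0.3):
    """Greedy NMS over a list of (score, box); keeps the best box of each cluster."""
    kept = []
    # Visit detections from highest to lowest score.
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (10, 10, 110, 110)),
    (0.8, (20, 20, 120, 120)),      # overlaps the first detection heavily
    (0.7, (200, 200, 300, 300)),    # separate face
]
kept = nms(detections)
```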

Detection: Faces: DDFD: Results

Precision vs. recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.


Faces: Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

The embedding is learned by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
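For reference, a sketch of the triplet loss for a single (anchor, positive, negative) triplet of embeddings. The form max(0, ||a - p||^2 - ||a - n||^2 + margin) and the margin of 0.2 follow the FaceNet paper rather than these slides; the 2-D embeddings are toy values.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Easy triplet: the positive coincides with the anchor, the negative is far away.
easy = triplet_loss([0.0, 1.0], [0.0, 1.0], [1.0, 0.0])
# Violating triplet: the negative is closer to the anchor than the positive.
hard = triplet_loss([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

Only triplets with a non-zero loss contribute gradient, which is why the choice of triplets matters.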

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces: Recognition: FaceNet: Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces: Recognition: FaceNet: Software

Software implementation: OpenFace

Faces: Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]

Objects: Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image database + visual query "A dog": expected outcome

Image database + visual query "This dog": expected outcome

Instance retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)
vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE (columns: word, image IDs)
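A sketch of the inverted-file idea: because bag-of-visual-words vectors are highly sparse, each visual word can store the IDs of the images that contain it, and a query only visits the lists of its own words. The toy data and function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted IDs of the images that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def candidates(inverted, query_words):
    """IDs of the images sharing at least one visual word with the query."""
    found = set()
    for word in query_words:
        found.update(inverted.get(word, []))
    return sorted(found)


image_words = {1: [1, 2], 10: [3, 6], 12: [1, 3]}   # image ID -> visual words
inverted = build_inverted_file(image_words)
shortlist = candidates(inverted, [3])
```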

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
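Sum and max pooling both collapse each H x W feature map of a convolutional layer into a single number, giving one descriptor entry per channel. A toy sketch with two 2 x 2 channels follows; the values are made up, and the normalisation and weighting steps of the cited methods are omitted.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse C feature maps of size H x W into one C-dimensional descriptor."""
    reduce_fn = {"sum": sum, "max": max}[mode]
    return [reduce_fn(v for row in fmap for v in row) for fmap in feature_maps]


features = [
    [[0.0, 1.0], [2.0, 3.0]],   # channel 0
    [[4.0, 0.0], [0.0, 1.0]],   # channel 1
]
sum_descriptor = pool_conv_features(features, "sum")
max_descriptor = pool_conv_features(features, "max")
```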

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Bags of local convolutional features (pipeline figure): image at (336x256) resolution; conv5_1 from VGG16 [1]; feature maps of (42x32); 25K centroids; 25K-D vector
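A sketch of the assignment step suggested by the figure labels: every local convolutional feature is assigned to its nearest centroid (visual word), and the image becomes a sparse histogram over the vocabulary. Three 2-D centroids stand in for the 25K centroids; all values and names are illustrative.

```python
from collections import Counter


def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((f - c) ** 2 for f, c in zip(feature, centroids[i])))


def bag_of_words(local_features, centroids):
    """Sparse histogram {visual word: count} for one image."""
    return dict(Counter(nearest_centroid(f, centroids) for f in local_features))


centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
local_features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
histogram = bag_of_words(local_features, centroids)
```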

Query representation

Global Search (GS)

Local Search (LS)


Objects: Segmentation (slide credit: Eduard Fontdevila)

Semantic segmentation: assign a category label to all pixels in an image.

Objects: Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters are shared by the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l).

Non-linearity: tanh
Pooling: max

Upsampling and concatenation

Pixel-wise soft-max classifier
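A sketch of a pixel-wise soft-max classifier: each pixel has one score per category, the soft-max turns the scores into probabilities, and the label is the most probable category. The 1 x 2 score map is a made-up example.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixel(scores):
    """Index of the most probable category for one pixel."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def pixelwise_labels(score_map):
    """score_map[y][x] is a list of class scores; returns the label map."""
    return [[classify_pixel(pixel) for pixel in row] for row in score_map]


score_map = [[[2.0, 0.0], [0.0, 1.0]]]   # 1 x 2 image, 2 categories
labels = pixelwise_labels(score_map)
```

Each pixel is classified independently here, which is the source of the spatial inconsistency discussed next.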

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

C2 will be labelled with the class of C5.
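The optimal-cover rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost S. The tree, the node names and the costs below are hypothetical, chosen so that the root C5 wins.

```python
def optimal_component(leaf, parent, cost):
    """Walk from a leaf to the root and return the node with minimal cost."""
    best = node = leaf
    while node in parent:            # the root has no entry in `parent`
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Toy tree: pixels p1 and p2 belong to C2; C2 and pixel p3 belong to the root C5.
parent = {"p1": "C2", "p2": "C2", "C2": "C5", "p3": "C5"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "C2": 0.6, "C5": 0.2}

cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ("p1", "p2", "p3")}
```

In this toy case every pixel under C2 ends up covered by C5, so it takes the class predicted for C5.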

Objects: Segmentation: SDS (slide credit: Eduard Fontdevila)

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Feature vector A: a single BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region with the background masked out with the mean image; the two are concatenated as [1 2]. The network is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.

Feature vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN; the two are concatenated as [1 2].
- Training: the 2 networks are trained in isolation.
- Testing: results are combined.

Feature vector C:
- Training: as a whole (using segmentation overlap).
- Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
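For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A small sketch on made-up 2 x 3 label maps:

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += in_pred and in_gt
            union += in_pred or in_gt
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = pixel_iu(predicted, ground_truth, category=1)
```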


Thank you

https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
xavier.giro@upc.edu

Xavier Giró-i-Nieto


Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Detection Objects Reinforcement

Slide credit: Míriam Bellver

- The action-value function is estimated using a neural network that has as many output units as actions.
- The algorithm incorporates a replay memory to collect experiences.
- Category-specific Q-network.

Policy of the agent: select the action A with the maximum estimated value of the learnt action-value function.
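The greedy policy and the replay memory can be sketched in a few lines. This is only an illustration under assumptions: the Q-values are hard-coded instead of coming from a network, and the names `greedy_action` and `ReplayMemory` are hypothetical, not taken from the slides.

```python
import random
from collections import deque


def greedy_action(q_values):
    """Return the index of the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity):
        # Oldest experiences are dropped once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random mini-batch, so that updates use decorrelated experiences.
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))


# Hypothetical network output for one state: one value per action.
q_values = [0.12, -0.40, 0.87, 0.33]
best_action = greedy_action(q_values)

memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(state=step, action=best_action, reward=0.0, next_state=step + 1)
```

With a capacity of 2, the third `push` evicts the first experience, which is the behaviour one would typically want from a bounded replay memory.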

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:

1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Best performance with few region proposals.

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. "Multi-view Face Detection Using Deep Convolutional Neural Networks." ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Randomly sampled sub-windows (blocks) are positive examples if their Intersection-over-Union (IoU) with an annotated face is larger than 50%, and negative examples otherwise.

Total samples: 200K positive and 20M negative.
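The labelling rule above can be sketched as follows. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for this example; only the 50% IoU threshold comes from the slide.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(window, annotated_faces, threshold=0.5):
    """A sampled sub-window is positive if it overlaps any face by more than the threshold."""
    return any(iou(window, face) > threshold for face in annotated_faces)


# Hypothetical annotation and two sampled sub-windows.
faces = [(1, 1, 10, 10)]
label_close = is_positive((0, 0, 10, 10), faces)    # IoU = 81 / 100
label_far = is_positive((20, 20, 30, 30), faces)    # no overlap
```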

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes.
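One way to read "3 times per octave" is that consecutive scales differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A small sketch under that assumption; the number of octaves explored is a hypothetical parameter:

```python
def octave_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors 2**(k / steps_per_octave), from the smallest to the largest."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2 ** (k / steps_per_octave) for k in range(first, last + 1)]


# One octave down and one octave up around the original resolution.
scales = octave_scales(octaves_down=1, octaves_up=1)
```

Each rescaled copy of the test image is then scanned with the same fixed-size detector window.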

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:
- Efficiently run the convnet on images of any size.
- Obtain a heat-map of the face detector.
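The equivalence behind this conversion can be illustrated with a toy, single-channel example: a fully-connected unit defined over a fixed k x k window gives the same response as a k x k convolution kernel evaluated at that position, and sliding the kernel over a larger image yields a heat-map. The weights and image below are made up, and the sketch shows only the equivalence, not the efficiency gain that comes from sharing the earlier layers' computation.

```python
def fc_score(window, weights, bias):
    """Fully-connected unit: dot product between a fixed-size window and the weights."""
    total = bias
    for w_row, x_row in zip(weights, window):
        for w, x in zip(w_row, x_row):
            total += w * x
    return total


def conv_heatmap(image, weights, bias, stride=1):
    """Apply the same weights as a convolution kernel over an image of any size."""
    k = len(weights)
    height, width = len(image), len(image[0])
    heatmap = []
    for top in range(0, height - k + 1, stride):
        row = []
        for left in range(0, width - k + 1, stride):
            window = [r[left:left + k] for r in image[top:top + k]]
            row.append(fc_score(window, weights, bias))
        heatmap.append(row)
    return heatmap


# Hypothetical 2x2 "fully-connected" weights applied to a 3x4 image.
weights = [[1, 0], [0, 1]]
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
heatmap = conv_heatmap(image, weights, bias=0)
```

The top-left heat-map value equals what the fully-connected unit would output on the top-left 2x2 crop alone.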

Non-Maximum Suppression (NMS) to avoid overlapping detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
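A minimal sketch of greedy NMS follows. It is an IoU-based variant written for illustration, not a reproduction of the cited implementation; the `(x1, y1, x2, y2)` box format and the 0.3 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two overlapping detections of the same face plus one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept_indices = non_maximum_suppression(boxes, scores)
```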

Detection Faces DDFD Results

Precision vs Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Faces -> FaceNet -> Euclidean space where distances correspond to face similarity

End-to-end learning of an embedding (distance metric learning)...

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

...by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
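The triplet idea can be made concrete with a small sketch of a triplet loss in its usual hinge form: the anchor should be closer to the positive (same identity) than to the negative (different identity) by at least a margin. The 2-D embeddings and the margin value below are illustrative assumptions, not values read from the slides.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is farther than the positive by the margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)


anchor = (0.0, 0.0)
positive = (0.1, 0.0)

# Easy triplet: the negative is already far away, so it contributes no loss.
easy_loss = triplet_loss(anchor, positive, negative=(1.0, 0.0))

# Hard triplet: the negative is almost as close as the positive.
hard_loss = triplet_loss(anchor, positive, negative=(0.2, 0.0))
```

Easy triplets give zero loss and therefore no gradient, which is why the choice of triplets matters for training.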

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Image Database + Visual Query -> Expected outcome

- Visual query: "A dog"
- Visual query: "This dog"

Instance Retrieval (instance: object, building, person, place…)

Local hand-crafted features (e.g. SIFT):

v1 = (v11, …, v1n)
vk = (vk1, …, vkn)

N-dimensional feature space -> Bag of Visual Words (high-dimensional, highly sparse) -> INVERTED FILE

word | Image ID
1    | 1, 12
2    | 1, 30, 102
3    | 10, 12
4    | 23, 6, 10
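An inverted file maps each visual word to the list of images that contain it, so a query only touches the images sharing at least one word with it. A minimal sketch; the image-to-word assignments are toy values chosen for this example, and ranking by the number of shared words is a simplification of real scoring schemes.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs in which it occurs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def rank_candidates(inverted, query_words):
    """Rank images by how many query words they share (ties broken by image ID)."""
    shared = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            shared[image_id] += 1
    return sorted(shared, key=lambda image_id: (-shared[image_id], image_id))


# Toy database: image ID -> visual words found in that image.
image_words = {
    1: [1, 2],
    6: [4],
    10: [3, 4],
    12: [1, 3],
    23: [4],
    30: [2],
    102: [2],
}
inverted = build_inverted_file(image_words)
ranking = rank_candidates(inverted, query_words=[1, 2])
```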

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
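Sum and max pooling collapse each convolutional feature map into one number, so a layer with C channels gives a C-dimensional global descriptor regardless of the image size. A sketch with made-up activations; the L2 normalization step is a common choice assumed here, not something stated on the slide.

```python
import math


def sum_pool(feature_maps):
    """One value per channel: the sum of all spatial activations."""
    return [sum(sum(row) for row in fmap) for fmap in feature_maps]


def max_pool(feature_maps):
    """One value per channel: the largest spatial activation."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


def l2_normalize(vector):
    """Scale the descriptor to unit length so that descriptors are comparable."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Hypothetical conv layer output: 2 channels, each a 2x2 spatial map.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 5.0], [0.0, 0.0]],
]
sum_descriptor = sum_pool(feature_maps)
max_descriptor = max_pool(feature_maps)
unit_descriptor = l2_normalize([3.0, 4.0])
```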

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Image resolution (336x256) -> conv5_1 from VGG16 [1] (42x32) -> 25K centroids -> 25K-D vector
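The pipeline above turns every spatial position of the conv feature map into a visual word by assigning it to its nearest centroid, and the image is described by the histogram of those words. A toy sketch with 3 centroids and 2-D features instead of the 25K centroids and 42x32 positions on the slide; all values are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_centroid(feature, centroids):
    """Index of the centroid (visual word) closest to a local feature."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(feature, centroids[k]))


def bow_histogram(local_features, centroids):
    """Count how many local features fall on each visual word."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Toy codebook and local conv features (one per spatial position).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
local_features = [(0.1, 0.0), (0.9, 1.2), (4.0, 5.0), (5.0, 6.0)]
histogram = bow_histogram(local_features, centroids)
```

With a large codebook most histogram bins stay at zero, which is what makes an inverted file a natural index for this representation.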

Query Representation

- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

- Pyramid of three spatial scales.
- The same parameters in the three convnets: theta_i = theta_0 = filter weights (H_l) and biases (b_l).
- Non-linearity: tanh. Pooling: max.
- Upsampling and concatenation.
- Pixel-wise soft-max classifier.
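The last step, a pixel-wise soft-max classifier, can be sketched as follows: every pixel carries one score per class, the soft-max turns the scores into probabilities, and the label is the most probable class. The score values are made up, and the layers producing them are left out.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over the class scores of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_map(score_map):
    """Assign each pixel the class with the highest soft-max probability."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            label_row.append(max(range(len(probs)), key=lambda c: probs[c]))
        labels.append(label_row)
    return labels


# Hypothetical 1x2 image with 3 class scores per pixel.
score_map = [[[2.0, 1.0, 0.1], [0.0, 0.0, 3.0]]]
labels = label_map(score_map)
```

Because each pixel is classified independently, nothing in this step enforces that neighbouring pixels agree.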

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
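The selection rule can be sketched as a walk from a leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree shape and the costs below are hypothetical, chosen so that leaf C2 ends up with C5 as its best observation level.

```python
def optimal_component(leaf, parent, cost):
    """Return the minimal-cost component on the path from a leaf to the root."""
    best = leaf
    node = leaf
    while node in parent:  # the root has no entry in `parent`
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: C7 is the root, C5 and C6 are its children,
# and C1..C4 are the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7"}
# Hypothetical costs S (lower is better).
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.3,
        "C5": 0.3, "C6": 0.9, "C7": 0.8}

best_for_leaf = {leaf: optimal_component(leaf, parent, cost)
                 for leaf in ("C1", "C2", "C3", "C4")}
```

In this toy tree C1, C3 and C4 keep their own level, while C2 takes the class predicted for its parent C5.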

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: feature vector 1 (bounding box) and feature vector 2 (region, with the background masked out with the mean image) are both extracted with the BBOX CNN and concatenated as [1 2]. The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a REGION CNN, concatenated as [1 2].
- Training: the 2 networks are trained in isolation.
- Testing: results are combined.

Vector C:
- Training: as a whole (using segmentation overlap).
- Testing: results are combined (using the output of the penultimate layer).

Penultimate fully connected layer -> SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
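Pixel IU for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A small sketch on made-up label maps; the function name and the convention of returning 0.0 for an absent class are assumptions.

```python
def pixel_iu(predicted, ground_truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for pred, true in zip(pred_row, true_row):
            in_pred = pred == category
            in_true = true == category
            intersection += in_pred and in_true
            union += in_pred or in_true
    return intersection / union if union > 0 else 0.0


# Hypothetical 2x3 label maps with categories 0 (background) and 1.
predicted = [[1, 1, 0],
             [0, 0, 0]]
ground_truth = [[1, 0, 0],
                [1, 0, 0]]
iu_object = pixel_iu(predicted, ground_truth, category=1)
iu_background = pixel_iu(predicted, ground_truth, category=0)
```

Averaging the per-category values gives the mean IU that segmentation benchmarks commonly report.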


One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 55: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 55

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 56

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Datasets for training and testing PASCAL VOC

Two modes of evaluation

1) All attended Regions (AAR)2) Terminal regions (TR)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
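
A minimal sketch of the pooling idea on this slide: each channel of a convolutional layer is collapsed into one number by summing, or taking the maximum, over all spatial positions, which yields a fixed-length global descriptor. The toy activations and the final L2 normalisation are assumptions for illustration; the cited papers add their own weighting and post-processing.

```python
def sum_pool(feature_maps):
    """Sum every channel over all spatial positions: one value per channel."""
    return [sum(sum(row) for row in channel) for channel in feature_maps]

def max_pool(feature_maps):
    """Take the maximum of every channel over all spatial positions."""
    return [max(max(row) for row in channel) for channel in feature_maps]

def l2_normalize(vector):
    """Scale the descriptor to unit length (left unchanged if it is all zeros)."""
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Hypothetical conv activations: 2 channels, each a 2x3 spatial map.
conv = [
    [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]],
    [[0.0, 4.0, 0.0], [0.0, 0.0, 0.0]],
]
sum_descriptor = sum_pool(conv)
max_descriptor = max_pool(conv)
unit_descriptor = l2_normalize(max_descriptor)
```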

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

- Resolution: 336x256
- conv5_1 from VGG16 [1]: 42x32
- 25K centroids: 25K-D vector
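
The figures on this slide suggest a bag-of-words encoding of convolutional features: each of the 42x32 = 1344 spatial positions of the feature map is treated as a local descriptor, assigned to its nearest centroid, and the image becomes a histogram with one bin per centroid. The sketch below shows that assignment on toy data; the 2-D features, the three centroids and the function names are hypothetical, and how the centroids are learned (typically k-means) is left out.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(feature, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_words(feature_map, centroids):
    """feature_map[y][x] is the local feature at one spatial position."""
    assignment_map = [[nearest_centroid(f, centroids) for f in row]
                      for row in feature_map]
    histogram = [0] * len(centroids)
    for row in assignment_map:
        for word in row:
            histogram[word] += 1
    return assignment_map, histogram

# Toy stand-ins for the 25K centroids and the 42x32 map of conv features.
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature_map = [
    [[0.1, 0.0], [0.9, 0.1]],
    [[0.2, 0.8], [1.0, 0.0]],
]
assignment_map, histogram = bag_of_words(feature_map, centroids)
```

Keeping the assignment map, and not only the histogram, is what allows a descriptor to be built for any sub-window of the image without re-running the network.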

Objects Recognition Retrieval

Query Representation
- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

(Figure labels: person, bag, me, my bag)

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh
Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation
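
To make the upsampling and concatenation step concrete, the sketch below brings the feature maps of the two coarser scales back to the resolution of the finest one and stacks the three feature vectors at every pixel. Nearest-neighbour upsampling, the scale factors 2 and 4 and the one-feature-per-pixel toy maps are assumptions chosen for brevity.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every per-pixel feature `factor` times along both spatial axes."""
    upsampled = []
    for row in feature_map:
        wide_row = [feature for feature in row for _ in range(factor)]
        upsampled.extend([list(wide_row) for _ in range(factor)])
    return upsampled

def concatenate(maps):
    """Concatenate the per-pixel feature lists of equally sized maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum((m[y][x] for m in maps), []) for x in range(width)]
            for y in range(height)]

# Hypothetical outputs of the shared convnet at three scales (1, 1/2, 1/4).
fine = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
mid = [[[10.0 + y * 2 + x] for x in range(2)] for y in range(2)]
coarse = [[[100.0]]]

combined = concatenate([fine,
                        upsample_nearest(mid, 2),
                        upsample_nearest(coarse, 4)])
```

Every pixel of `combined` now carries one feature vector that mixes fine detail with wider context.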

Objects Segmentation Farabet

Pixel-wise soft-max classifier
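
A pixel-wise soft-max classifier can be sketched as follows: every pixel has one score per category, the soft-max turns those scores into probabilities, and the pixel takes the most probable category. The score values are made up, and the classifier that would produce them is outside the scope of this example.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_pixels(score_map):
    """score_map[y][x] holds one score per class; returns the argmax label map."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probabilities = softmax(scores)
            label_row.append(probabilities.index(max(probabilities)))
        labels.append(label_row)
    return labels

# Hypothetical 1x2 image with two classes.
labels = label_pixels([[[2.0, 1.0], [0.0, 3.0]]])
```

Because each pixel is decided independently, neighbouring pixels can receive unrelated labels.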

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S

C2 will be labelled with the class of C5
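
The selection rule for the optimal cover can be sketched on a small segmentation tree: for every leaf, walk up to the root and keep the node with the lowest cost. The tree shape, the node names and the cost values below are hypothetical, picked so that leaf C2 ends up covered by its parent C5 as in the slide's example.

```python
def optimal_components(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best

# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
# Hypothetical costs S (lower means a purer class distribution).
cost = {"C1": 0.2, "C2": 0.7, "C3": 0.4, "C4": 0.1,
        "C5": 0.3, "C6": 0.5, "C7": 0.9}

cover = optimal_components(parent, cost, ["C1", "C2", "C3", "C4"])
```

Each leaf then takes the class predicted for its selected component, so the observation level adapts per pixel with no extra parameter to tune.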

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A = [1 2]
- BBOX CNN → feature vector 1
- BBOX CNN (background masked out with the mean image) → feature vector 2

Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
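
Masking the background with the mean image can be sketched as below: pixels inside the region keep their value, and pixels outside it are replaced by the corresponding mean-image value, so they become zero once the mean is subtracted before the CNN. The grayscale toy crop, the mask and the mean image are hypothetical.

```python
def mask_background(crop, region_mask, mean_image):
    """Keep pixels inside the region; replace the rest with the mean image."""
    return [
        [pixel if inside else mean_pixel
         for pixel, inside, mean_pixel in zip(crop_row, mask_row, mean_row)]
        for crop_row, mask_row, mean_row in zip(crop, region_mask, mean_image)
    ]

# Hypothetical 2x3 grayscale crop, its region mask and the dataset mean image.
crop = [[200, 180, 30], [190, 40, 20]]
region_mask = [[True, True, False], [True, False, False]]
mean_image = [[120, 120, 120], [118, 118, 118]]

foreground = mask_background(crop, region_mask, mean_image)
```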

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

Vector B = [1 2]
- BBOX CNN → feature vector 1
- REGION CNN → feature vector 2

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

penultimate fully connected layer → SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
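
The pixel IU (Jaccard index) for one category is the number of pixels that both the prediction and the ground truth assign to that category, divided by the number of pixels that either of them assigns to it. A small sketch on made-up label maps, with a hypothetical function name:

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in both maps."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    # Convention assumed here: a category absent from both maps scores 0.
    return intersection / union if union else 0.0

# Hypothetical 2x3 label maps with categories 0 and 1.
predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
iu_category_1 = pixel_iu(predicted, ground_truth, 1)
```

Averaging this value over all categories gives a single summary score for a segmentation.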


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Datasets for training and testing: PASCAL VOC

Two modes of evaluation:
1) All Attended Regions (AAR)
2) Terminal Regions (TR)

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Best performance with few region proposals

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. “Multi-view Face Detection Using Deep Convolutional Neural Networks.” ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild, by TU Graz
- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train

Randomly sample sub-windows (blocks). A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative
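
The positive/negative rule for sampled sub-windows can be sketched with a plain IoU function. The box format (x1, y1, x2, y2) and the example coordinates are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_positive(window, annotated_faces, threshold=0.5):
    """A window is positive if it overlaps any annotated face by more than 50% IoU."""
    return any(iou(window, face) > threshold for face in annotated_faces)

# Hypothetical annotation and two sampled sub-windows.
faces = [(0, 0, 10, 10)]
close_window = (1, 0, 11, 10)   # IoU = 90 / 110
far_window = (5, 0, 15, 10)     # IoU = 50 / 150
```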

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
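
Rescaling 3 times per octave means that consecutive pyramid levels differ by a factor of 2^(1/3), so three steps double (or halve) the image size. A minimal sketch, where the range of one octave down and one octave up is an arbitrary choice for the example:

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors spaced by 2**(1/steps_per_octave) around the original size."""
    first = -octaves_down * steps_per_octave
    last = octaves_up * steps_per_octave
    return [2.0 ** (k / steps_per_octave) for k in range(first, last + 1)]

# One octave down to one octave up: 0.5x ... 1x ... 2x in seven levels.
scales = pyramid_scales(octaves_down=1, octaves_up=1)
```

The detector is run on every rescaled copy, so a fixed-size window covers small faces in the enlarged images and large faces in the reduced ones.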

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015.

Detection Faces DDFD Test

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
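
The equivalence behind this conversion can be checked on a toy case: a fully-connected unit that sees a k x k window computes a weighted sum, and applying the same weights at every position of a larger image is a convolution whose output is the heat-map. The single-channel 3x3 image, the 2x2 weights and the bias are made-up values, and a real network has many channels and layers.

```python
def fully_connected(window, weights, bias):
    """Weighted sum of a window that has exactly the size of the weights."""
    return sum(w * p
               for weight_row, pixel_row in zip(weights, window)
               for w, p in zip(weight_row, pixel_row)) + bias

def as_convolution(image, weights, bias):
    """Apply the same weights at every valid position: one score per window."""
    k_h, k_w = len(weights), len(weights[0])
    heat_map = []
    for y in range(len(image) - k_h + 1):
        row = []
        for x in range(len(image[0]) - k_w + 1):
            window = [image_row[x:x + k_w] for image_row in image[y:y + k_h]]
            row.append(fully_connected(window, weights, bias))
        heat_map.append(row)
    return heat_map

weights, bias = [[1, 0], [0, 1]], -1          # hypothetical trained parameters
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # larger than the 2x2 window
heat_map = as_convolution(image, weights, bias)
```

When the input has exactly the window size, the heat-map shrinks to a single value equal to the fully-connected output, which is why the converted network gives the same answer as the original on fixed-size inputs.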

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
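
One common way to implement NMS is the greedy variant sketched below: detections are visited from highest to lowest score, and a detection is kept only if it does not overlap an already kept one by more than a threshold. The IoU threshold of 0.5, the box format and the example detections are assumptions, and the cited detector may use a different NMS strategy.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(detections, iou_threshold=0.5):
    """detections: list of (score, box). Returns the kept detections, best first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Hypothetical detections: two overlapping boxes on one face, one box elsewhere.
detections = [(0.8, (1, 1, 11, 11)), (0.9, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
kept = non_maximum_suppression(detections)
```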

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training


Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

by means of well-chosen triplets using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
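
The role of a triplet can be illustrated with the triplet loss used to train such embeddings: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The 2-D embeddings below are toy values and the margin of 0.2 is an assumed setting; the selection of hard triplets during training is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

anchor, positive = (0.0, 0.0), (0.1, 0.0)
easy_negative = (1.0, 0.0)    # already far away: contributes no loss
hard_negative = (0.3, 0.0)    # too close to the anchor: contributes a loss

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

Easy triplets give a zero loss and therefore no gradient, which is why the choice of triplets matters so much for training.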

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)


Page 57: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 57

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Best performance with few region proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 58

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNet

... by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
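A minimal sketch of a triplet loss for learning such an embedding, assuming L2-normalised embeddings and squared Euclidean distances. The function names, the toy 2-D vectors and the margin value are illustrative assumptions, not taken from the slides.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin); margin is an assumed value."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

# Toy embeddings (hypothetical): anchor and positive are the same identity.
anchor = l2_normalize([1.0, 0.0])
positive = l2_normalize([0.9, 0.1])
negative = l2_normalize([0.0, 1.0])

easy = triplet_loss(anchor, positive, negative)   # already well separated
hard = triplet_loss(anchor, negative, positive)   # roles swapped: violates the margin
```

Triplets whose loss is already zero contribute no gradient, which is one reason the choice of triplets matters.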

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano E., Salvador A., McGuinness K., Giró-i-Nieto X., O'Connor N., and Marqués F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.

Objects Recognition Retrieval

Visual query against an image database: “A dog” (with its expected outcome)

Objects Recognition Retrieval

Visual query against an image database: “This dog” (with its expected outcome)

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space:

v1 = (v11, …, v1n)
vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE: table listing, for each visual word, the IDs of the images that contain it
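The inverted file can be sketched as a dictionary from visual word to image IDs. This is only an illustration: the toy data, the function names and the vote-counting ranking are assumptions, not the indexing scheme of any specific system.

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}

def query(index, query_words):
    """Rank images by the number of visual words they share with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    # Most votes first; ties broken by image ID for a deterministic order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (hypothetical): image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 4], 12: [1, 3], 30: [2]}
index = build_inverted_file(image_words)
ranking = query(index, [1, 3])
```

Because the bag-of-words vectors are highly sparse, only the images listed under the query's words are ever touched.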

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
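A rough sketch of sum and max pooling of convolutional activations into one global descriptor per image. The tiny two-channel feature map and the function name are hypothetical, and the cited works add further steps (weighting, whitening, region pooling) that are omitted here.

```python
import math

def pool_conv_features(feature_map, mode="sum"):
    """feature_map: list of channels, each a 2-D list (H x W) of activations.
    Returns one pooled value per channel, L2-normalised."""
    reduce_fn = sum if mode == "sum" else max
    pooled = [reduce_fn(v for row in channel for v in row)
              for channel in feature_map]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]

# Toy feature map (hypothetical): 2 channels of 2x2 activations.
fmap = [
    [[0.0, 3.0], [1.0, 0.0]],
    [[0.0, 0.0], [3.0, 0.0]],
]
sum_descriptor = pool_conv_features(fmap, mode="sum")
max_descriptor = pool_conv_features(fmap, mode="max")
```

The descriptor length equals the number of channels, so it does not depend on the image size.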

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

(336x256) resolution → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
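A sketch of the quantisation step, assuming each local convolutional feature is assigned to its nearest centroid and the assignments are counted into a histogram. The three 2-D centroids below are toy values standing in for the 25K-centroid codebook, and the helper names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
    )

def bag_of_words(local_features, centroids):
    """Histogram of visual words: one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram

# Toy codebook and local features (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
local_features = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 5.5]]
bow = bag_of_words(local_features, centroids)
```

With a large codebook most bins stay at zero, which is what makes an inverted file practical.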

Objects Recognition Retrieval

Query Representation: Global Search (GS) and Local Search (LS)

One lecture organized in four parts

Local analysis for Proposals, Detection, Recognition and Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)

Non-linearity: tanh
Pooling: max

Upsampling and concatenation

Pixel-wise soft-max classifier
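The four steps above (pyramid, shared parameters, upsampling and concatenation, pixel-wise soft-max) can be sketched as follows. Everything here is a simplification under stated assumptions: `shared_convnet` is a one-parameter stand-in for a real convnet, the pyramid is built by plain subsampling, and the class weights are made up.

```python
import math

def downsample(img, factor):
    """Keep every factor-th pixel (a crude stand-in for a proper image pyramid)."""
    return [row[::factor] for row in img[::factor]]

def upsample(fmap, factor):
    """Nearest-neighbour upsampling back to the original resolution."""
    return [[v for v in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]

def shared_convnet(img, weight=0.5, bias=0.1):
    """Placeholder network: the same weight and bias are reused at every scale."""
    return [[math.tanh(weight * v + bias) for v in row] for row in img]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

image = [[0.0, 1.0, 2.0, 3.0],
         [1.0, 2.0, 3.0, 4.0],
         [2.0, 3.0, 4.0, 5.0],
         [3.0, 4.0, 5.0, 6.0]]

# One feature map per scale, all brought back to the 4x4 resolution.
features = [upsample(shared_convnet(downsample(image, f)), f) for f in (1, 2, 4)]

# Concatenate the three scales into one feature vector per pixel.
per_pixel = [[[f[y][x] for f in features] for x in range(4)] for y in range(4)]

# Hypothetical linear classifier with two classes, applied pixel-wise.
class_weights = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
probs = [[softmax([sum(w * v for w, v in zip(cw, pix)) for cw in class_weights])
          for pix in row] for row in per_pixel]
```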

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels
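One simple way to enforce consistency inside a superpixel is a majority vote over its pixel labels, sketched below. This is a simplification chosen for illustration (the original method may aggregate class distributions rather than hard labels); the label names and superpixel map are toy data.

```python
from collections import Counter

def smooth_with_superpixels(pixel_labels, superpixel_ids):
    """Give every pixel the most frequent label of the superpixel it belongs to."""
    votes = {}
    for label_row, sp_row in zip(pixel_labels, superpixel_ids):
        for label, sp in zip(label_row, sp_row):
            votes.setdefault(sp, Counter())[label] += 1
    winner = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[winner[sp] for sp in sp_row] for sp_row in superpixel_ids]

# Toy example (hypothetical): one noisy "road" pixel inside superpixel 0.
pixel_labels = [["sky", "sky", "road"],
                ["sky", "road", "road"]]
superpixel_ids = [[0, 0, 1],
                  [0, 0, 1]]
smoothed = smooth_with_superpixels(pixel_labels, superpixel_ids)
```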

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
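The selection rule can be sketched as a walk from each leaf up to the root, keeping the node with the lowest cost. The tree, node names and cost values below are hypothetical and unrelated to the C2/C5 figure; how the cost S is computed is not covered here.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from leaf to root (inclusive)."""
    best = node = leaf
    while node in parent:          # the root has no entry in parent
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best

# Toy segmentation tree (hypothetical): three leaves, two regions, one root.
parent = {"px0": "A", "px1": "A", "px2": "B", "A": "root", "B": "root"}
cost = {"px0": 0.9, "px1": 0.8, "px2": 0.7, "A": 0.2, "B": 0.6, "root": 0.4}

cover = {leaf: optimal_component(leaf, parent, cost)
         for leaf in ("px0", "px1", "px2")}
```

Each pixel then takes the class predicted for its selected component, so different pixels can be explained at different levels of the tree.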

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. “Simultaneous Detection and Segmentation” (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image); both are concatenated as [1 2]

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
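A toy sketch of the masking and concatenation, assuming a grayscale crop and a scalar mean value. `toy_cnn` is a hypothetical stand-in for the network and only exists so the example runs.

```python
def mask_background(crop, mask, mean_value):
    """Replace every pixel outside the region mask with the mean value."""
    return [[pixel if inside else mean_value
             for pixel, inside in zip(crop_row, mask_row)]
            for crop_row, mask_row in zip(crop, mask)]

def toy_cnn(image):
    """Hypothetical feature extractor: returns [mean, max] of the pixels."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]

crop = [[10.0, 200.0],
        [30.0, 220.0]]
mask = [[False, True],
        [False, True]]          # True marks region foreground
mean_value = 100.0               # assumed dataset mean

foreground = mask_background(crop, mask, mean_value)
# Feature vector 1 (box) followed by feature vector 2 (region foreground).
vector_a = toy_cnn(crop) + toy_cnn(foreground)
```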

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: the BBOX CNN extracts feature vector 1 from the bounding box and a separate REGION CNN extracts feature vector 2 from the region foreground; both are concatenated as [1 2]

Training: 2 networks trained in isolation
Testing: results are combined

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)
Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer → SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
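For reference, pixel IU for one category is the number of pixels labelled with that category in both prediction and ground truth, divided by the number labelled with it in either. A small sketch with made-up label maps:

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else 0.0

# Toy label maps (hypothetical).
pred = [["person", "person", "bg"],
        ["bg", "person", "bg"]]
truth = [["person", "bg", "bg"],
         ["bg", "person", "person"]]

person_iu = pixel_iu(pred, truth, "person")
mean_iu = sum(pixel_iu(pred, truth, c) for c in ("person", "bg")) / 2
```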


One lecture organized in four parts

Local analysis for Proposals, Detection, Recognition and Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Detection Objects Reinforcement

Slide credit: Míriam Bellver

Detection Faces

Detection Faces DDFD

Farfade, Sachin Sudhakar, Mohammad Saberian, and Li-Jia Li. “Multi-view Face Detection Using Deep Convolutional Neural Networks.” ICMR (2015). [software]

Detection Faces DDFD Train

Dataset source: Annotated Facial Landmarks in the Wild (AFLW), by TU Graz

- 25k annotated faces on images downloaded from Flickr
- 380k manually annotated facial landmarks

Detection Faces DDFD Train

Randomly sampled sub-windows (blocks): positive examples if the Intersection-over-Union (IoU) with an annotated face is larger than 50%, negative examples otherwise

Total samples: 200K positive and 20M negative
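The labelling rule can be sketched as below, assuming axis-aligned boxes given as (x1, y1, x2, y2). The box coordinates and function names are illustrative only.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2) with x2 > x1, y2 > y1."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_positive(window, annotated_faces, threshold=0.5):
    """A sub-window is a positive example if it overlaps any face by more than 50% IoU."""
    return any(iou(window, face) > threshold for face in annotated_faces)

# Toy annotation and two sampled sub-windows (hypothetical coordinates).
faces = [(10, 10, 110, 110)]
near = (20, 20, 120, 120)    # IoU = 8100 / 11900
far = (60, 60, 160, 160)     # IoU = 2500 / 17500
labels = [is_positive(near, faces), is_positive(far, faces)]
```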

Detection Faces DDFD Test

Test images are rescaled up/down 3 times per octave to find faces of different sizes
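A sketch of the scale factors this implies, assuming "3 times per octave" means three evenly spaced steps (in log scale) for every factor of two. The number of octaves below is an arbitrary choice for the example.

```python
def pyramid_scales(octaves_down, octaves_up, steps_per_octave=3):
    """Scale factors from 2**-octaves_down to 2**octaves_up, inclusive."""
    steps = range(-octaves_down * steps_per_octave,
                  octaves_up * steps_per_octave + 1)
    return [2.0 ** (s / steps_per_octave) for s in steps]

# One octave down and one octave up around the original size (assumed range).
scales = pyramid_scales(1, 1)
```

Consecutive scales differ by a factor of 2 ** (1/3), roughly 1.26.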

Detection Faces DDFD Test

Sliding window of 227x227 over the test image

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
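A tiny sketch of why the conversion works: a fully-connected unit over a k x k window has the same weights as a k x k convolution kernel, so sliding that kernel over a larger input yields one score per window position, i.e. a heat-map. The single-channel input, weights and sizes are made up for illustration.

```python
def fully_connected(window, weights, bias):
    """One fully-connected unit applied to a flattened k x k window."""
    flat = [v for row in window for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias

def as_convolution(image, weights, bias, k):
    """Reshape the same weights into a k x k kernel and slide it (stride 1, no padding)."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    height, width = len(image), len(image[0])
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(k) for j in range(k)) + bias
             for x in range(width - k + 1)]
            for y in range(height - k + 1)]

weights, bias, k = [1.0, -1.0, 0.5, 2.0], 0.5, 2   # hypothetical trained parameters
image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]

heat_map = as_convolution(image, weights, bias, k)
top_left = fully_connected([row[:k] for row in image[:k]], weights, bias)
```

The top-left heat-map value equals the fully-connected output on the top-left window; the remaining values come without re-running the network per window.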

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (Pyimagesearch, 2014)
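A minimal greedy NMS sketch, not the implementation from the cited source: detections are visited in decreasing score order and a detection is kept only if it does not overlap an already kept one by more than a threshold. The 0.3 threshold and the boxes are assumed values.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(detections, iou_threshold=0.3):
    """detections: list of (score, box). Returns the kept detections, best first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Toy detections (hypothetical): two overlapping boxes on one face, one elsewhere.
detections = [
    (0.8, (20, 20, 120, 120)),
    (0.9, (10, 10, 110, 110)),
    (0.7, (200, 200, 300, 300)),
]
kept = non_maximum_suppression(detections)
```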

Page 59: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 59

Detection Objects Reinforcement Slide credit Miacuteriam Bellver

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces

60

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
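A common greedy variant of NMS can be sketched as follows. It is an illustration under assumptions: boxes are (x1, y1, x2, y2) tuples, each detection has a score, and the 0.5 overlap threshold is a placeholder. It is not claimed to match the exact variant used in DDFD or in the cited tutorial.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much, repeat.

    Returns the indices of the kept boxes, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Two heavily overlapping detections of the same face collapse into the higher-scoring one, while a distant detection survives.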


Detection Faces DDFD Results

Precision vs. Recall curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training.

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation.

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

The embedding is learned by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
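Training on triplets can be illustrated with the triplet loss, which asks an anchor face to be closer to a positive (same identity) than to a negative (different identity) by at least a margin. This is a minimal sketch on plain Python lists; the margin value and the function names are assumptions for illustration, and triplet selection is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin) on squared distances."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

A triplet whose negative is already far away contributes zero loss, which is why the choice of informative triplets matters during training.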


Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.

Objects Recognition Retrieval

Image Database, Visual Query (“A dog”), Expected outcome

Objects Recognition Retrieval

Image Database, Visual Query (“This dog”), Expected outcome

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

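Objects Recognition Retrieval

Bag of Visual Words: local hand-crafted features (e.g. SIFT) v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) live in an N-dimensional feature space and are quantized into visual words. The resulting image representation is high-dimensional and highly sparse, so it is indexed with an INVERTED FILE that lists, for each word, the IDs of the images that contain it.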
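An inverted file for a bag of visual words can be sketched with a dictionary from word ID to image IDs. The data layout and the names `build_inverted_file` and `candidates` are hypothetical, and the ranking below simply counts shared words rather than using any particular weighting scheme.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Rank images by how many distinct query words they share (ties by image ID)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because the representation is sparse, a query only touches the posting lists of the words it contains instead of comparing against every image in the database.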

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
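Sum or max pooling of convolutional features reduces each channel of a feature map to a single number, giving one global descriptor per image. The sketch below assumes the feature map is a list of channels, each a 2D list; the L2 normalization step and the function names are illustrative assumptions and do not reproduce any of the cited methods exactly.

```python
import math


def global_pool(feature_maps, mode="sum"):
    """Pool every channel (a 2D list) into one value, giving a C-dimensional descriptor."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reducer = sum if mode == "sum" else max
    return [reducer(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)
```

Descriptors built this way have as many dimensions as the layer has channels, regardless of the input image size.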

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

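Objects Recognition Retrieval

An image at 336x256 resolution is passed through VGG16 [1] up to conv5_1, giving a 42x32 map of local features. Each local feature is assigned to one of 25K centroids, so the image is represented by a 25K-D vector.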
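The step from local convolutional features to a bag-of-words vector can be sketched as nearest-centroid assignment followed by a histogram. The toy example uses 2 centroids instead of 25K; the function names are hypothetical, and clustering the centroids (e.g. with k-means) is assumed to have been done beforehand.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((f - m) ** 2 for f, m in zip(feature, centroids[c])))


def bow_histogram(local_features, centroids):
    """Count how many local features fall on each centroid (visual word)."""
    hist = [0] * len(centroids)
    for feature in local_features:
        hist[nearest_centroid(feature, centroids)] += 1
    return hist
```

With a 42x32 feature map there are 1344 local features per image, so most entries of a 25K-D histogram are zero, which suits an inverted-file index.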

Objects Recognition Retrieval

Query Representation: Global Search (GS) and Local Search (LS)

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation.

Objects Segmentation

Semantic segmentation: assign a category label to every pixel in an image.

Slide credit: Eduard Fontdevila

Objects Segmentation Farabet

Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Pyramid of three spatial scales.

Objects Segmentation Farabet

The same parameters are used in the three convnets: theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}.

Non-linearity: tanh. Pooling: max.

Upsampling and concatenation.
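Upsampling and concatenation of multi-scale feature maps can be sketched as follows. The example assumes one value per location at each scale, integer scale factors, and nearest-neighbour upsampling; these are simplifications for illustration and the function names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2D map by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def concat_scales(maps_per_scale, factors):
    """Bring every scale to full resolution and stack the values per pixel."""
    upsampled = [upsample_nearest(m, f) for m, f in zip(maps_per_scale, factors)]
    height, width = len(upsampled[0]), len(upsampled[0][0])
    return [[[u[y][x] for u in upsampled] for x in range(width)] for y in range(height)]
```

After this step every pixel carries a feature vector that mixes fine detail from the full-resolution scale with wider context from the coarser ones.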

Objects Segmentation Farabet

Pixel-wise soft-max classifier

Objects Segmentation Farabet

Problem: no spatial consistency among labels.

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels. Prediction with a 2-layer network.
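One way superpixels can enforce spatial consistency is to pool the per-pixel class scores inside each superpixel and give all its pixels the winning class. The sketch below assumes `probs[y][x]` is a list of class probabilities and `segments[y][x]` is a superpixel ID produced by some external over-segmentation; both names are hypothetical.

```python
def label_by_superpixel(probs, segments):
    """Assign to every pixel the class with the highest summed probability in its superpixel.

    Summing instead of averaging gives the same winner, since all classes
    in a superpixel are divided by the same pixel count.
    """
    sums = {}
    for y, row in enumerate(segments):
        for x, sp in enumerate(row):
            acc = sums.setdefault(sp, [0.0] * len(probs[y][x]))
            for c, p in enumerate(probs[y][x]):
                acc[c] += p
    winner = {sp: max(range(len(acc)), key=acc.__getitem__) for sp, acc in sums.items()}
    return [[winner[sp] for sp in row] for row in segments]
```

In the test case used for this sketch, a pixel whose own best class is 1 is relabelled as 0 because the rest of its superpixel votes strongly for class 0.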

Objects Segmentation Farabet

Solution 2: Superpixels + CRF. Prediction with a 2-layer network.

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level. BPT (Binary Partition Tree) [Garrido, Salembier].

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

In the example tree, C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
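The selection of the optimal component for each leaf can be sketched as a walk from the leaf to the root that keeps the node with the lowest cost. The tree, the cost values, and the name `optimal_cover` below are hypothetical; how the cost S is computed for each component is not shown.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost.

    parent maps a node to its parent (None for the root); cost maps a node to S.
    Ties are resolved in favour of the node closest to the leaf.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)
        best[leaf] = best_node
    return best
```

With made-up costs in which C5 is cheaper than its child C2 and than the root, leaf C2 is covered by C5 and therefore takes the class predicted for C5.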

Objects Segmentation SDS

Hariharan, Arbeláez, Girshick, Malik. “Simultaneous Detection and Segmentation.” ECCV 2014.

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

Vector A: a BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, and both are concatenated as [1 2]. The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.

For the region input, the background is masked out with the mean image.

Slide credit: Eduard Fontdevila
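Masking the background with the mean image can be sketched on a single-channel crop. The example assumes a binary mask (1 for region foreground, 0 for background) and a mean image of the same size; the name `mask_background` is hypothetical.

```python
def mask_background(image, mask, mean_image):
    """Keep foreground pixels and replace background pixels with the mean image."""
    return [
        [pixel if keep else mean for pixel, keep, mean in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]
```

If the network subtracts the mean image from its input, background pixels filled this way become zero after that subtraction, so only the region foreground contributes a signal.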

Objects Segmentation SDS

Vector B: a BBOX CNN extracts feature vector 1 and a REGION CNN extracts feature vector 2, concatenated as [1 2].

Training: the 2 networks are trained in isolation.

Testing: results are combined.

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

Vector C

Training: as a whole (using segmentation overlap).

Testing: results are combined (using the output of the penultimate layer).

The output of the penultimate fully connected layer is fed to an SVM.

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
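Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch, assuming both label maps are 2D lists of category IDs of the same size (the name `pixel_iu` is hypothetical):

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of the pixel sets labelled `category` in prediction and ground truth."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    return inter / union if union else 0.0
```

Averaging this value over categories gives a single score for a labeling; how void or unlabelled pixels are handled depends on the benchmark and is not covered here.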

One lecture organized in four parts: local analysis for Proposals, Detection, Recognition and Segmentation.

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Page 61: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection FacesDDFD

61

Farfade Sachin Sudhakar Mohammad Saberian and Li-Jia Li Multi-view Face Detection Using Deep Convolutional Neural Networks ICMR (2015) [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Detection Faces DDFD Train

63

Sub-windows (blocks) are sampled at random. A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative
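The labelling rule above can be sketched in a few lines. This is only an illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function names `iou` and `label_window` are hypothetical and do not come from the DDFD software.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, faces, threshold=0.5):
    """Return 1 (positive) if the window overlaps any annotated face
    with IoU above the threshold, and 0 (negative) otherwise."""
    return 1 if any(iou(window, face) > threshold for face in faces) else 0
```

With one annotated face at (1, 0, 11, 10), the window (0, 0, 10, 10) would be labelled positive, while a window far away from every face would be labelled negative.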

Detection Faces DDFD Test

64

Test images are rescaled up/down 3 times per octave to find faces of different sizes
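One way to read "3 times per octave" is that consecutive scales differ by a factor of 2^(1/3), so every third scale halves the image size. The sketch below lists such scale factors; the scale range and the name `pyramid_scales` are assumptions made for illustration, not values taken from the slides.

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """Scale factors from max_scale down to min_scale,
    with steps_per_octave scales for every factor of 2."""
    scales = []
    i = 0
    while True:
        scale = max_scale * 2.0 ** (-i / steps_per_octave)
        # Small tolerance so that min_scale itself is not lost to rounding.
        if scale < min_scale * (1 - 1e-9):
            break
        scales.append(scale)
        i += 1
    return scales
```

Under these assumptions, `pyramid_scales(0.5, 2.0)` covers two octaves with seven scales, and the detector would be run once per rescaled image.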

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)
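A sliding window simply enumerates every position of a fixed-size box over the image. The following sketch assumes a stride of 32 pixels, which is a hypothetical choice for illustration (the slides only state the 227x227 window size).

```python
def sliding_windows(width, height, size=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of size x size that fit inside
    an image of the given width and height."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, x + size, y + size)
```

Each yielded box would be cropped and classified as face or non-face; images smaller than the window produce no boxes at all.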

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers, which allows processing images of any size

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

Detection Faces DDFD Test

67

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
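The idea behind the conversion can be shown on a toy case: a fully connected unit that looks at a k x k patch computes the same value as a k x k convolution kernel placed at that position. The sketch below uses a single unit and plain Python lists; it is a simplified illustration, not the DDFD network.

```python
def fc(patch, weights, bias):
    """Fully connected unit: weighted sum of all values in a k x k patch."""
    return bias + sum(w * v
                      for w_row, p_row in zip(weights, patch)
                      for w, v in zip(w_row, p_row))


def conv_valid(image, kernel, bias):
    """The same weights applied as a 'valid' convolution (stride 1),
    producing one output per k x k window of the image."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [[fc([row[x:x + k] for row in image[y:y + k]], kernel, bias)
             for x in range(cols - k + 1)]
            for y in range(rows - k + 1)]
```

On an input of exactly k x k the convolution returns a single value equal to the fully connected output; on a larger input it returns a grid of values, which is the heat-map mentioned above.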

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
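A common greedy form of NMS keeps the highest-scoring detection and discards any remaining detection that overlaps it too much. The sketch below is one such variant, assuming detections are (score, box) pairs and an IoU threshold of 0.3; both the threshold and the function names are illustrative assumptions rather than the settings used in DDFD.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.3):
    """Greedy NMS over a list of (score, box) pairs.
    Returns the kept detections sorted by descending score."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```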

Detection Faces DDFD Results

69

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

71

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

72

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

73

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

74

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

75

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
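The triplet idea can be made concrete with the loss described in the FaceNet paper: an anchor face should be closer to a positive (same identity) than to a negative (different identity) by at least a margin. The sketch below works on plain lists standing in for embeddings; the margin of 0.2 follows the paper, while the function names are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the negative is farther from the
    anchor than the positive by at least `margin` (in squared distance)."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

Triplets that already satisfy the margin contribute zero loss, which is why choosing informative triplets matters for training.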

Faces Recognition FaceNet

76

Faces Recognition FaceNet

77

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

78

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet

79

Faces Recognition FaceNet Test

80

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

81

Software implementation: OpenFace

Faces Recognition VGG Face

82

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

83

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition Retrieval

84

Image Database

Visual Query: "A dog"

Expected outcome

Objects Recognition Retrieval

85

Image Database

Visual Query: "This dog"

Expected outcome

Objects Recognition Retrieval

86

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

87

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE: table that lists, for each visual word, the IDs of the images that contain it
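Because bag-of-visual-words vectors are highly sparse, an inverted file lets a query touch only the images that share at least one visual word with it. The sketch below builds such an index from a hypothetical toy database; the image IDs, word IDs and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual word ids.
    Returns a dict word id -> sorted list of image ids containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidates(index, query_words):
    """Images sharing at least one visual word with the query, as
    (image_id, shared_word_count) pairs, best match first."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

In a real system the vote count would typically be replaced by a weighted similarity (for example tf-idf), but the lookup structure stays the same.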

Objects Recognition Retrieval

88

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

89

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

90

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
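Sum or max pooling collapses each convolutional feature map into a single number, so an image with C channels is described by a C-dimensional vector regardless of its spatial size. The sketch below shows this on nested lists; the L2 normalisation step is an assumption commonly paired with such descriptors, and the function names are illustrative.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """feature_maps: list of C channels, each an H x W list of lists.
    Returns a C-dimensional descriptor using sum or max pooling."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row)
            for channel in feature_maps]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (unchanged if all zeros)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)
```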

Objects Recognition Retrieval

91

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

92

Objects Recognition Retrieval

93

Resolution: 336x256
conv5_1 from VGG16 [1]: 42x32
25K centroids, 25K-D vector

Objects Recognition Retrieval

94

Query Representation

Global Search (GS)

Local Search (LS)

Objects Recognition Retrieval

95

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

97

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters in the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh
Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier
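A pixel-wise soft-max classifier turns the vector of class scores at each pixel into class probabilities and labels the pixel with the most probable class. The sketch below assumes the scores are already available as an H x W grid of per-class lists; how the scores are produced from the multi-scale features is outside this illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_map(score_map):
    """score_map: H x W grid where each cell is a list of class scores.
    Returns an H x W grid with the index of the most probable class."""
    labels = []
    for row in score_map:
        out = []
        for scores in row:
            probs = softmax(scores)
            out.append(probs.index(max(probs)))
        labels.append(out)
    return labels
```

Since every pixel is classified independently, nothing in this step enforces that neighbouring pixels receive the same label.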

Objects Segmentation Farabet

103

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Solution 1: Superpixels

Prediction with a 2-layer network

Objects Segmentation Farabet

105

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
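The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the cost values and the function name below are hypothetical and chosen only to illustrate the rule.

```python
def optimal_cover(parent, cost, leaves):
    """parent: node -> parent node (the root maps to None).
    cost: node -> cost S of using that component.
    Returns leaf -> node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best


# Hypothetical tree: C5 groups C1 and C2, C6 groups C3 and C4, C7 is the root.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.5, "C2": 0.6, "C3": 0.1, "C4": 0.4,
        "C5": 0.2, "C6": 0.3, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

With these made-up costs, C2 is assigned to C5, so it would take the class predicted for C5, while C3 keeps its own prediction because no ancestor has a lower cost.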

Objects Segmentation SDS

109

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

110

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit: Eduard Fontdevila

BBOX CNN -> feature vector 1
BBOX CNN (background masked out with the mean image) -> feature vector 2
vector A = [1 2]

Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Objects Segmentation SDS

112

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

BBOX CNN -> feature vector 1
REGION CNN -> feature vector 2
vector B = [1 2]

Objects Segmentation SDS

113

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Objects Segmentation SDS

114

Slide credit: Eduard Fontdevila

penultimate fully connected layer -> SVM

Objects Segmentation SDS

115

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
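Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below computes it on two small label maps; the function name and the toy data are illustrative.

```python
def pixel_iu(pred, truth, category):
    """Jaccard index for one category between two equally sized
    label maps given as lists of rows of integer labels."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            in_pred, in_truth = p == category, t == category
            inter += int(in_pred and in_truth)
            union += int(in_pred or in_truth)
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")
```

A benchmark score is usually the mean of this value over all categories, which penalises both missed pixels and false alarms.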

Objects Segmentation SDS

117

Slide credit: Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

119

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 62: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

62

Dataset Source Annotated Facial Landmarks in the Wild by TU Graz 25k annotated faces on images downloaded from Flickr 380k manually annotated facial landmarks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Train

63

Randomly samples sub-windows (blocks) Positive examples if Intersection-over Union (IoU) with an annotated

face is larger than 50 and negative sample otherwise

Total samples 200K positive and 20M negative

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Detection Faces DDFD Train

Training samples are randomly drawn sub-windows (blocks). A sub-window is a positive example if its Intersection-over-Union (IoU) with an annotated face is larger than 50%, and a negative example otherwise.

Total samples: 200K positive and 20M negative.
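The labeling rule above can be sketched as follows. This is a minimal illustration, not the DDFD authors' code: the box format (x1, y1, x2, y2) and the function names `iou` and `label_window` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_window(window, annotated_faces, threshold=0.5):
    """Return 1 (positive) if the window has IoU > threshold with any annotated face, else 0."""
    return int(any(iou(window, face) > threshold for face in annotated_faces))
```

A window shifted by one pixel from a 10x10 face is still positive, while a window shifted by half its width is negative.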

Detection Faces DDFD Test

Test images are rescaled up and down, 3 times per octave, to find faces of different sizes.
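"3 times per octave" means that consecutive scales differ by a factor of 2^(1/3), so three steps double the image size. A small sketch of the resulting scale list, where the scale range (0.5 to 2.0) and the name `pyramid_scales` are assumptions for illustration only:

```python
def pyramid_scales(min_scale, max_scale, steps_per_octave=3):
    """List the rescaling factors from min_scale to max_scale, steps_per_octave per doubling."""
    scales = []
    k = 0
    while True:
        scale = min_scale * 2 ** (k / steps_per_octave)
        if scale > max_scale * (1 + 1e-9):  # small tolerance for floating-point rounding
            break
        scales.append(scale)
        k += 1
    return scales
```

With the assumed range, two octaves are covered (0.5 to 1 and 1 to 2), which gives seven scales in total.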

Sliding window of 227x227 over the test image.

Source: James Hays, "Object Category Detection: Sliding Windows" (Brown University, 2011)

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.

This makes it possible to:

- Efficiently run the convnet on images of any size
- Obtain a heat-map of the face detector
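A toy sketch of why the conversion works, assuming a "network" that is a single fully-connected unit over a k x k window (the names `fc_score` and `conv_heatmap` are hypothetical). Sliding the same weights over a larger image as a valid convolution (cross-correlation) yields one score per window position, that is, a heat-map:

```python
def fc_score(window, weights, bias):
    """Fully-connected unit: dot product between a k x k window and k x k weights, plus bias."""
    return sum(w * x
               for w_row, x_row in zip(weights, window)
               for w, x in zip(w_row, x_row)) + bias


def conv_heatmap(image, weights, bias):
    """Apply the same weights at every valid position of an image of any size."""
    k = len(weights)
    height, width = len(image), len(image[0])
    return [[fc_score([row[c:c + k] for row in image[r:r + k]], weights, bias)
             for c in range(width - k + 1)]
            for r in range(height - k + 1)]
```

Each heat-map entry equals the fully-connected output for the window at that position, so no explicit cropping and re-running of the network is needed. A real convnet shares the computation of earlier layers as well, which is where the efficiency comes from.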

Non-Maximum Suppression (NMS) to avoid overlapped detections.

Source: Adrian Rosebrock, "Non-Maximum Suppression for Object Detection in Python" (PyImageSearch, 2014)
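A minimal sketch of greedy NMS, given here as a generic illustration rather than the implementation from the cited source. The box format (x1, y1, x2, y2), the overlap threshold of 0.3 and the function names are assumptions:

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring boxes, dropping any box that overlaps an already kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The function returns the indices of the surviving detections, ordered by decreasing score.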

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or Headhunter need extra information about pose or facial landmarks during training

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

FaceNet maps faces to a Euclidean space where distances correspond to face similarity.

End-to-end learning of an embedding (distance metric learning).

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

The embedding is learned by means of well-chosen triplets, using curriculum learning.

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
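A triplet is made of an anchor face, a positive (same identity) and a negative (different identity). The triplet loss of the FaceNet paper asks the anchor to be closer to the positive than to the negative by a margin, in squared Euclidean distance. A minimal sketch for one triplet of embeddings, where the margin value and the function names are illustrative assumptions:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive by at least the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)
```

Triplets whose loss is already zero contribute no gradient, which is why the choice of triplets matters for training.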

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Visual Query ("A dog") → Image Database → Expected outcome

Visual Query ("This dog") → Image Database → Expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Local hand-crafted features (e.g. SIFT)

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10
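Because bag-of-visual-words vectors are highly sparse, an inverted file can store, for each visual word, only the IDs of the images that contain it. A minimal sketch, assuming the local features have already been quantized into visual-word IDs; the function names and the simple vote-counting score are illustrative choices:

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """image_words maps an image ID to the list of visual-word IDs found in that image."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def candidates(inverted, query_words):
    """Rank database images by the number of visual words they share with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))
```

Only the lists of the words present in the query are visited, so images that share no word with the query are never touched.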

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
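Sum or max pooling collapses each channel of a convolutional feature map into a single number, which gives one descriptor with as many dimensions as channels. A minimal sketch, assuming the feature map is given as nested lists indexed [channel][row][column]. The cited methods add steps such as normalization, weighting or region pooling that are omitted here:

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Collapse every channel (a 2-D activation map) into one value by sum or max pooling."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce = sum if mode == "sum" else max
    return [reduce(value for row in channel for value in row)
            for channel in feature_maps]
```

The output length depends only on the number of channels, so images of different resolutions map to descriptors of the same size.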

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Image (resolution 336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector

Query Representation

- Global Search (GS)
- Local Search (LS)

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters are used in the three convnets:

theta_i = theta_0 = (filter weights H_l and biases b_l)

- Non-linearity: tanh
- Pooling: max

Upsampling and concatenation

Pixel-wise soft-max classifier
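A pixel-wise soft-max turns the class scores at each pixel into a probability distribution, independently of the neighbouring pixels. A minimal sketch, assuming scores stored as nested lists indexed [class][row][column]; the function names are hypothetical:

```python
import math


def pixelwise_softmax(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x); returns probabilities."""
    n_classes, height, width = len(scores), len(scores[0]), len(scores[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(n_classes)]
    for y in range(height):
        for x in range(width):
            pixel = [scores[c][y][x] for c in range(n_classes)]
            peak = max(pixel)  # subtract the maximum for numerical stability
            exps = [math.exp(s - peak) for s in pixel]
            total = sum(exps)
            for c in range(n_classes):
                probs[c][y][x] = exps[c] / total
    return probs


def label_map(probs):
    """Pick the most probable class at every pixel (first class wins ties)."""
    n_classes = len(probs)
    return [[max(range(n_classes), key=lambda c: probs[c][y][x])
             for x in range(len(probs[0][0]))]
            for y in range(len(probs[0]))]
```

Since every pixel is labelled on its own, nothing forces neighbouring pixels to agree on a class.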

Problem: no spatial consistency among labels.

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network
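One way to read the superpixel solution: the per-pixel class distributions are aggregated inside each superpixel, and all of its pixels receive the winning class. The sketch below assumes this aggregation is a plain sum (equivalent to an average for the argmax) and that a superpixel ID map is already available. Both the data layout and the function name are assumptions:

```python
def label_superpixels(probs, superpixels):
    """probs[y][x] is a list of class probabilities; superpixels[y][x] is a superpixel ID."""
    sums = {}
    for prob_row, sp_row in zip(probs, superpixels):
        for pixel_probs, sp in zip(prob_row, sp_row):
            acc = sums.setdefault(sp, [0.0] * len(pixel_probs))
            for c, value in enumerate(pixel_probs):
                acc[c] += value
    # Winning class per superpixel (first class wins ties).
    best = {sp: max(range(len(acc)), key=acc.__getitem__) for sp, acc in sums.items()}
    return [[best[sp] for sp in sp_row] for sp_row in superpixels]
```

In the test case used to check this sketch, a pixel that would be labelled class 1 on its own takes class 0 from its superpixel, which is the spatial consistency the method is after.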

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level.

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

C2 will be labelled with the class of C5.
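The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the cost values and the function name below are made up for illustration; only the rule itself comes from the slide:

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on the path from the leaf to the root."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # the root has no parent, which ends the walk
        best[leaf] = best_node
    return best


# Hypothetical tree: C7 is the root, C5 and C6 are its children.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C5": "C7", "C6": "C7", "C7": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C5": 0.3, "C6": 0.5, "C7": 0.9}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

With these made-up costs, C2 is covered by its parent C5 and therefore takes the class of C5, while C1 and C3 keep their own component.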

Objects Segmentation SDS

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region foreground (background masked out with the mean image). The two are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground with it is suboptimal.

Vector B: the BBOX CNN extracts feature vector 1 and a separate REGION CNN extracts feature vector 2. The two are concatenated as [1 2].

- Training: the 2 networks are trained in isolation
- Testing: results are combined

Vector C:

- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

The output of the penultimate fully connected layer is fed to an SVM.

Results on pixel IU (Jaccard index), used to evaluate semantic segmentation.

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
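For one category, pixel IU is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. A minimal sketch on small label maps, assuming integer category labels stored as nested lists; the function name is hypothetical:

```python
def pixel_iu(pred, truth, category):
    """Jaccard index of one category between two label maps of equal size.

    Returns None when the category appears in neither map.
    """
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += int(p == category and t == category)
            union += int(p == category or t == category)
    return inter / union if union else None
```

Averaging the per-category values gives a single score for the whole labeling.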

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 64: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

64

Test images are rescaled updown 3 times per octave to find different sizes

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

65

Sliding window of 227x227 over the test image

Source James Hays ldquoObject Category Detetcion Sliding Windowsrdquo (Brown University 2011)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

66

Fully-connected layers are converted to convolutional layers which allows processing images from any size

Long Jonathan Evan Shelhamer and Trevor Darrell Fully Convolutional Networks for Semantic Segmentation CVPR 2015

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels “person”, “bag”, “me” and “my bag”]

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 65: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Detection Faces DDFD Test

Sliding window of 227x227 over the test image.

Source: James Hays, “Object Category Detection: Sliding Windows” (Brown University, 2011)
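A minimal sketch of the sliding-window enumeration, assuming a fixed stride of 32 pixels (the stride is not given on the slide) and ignoring the multiple scales a real detector would also scan:

```python
def sliding_windows(width, height, window=227, stride=32):
    """Yield (x1, y1, x2, y2) boxes of size window x window that fit in the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, x + window, y + window)


# Toy image size chosen so that only a handful of windows fit.
boxes = list(sliding_windows(width=291, height=259))
print(len(boxes), boxes[0], boxes[-1])
```

Each box would be cropped and classified independently, which is the cost that the fully convolutional formulation avoids.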

Detection Faces DDFD Test

Fully-connected layers are converted to convolutional layers, which allows processing images of any size.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” CVPR 2015
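The equivalence can be illustrated with a toy example: a fully-connected unit over a k x k input is the same computation as a k x k filter, and sliding that filter over a larger input yields one score per location. This is a single-channel, single-unit sketch with made-up weights, not the DDFD network.

```python
def fully_connected(patch, weights, bias):
    """One dense unit applied to a flattened k x k patch."""
    flat_x = [v for row in patch for v in row]
    flat_w = [v for row in weights for v in row]
    return sum(x * w for x, w in zip(flat_x, flat_w)) + bias


def as_convolution(image, weights, bias):
    """Reuse the same weights as a filter slid over a larger input."""
    k = len(weights)
    heat_map = []
    for r in range(len(image) - k + 1):
        out_row = []
        for c in range(len(image[0]) - k + 1):
            patch = [image[r + i][c:c + k] for i in range(k)]
            out_row.append(fully_connected(patch, weights, bias))
        heat_map.append(out_row)
    return heat_map


weights = [[1.0, 0.0], [0.0, -1.0]]  # toy 2 x 2 weights of the dense unit
bias = 0.5
small = [[3.0, 1.0], [2.0, 4.0]]     # input of the size the dense unit expects
large = [[3.0, 1.0, 0.0], [2.0, 4.0, 1.0], [5.0, 2.0, 6.0]]

fc_out = fully_connected(small, weights, bias)
heat_map = as_convolution(large, weights, bias)
print(fc_out, heat_map)
```

On the small input the convolution returns a single value equal to the dense output; on the larger input it returns a 2 x 2 map of scores.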

Detection Faces DDFD Test

This makes it possible to:
- Efficiently run the convnet on images of any size
- Obtain a heat map of the face detector

Detection Faces DDFD Test

Non-Maximum Suppression (NMS) to avoid overlapping detections.

Source: Adrian Rosebrock, “Non-Maximum Suppression for Object Detection in Python” (PyImageSearch, 2014)
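A minimal greedy NMS sketch, assuming boxes given as (x1, y1, x2, y2) with non-zero area and an IoU threshold of 0.5 (an illustrative value). It is a generic formulation, not necessarily the one in the cited tutorial or in DDFD.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Visit boxes by decreasing score; keep one only if it does not
    overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68, so only the first and third detections survive.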

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models.
- The OpenCV face detector is an implementation of Viola & Jones.
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training.
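A precision vs recall curve can be traced by sorting the detections by score and sweeping the threshold. The sketch below assumes each detection has already been matched against the ground truth (the matching step is omitted) and uses made-up scores.

```python
def precision_recall_curve(scores, is_true_positive, num_ground_truth):
    """Return one (precision, recall) pair per detection, by decreasing score."""
    ranked = sorted(zip(scores, is_true_positive), key=lambda p: p[0], reverse=True)
    curve = []
    true_positives = 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        true_positives += int(hit)
        curve.append((true_positives / rank, true_positives / num_ground_truth))
    return curve


curve = precision_recall_curve(
    scores=[0.9, 0.8, 0.6, 0.3],
    is_true_positive=[True, False, True, True],
    num_ground_truth=4,
)
print(curve)
```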

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels “person”, “bag”, “me” and “my bag”]

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244

Faces Recognition FaceNet

The embedding is learned by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009
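The triplet objective can be sketched as a hinge on squared Euclidean distances between an anchor, a positive (same identity) and a negative (different identity). The formula follows the FaceNet paper as I understand it; the margin of 0.2 and the 2-D embeddings are illustrative values, and the network that produces the embeddings is omitted.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


anchor = [0.0, 1.0]
positive = [0.0, 0.5]       # same identity, close to the anchor
easy_negative = [1.0, 0.0]  # other identity, already far from the anchor
hard_negative = [0.0, 0.5]  # other identity, as close as the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy, hard)
```

The easy triplet contributes no loss and therefore no gradient, which is one way to see why the choice of triplets matters.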

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014 (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

- LFW: 99.63% (new record)
- YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6 [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016

Objects Recognition Retrieval

[Figure: visual query (“A dog”) → image database → expected outcome]

Objects Recognition Retrieval

[Figure: visual query (“This dog”) → image database → expected outcome]

Objects Recognition Retrieval

Instance Retrieval (instance: object, building, person, place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, …, v1n), …, vk = (vk1, …, vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE

[Table: for each visual word, the IDs of the images that contain it]
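Because the bag of visual words is highly sparse, an inverted file lets a query visit only the images that share at least one word with it. A minimal sketch with hypothetical word and image IDs (the values in the slide's table are not reproduced here):

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the sorted list of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def rank_candidates(inverted, query_words):
    """Count shared visual words per image, visiting only the query's words."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
image_words = {1: [1, 2], 10: [3, 6], 12: [1, 3], 30: [2]}
inverted = build_inverted_file(image_words)
ranking = rank_candidates(inverted, query_words=[1, 3])
print(inverted)
print(ranking)
```

Image 30 shares no word with the query, so it is never visited. A real system would weight the votes (for example with tf-idf) rather than simply count them.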

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105)

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065
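Sum and max pooling both collapse each channel of a convolutional layer into one number, so the descriptor length equals the number of channels regardless of the image size. A toy sketch with two 2 x 2 channels; the L2 normalization is a common follow-up step that I am assuming here, and the weighting schemes of the cited papers are not reproduced.

```python
import math


def pool_conv_features(feature_maps, mode="sum"):
    """Collapse each H x W channel into a single value (sum or max)."""
    reducer = sum if mode == "sum" else max
    return [reducer(v for row in channel for v in row) for channel in feature_maps]


def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy activations: 2 channels, each a 2 x 2 spatial map.
feature_maps = [
    [[0.0, 3.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 1.0]],
]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
print(sum_descriptor, max_descriptor, l2_normalize(sum_descriptor))
```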

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015

Objects Recognition Retrieval

Image resolution (336x256) → conv5_1 from VGG16 [1] (42x32) → 25K centroids → 25K-D vector
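The encoding step can be sketched as assigning every local conv feature to its nearest centroid and counting the assignments, which gives one histogram bin per centroid. The example uses 3 centroids and 2-D features as stand-ins for the 25K centroids and the conv5_1 features; how the centroids are learned (for example with k-means) is outside this sketch.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [sum((f - c) ** 2 for f, c in zip(feature, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def bag_of_words(local_features, centroids):
    """Histogram of visual words over all spatial locations of a conv layer."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # stand-in for 25K centroids
local_features = [[0.1, 0.2], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
histogram = bag_of_words(local_features, centroids)
print(histogram)
```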

Objects Recognition Retrieval

Query Representation:
- Global Search (GS)
- Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image annotated with the labels “person”, “bag”, “me” and “my bag”]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets: $\theta_i = \theta_0$, the filter weights ($H_l$) and biases ($b_l$)

- Non-linearity: tanh
- Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier
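The last two steps can be sketched together: coarser feature maps are upsampled to the finest resolution, the features of all scales are concatenated at each pixel, and a soft-max classifier labels every pixel independently. The sketch assumes nearest-neighbour upsampling, one channel per scale and a hypothetical 2-class linear layer, none of which are taken from the paper.

```python
import math


def upsample_nearest(feature_map, factor):
    """Repeat every value `factor` times along both spatial axes."""
    return [[v for v in row for _ in range(factor)]
            for row in feature_map for _ in range(factor)]


def softmax(scores):
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(scale_maps, class_weights):
    """Concatenate the per-scale features at each pixel and classify them."""
    height, width = len(scale_maps[0]), len(scale_maps[0][0])
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            feature = [m[r][c] for m in scale_maps]  # concatenation across scales
            scores = [sum(w * f for w, f in zip(ws, feature)) for ws in class_weights]
            probs = softmax(scores)
            row.append(probs.index(max(probs)))
        labels.append(row)
    return labels


fine = [[1.0, 0.0], [0.0, 1.0]]           # finest scale, one channel
coarse = upsample_nearest([[2.0]], 2)     # coarser scale, upsampled by 2
class_weights = [[1.0, 0.0], [0.0, 0.4]]  # hypothetical 2-class linear layer
labels = label_pixels([fine, coarse], class_weights)
print(labels)
```

Since each pixel is classified on its own, nothing forces neighbouring pixels to agree.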

Objects Segmentation Farabet

Problem: no spatial consistency among labels.

3 explored solutions:
1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network
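One simple way to use superpixels for spatial consistency is to average the per-pixel class distributions inside each superpixel and give all its pixels the winning class. This is a sketch of the general idea, assuming the distributions and a superpixel ID map are already available; it is not claimed to be the exact procedure of the paper.

```python
def smooth_with_superpixels(pixel_probs, superpixels):
    """Label every pixel with the class of highest mean probability in its superpixel."""
    sums = {}
    counts = {}
    for r, row in enumerate(superpixels):
        for c, sp in enumerate(row):
            probs = pixel_probs[r][c]
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for k, p in enumerate(probs):
                acc[k] += p
            counts[sp] = counts.get(sp, 0) + 1
    winner = {}
    for sp, acc in sums.items():
        means = [total / counts[sp] for total in acc]
        winner[sp] = means.index(max(means))
    return [[winner[sp] for sp in row] for row in superpixels]


# Toy 2 x 3 image, 2 classes; pixel (0, 1) on its own would be labeled class 1.
pixel_probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]],
]
superpixels = [
    [0, 0, 1],
    [0, 0, 1],
]
labels = smooth_with_superpixels(pixel_probs, superpixels)
print(labels)
```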

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

In the example, C2 will be labelled with the class of C5.
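The selection rule can be sketched as a walk from each leaf up to the root, keeping the node with the lowest cost. The tree, the node names and the costs below are hypothetical, chosen so that C2 ends up covered by C5 as in the slide; how the cost S is computed is not shown.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the node on its path to the root with minimal cost."""
    best = {}
    for leaf in leaves:
        node = leaf
        choice = leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = choice
    return best


# Hypothetical segmentation tree: leaves C1..C3, internal nodes C4, C5, root C6.
parent = {"C1": "C4", "C2": "C5", "C3": "C5", "C4": "C6", "C5": "C6", "C6": None}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.5, "C4": 0.4, "C5": 0.3, "C6": 0.9}
cover = optimal_cover(parent, cost, leaves=["C1", "C2", "C3"])
print(cover)
```

Each leaf then takes the class predicted for its selected component.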

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. “Simultaneous Detection and Segmentation” (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) to generate object candidates:
- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Page 67: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

67

This makes possible to Efficiently run the convnet on images of any size Obtain a heat-map of the face etector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Detection Faces DDFD Results

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM and HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

Faces → FaceNet → Euclidean space where distances correspond to face similarity
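Once faces live in such a space, verification can be reduced to thresholding a distance. The sketch below assumes embeddings are plain lists of floats; the 2-D vectors and the threshold of 1.0 are made-up illustration values, not figures from the paper.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def same_identity(emb_a, emb_b, threshold=1.0):
    """Declare two faces the same person when their embeddings are close.
    The threshold is a hypothetical value that would be tuned on held-out pairs."""
    return squared_l2(emb_a, emb_b) < threshold


# Toy 2-D embeddings (real embeddings have many more dimensions).
anchor = [0.6, 0.8]
same = [0.8, 0.6]
other = [-0.6, 0.8]

print(same_identity(anchor, same))
print(same_identity(anchor, other))
```

The same distance can also drive clustering or nearest-neighbour recognition, since all three tasks only need pairwise distances.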

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification.” The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

The embedding is learned by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
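As a rough sketch of what a triplet objective looks like: the loss below pulls an anchor towards a positive (same identity) and pushes it away from a negative (different identity) by at least a margin. The margin value, the toy vectors and the helper names are assumptions for illustration; the semi-hard test is one way to read "well-chosen" triplets, not a reproduction of the authors' training code.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin): zero once the negative is farther
    from the anchor than the positive by at least the margin."""
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Negative is farther than the positive but still inside the margin,
    so the triplet yields a non-zero loss."""
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return d_ap < d_an < d_ap + margin


a, p = [0.0, 0.0], [0.1, 0.0]
easy_negative = [1.0, 0.0]    # already far away: contributes no loss
close_negative = [0.2, 0.0]   # inside the margin: still informative

print(triplet_loss(a, p, easy_negative))
print(triplet_loss(a, p, close_negative))
```

Triplets with zero loss provide no gradient, which is why the choice of triplets matters so much for training speed.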

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions.” CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition.” Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search.” ICMR 2016.

Objects Recognition Retrieval

Visual Query: “A dog”

Image Database

Expected outcome

Visual Query: “This dog”

Image Database

Expected outcome

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn), in an N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE: maps each visual word to the IDs of the images that contain it
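Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to visit the images that share at least one word with the query. The following is a minimal sketch under stated assumptions: images are already described by lists of visual word IDs, the toy database is invented, and ranking by number of shared words stands in for the weighted similarity a real system would use.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word ID to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank database images by how many distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database: image ID -> visual words found in that image.
database = {1: [3, 7, 7, 9], 2: [7, 12], 3: [1, 3, 9]}
index = build_inverted_file(database)
ranking = query(index, [3, 9, 12])
print(ranking)
```

Images that share no word with the query are never touched, which is what keeps the search scalable.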

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max-pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
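To illustrate what pooling conv features into a global descriptor means, the sketch below collapses each channel of a conv layer output into one number, by sum or by max, and L2-normalizes the result. The nested-list layout (channels, then rows, then columns) and the toy activations are assumptions; the cited methods add weighting and post-processing steps that are not shown here.

```python
import math


def sum_pool(feature_maps):
    """One value per channel: the sum of all spatial activations."""
    return [sum(sum(row) for row in channel) for channel in feature_maps]


def max_pool(feature_maps):
    """One value per channel: the largest spatial activation."""
    return [max(max(row) for row in channel) for channel in feature_maps]


def l2_normalize(vector):
    """Scale the descriptor to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy conv output with 2 channels of size 2x2.
conv = [
    [[1.0, 0.0], [2.0, 1.0]],
    [[0.0, 3.0], [0.0, 0.0]],
]
sum_descriptor = l2_normalize(sum_pool(conv))
max_descriptor = max_pool(conv)
print(sum_descriptor, max_descriptor)
```

The descriptor length equals the number of channels, independent of the image size, which is what makes it usable as a global representation.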

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Objects Recognition Retrieval

Resolution: 336x256

conv5_1 from VGG16 [1]: 42x32

25K centroids, 25K-D vector
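The step from local conv features to a sparse bag-of-words vector can be sketched as nearest-centroid assignment followed by counting. This is an illustrative toy, assuming 2-D features and three centroids instead of the 25K used on the slide; in practice the centroids would come from clustering (for example k-means) and the search would use an approximate nearest-neighbour index.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_centroid(feature, centroids):
    """Index of the visual word (centroid) closest to a local feature."""
    return min(range(len(centroids)), key=lambda k: squared_l2(feature, centroids[k]))


def bag_of_words(local_features, centroids):
    """Sparse histogram {visual word: count}; words that never occur are not stored."""
    histogram = {}
    for feature in local_features:
        word = nearest_centroid(feature, centroids)
        histogram[word] = histogram.get(word, 0) + 1
    return histogram


# Hypothetical codebook and local features (one per spatial position of the conv map).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
local_features = [[0.1, 0.2], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
bow = bag_of_words(local_features, centroids)
print(bow)
```

Storing only the non-zero entries is what lets a very high-dimensional vector be indexed with an inverted file.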

Objects Recognition Retrieval

Query Representation

- Global Search (GS)
- Local Search (LS)


Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters are shared by the three convnets: theta_i = theta_0, where the parameters are the filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Pixel-wise soft-max classifier
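A pixel-wise soft-max classifier can be sketched as follows: every pixel carries a vector of class scores, the soft-max turns it into a probability distribution, and the pixel takes the most probable label. The score map layout (rows, columns, classes) and the numbers are assumptions for illustration; how the scores are produced from the concatenated multiscale features is not shown.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of per-class scores; returns the arg-max label per pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            label_row.append(max(range(len(probs)), key=lambda c: probs[c]))
        labels.append(label_row)
    return labels


# Toy 1x2 image with 3 classes per pixel.
score_map = [[[2.0, 0.5, 0.1], [0.0, 3.0, 0.0]]]
labels = label_pixels(score_map)
print(labels)
```

Each pixel is classified on its own here, which is exactly why neighbouring labels may disagree.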

Objects Segmentation Farabet

Problem: no spatial consistency among labels

Three explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
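The selection rule can be sketched as a walk from each leaf up to the root that remembers the cheapest node seen. The tree, the node names and the cost values below are hypothetical and are not taken from the slide's figure; the sketch also assumes the cost S of every component has already been computed.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from `leaf` to the root.
    `parent` maps each node to its parent (the root has no entry)."""
    best = leaf
    node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)
    return best


# Hypothetical segmentation tree: C7 is the root, C1..C4 are leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.4, "C5": 0.3, "C6": 0.5, "C7": 0.9}

cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ["C1", "C2", "C3", "C4"]}
print(cover)
```

In this toy tree the leaf C2 is best explained by its parent C5, so it would inherit the class predicted for C5, while the other leaves keep their own prediction.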

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. “Simultaneous Detection and Segmentation.” ECCV 2014.

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Vector A: feature vector 1 (BBOX CNN on the bounding box) and feature vector 2 (BBOX CNN on the region, with the background masked out with the mean image) are concatenated as [1 2]. The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground with it is suboptimal.

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from the REGION CNN; they are concatenated as [1 2].

- Training: the 2 networks are trained in isolation
- Testing: results are combined

Objects Segmentation SDS

Vector C:

- Training: as a whole (using segmentation overlap)
- Testing: results are combined (using the output of the penultimate layer)

Penultimate fully connected layer → SVM

Objects Segmentation SDS

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
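For reference, pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below assumes label maps stored as nested lists of integer category IDs; the toy maps are invented.

```python
def pixel_iu(pred, gt, category):
    """Intersection over union (Jaccard index) of one category between two label maps."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            inter += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")


# Toy 2x3 label maps (1 = object category, 0 = background).
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(pixel_iu(pred, gt, category=1))
```

A benchmark score is usually the mean of this value over all categories.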


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 68: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Test

68

Non-Maximum Suppression (NMS) to avoid overlapped detections

Source Adrian Rosebrock ldquoNon-Maximum Suppression for Object Detection in Pythonrdquo (Pyimagesearch 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 69: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

69

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Detection Faces DDFD Results

70

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models- OpenCV face detector is an implementation of Viola amp Jones- IMPORTANT DPM or Headhunter need extra information about pose or facial landmarks during

training

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

71

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 72

Faces Recognition FaceNet

Schroff Florian Dmitry Kalenichenko and James Philbin FaceNet A Unified Embedding for Face Recognition and Clustering CVPR 2015

(Extended summary slides by Xavier Giro on the ReadCV seminar)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 73

Faces Recognition FaceNet

FacesEuclidean space where distances correspond to face similarity

FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 74

Faces Recognition FaceNetEnd-to-end learning of an embedding (distance metric learning)

Weinberger Kilian Q and Lawrence K Saul Distance metric learning for large margin nearest neighbor classification The Journal of Machine Learning Research 10 (2009) 207-244

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 75

Faces Recognition FaceNetby means of well chosen triplets using curriculum learning

Bengio Yoshua Jeacuterocircme Louradour Ronan Collobert and Jason Weston Curriculum learning In Proceedings of the 26th annual international conference on machine learning pp 41-48 ACM 2009

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Faces Recognition: FaceNet (Test)

LFW (Labeled Faces in the Wild): 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition: FaceNet (Software)

Software implementation: OpenFace

Faces Recognition: VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition: Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search." ICMR 2016.

Objects Recognition: Retrieval

Image Database

Visual Query: "A dog"

Expected outcome

Objects Recognition: Retrieval

Image Database

Visual Query: "This dog"

Expected outcome

Objects Recognition: Retrieval

Instance Retrieval (instance: object, building, person, place...)

Objects Recognition: Retrieval

Local hand-crafted features (e.g. SIFT) in an N-dimensional feature space: v1 = (v11, ..., v1n), ..., vk = (vk1, ..., vkn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE: word -> image IDs (example table listing, for each visual word, the IDs of the images that contain it)
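Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images in which it occurs. The following is a minimal sketch of that data structure; the function names, the voting score and the toy database are assumptions made for illustration, not the scoring used by any particular retrieval system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word to the set of image IDs that contain it.

    image_words: dict of image ID -> iterable of visual word indices.
    """
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def search(inverted, query_words):
    """Rank images by how many distinct query words they share (simple voting)."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Highest vote count first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only the posting lists of the words present in the query are visited, so images that share no word with the query are never touched.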

Objects Recognition: Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition: Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition: Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max-pooled conv features as global representation
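Pooling convolutional features into a global descriptor amounts to collapsing each channel's spatial map into a single number. Below is a minimal sketch using plain nested lists; the function name, the (channels, height, width) layout and the optional L2 normalization are assumptions for illustration, and the cited papers add their own weighting and post-processing steps that are not reproduced here.

```python
import math


def pool_conv_features(feature_maps, mode="sum", normalize=True):
    """Collapse a (channels, height, width) activation volume to one value per channel.

    mode: "sum" adds all spatial positions, "max" keeps the strongest one.
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    descriptor = [
        float(reduce_fn(value for row in channel for value in row))
        for channel in feature_maps
    ]
    if normalize:
        norm = math.sqrt(sum(x * x for x in descriptor))
        if norm > 0:
            descriptor = [x / norm for x in descriptor]
    return descriptor
```

The resulting vector has one dimension per channel, regardless of the input image size.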

Objects Recognition: Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Objects Recognition: Retrieval

(336x256) resolution -> conv5_1 from VGG16 [1] -> (42x32) -> 25K centroids -> 25K-D vector
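The last step of this pipeline, turning a grid of local convolutional features into a bag-of-words vector, can be sketched as a nearest-centroid assignment followed by a histogram. The function names and the two toy centroids are assumptions for illustration; the slide's setting would use 25K centroids learned beforehand (for instance with k-means), which is not shown here.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    best_index, best_distance = 0, float("inf")
    for index, centroid in enumerate(centroids):
        distance = sum((a - b) ** 2 for a, b in zip(feature, centroid))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def bag_of_words(local_features, centroids):
    """Histogram of visual words: one bin per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram
```

With many more centroids than local features per image, most bins stay at zero, which is what makes the representation sparse.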

Objects Recognition: Retrieval (Query Representation)

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects: Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation: Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation: Farabet

Pyramid of three spatial scales

The same parameters in the three convnets: theta_i = theta_0 = {filter weights (H_l) and biases (b_l)}

Non-linearity: tanh

Pooling: max

Objects Segmentation: Farabet

Upsampling and concatenation

Pixel-wise softmax classifier
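The upsampling, concatenation and pixel-wise softmax steps can be sketched on tiny nested lists. Everything below is an illustrative assumption: the function names, the nearest-neighbour upsampling and the idea that per-class scores are already available for each pixel (in the paper they would come from a classifier applied to the concatenated multiscale features).

```python
import math


def upsample_nearest(feature_map, factor):
    """Repeat each cell of an H x W grid of feature vectors factor times per axis."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [list(feature_map[y // factor][x // factor]) for x in range(width * factor)]
        for y in range(height * factor)
    ]


def concat_features(maps):
    """Concatenate, pixel by pixel, the feature vectors of same-sized maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [[v for m in maps for v in m[y][x]] for x in range(width)]
        for y in range(height)
    ]


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Assign each pixel the class with the highest softmax probability."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels
```

Each pixel is labelled independently here, which is exactly why the predicted label map can lack spatial consistency.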

Objects Segmentation: Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation: Farabet

Solution 1: Superpixels (prediction with a 2-layer network)

Solution 2: Superpixels + CRF (prediction with a 2-layer network)

Objects Segmentation: Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
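The rule for choosing the optimal component can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree encoding (a parent dictionary), the function name and the cost values used to try it are hypothetical; in the paper the cost S of each component is derived from the classifier's output, which is not modelled here.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node with minimal cost on its path to the root.

    parent: dict mapping node -> parent node (the root maps to None).
    cost:   dict mapping node -> cost S of that component.
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)
        best[leaf] = best_node
    return best
```

For example, with a made-up tree where leaves `p1` and `p2` belong to `C2`, and `C2` belongs to `C5`, a lower cost at `C5` than at `C2` or the leaves means both pixels take the label of `C5`.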

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. "Simultaneous Detection and Segmentation" (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

- Cuts algorithm
- Hierarchical segmenter
- Grouping strategy to combine multiscale regions

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector A: feature vectors 1 and 2 are both extracted with the BBOX CNN and concatenated as [1 2]

- Feature vector 1: BBOX CNN applied to the bounding box
- Feature vector 2: BBOX CNN applied to the region, with the background masked out with the mean image

The BBOX CNN is fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal.
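Masking out the background with the mean image can be sketched as a per-pixel selection: foreground pixels keep their value, background pixels take the value of the mean image at the same position, so that they become zero once the mean is subtracted before the CNN. The function name and the single-channel toy data are assumptions for illustration.

```python
def mask_background(crop, mask, mean_image):
    """Replace background pixels (mask == 0) with the mean image's pixels.

    crop, mask and mean_image are same-sized 2-D grids (lists of rows).
    """
    return [
        [pixel if keep else mean for pixel, keep, mean in zip(crop_row, mask_row, mean_row)]
        for crop_row, mask_row, mean_row in zip(crop, mask, mean_image)
    ]
```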

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector B: feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN, concatenated as [1 2]

Training: the 2 networks are trained in isolation

Testing: results are combined

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Penultimate fully connected layer -> SVM

Objects Segmentation: SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
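The pixel IU (intersection over union, or Jaccard index) for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch, where the function name and the toy label maps are assumptions for illustration:

```python
def pixel_iu(prediction, ground_truth, category):
    """Intersection over union of one category between two label maps."""
    intersection = union = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred, in_gt = pred == category, gt == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    # Undefined when the category appears in neither map.
    return intersection / union if union else float("nan")
```

Averaging this value over all categories gives a single score for a semantic segmentation result.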

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

Xavier Giró-i-Nieto

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Detection Faces: DDFD (Results)

Precision vs Recall Curves

- DPM corresponds to Deformable Part-based Models
- The OpenCV face detector is an implementation of Viola & Jones
- IMPORTANT: DPM or HeadHunter need extra information about pose or facial landmarks during training

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Faces Recognition: FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)



Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 72: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Faces Recognition FaceNet

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR 2015.

(Extended summary slides by Xavier Giró on the ReadCV seminar)

Faces Recognition FaceNet

FaceNet maps faces to a Euclidean space where distances correspond to face similarity

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." The Journal of Machine Learning Research 10 (2009): 207-244.

Faces Recognition FaceNet

by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. "Curriculum learning." In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
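The triplet idea can be illustrated with a minimal sketch. It assumes the usual triplet formulation (an anchor face, a positive of the same identity, a negative of another identity) with squared Euclidean distances; the function names, the toy 2-D embeddings and the margin value are illustrative assumptions, not taken from the slides.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (illustrative default value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings (hypothetical values).
anchor, positive, negative = [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]

satisfied = triplet_loss(anchor, positive, negative)  # negative already far away
violated = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
```

Triplets whose loss is already zero contribute no gradient, which is one way to read why the choice of triplets matters during training.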

Faces Recognition FaceNet

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 1 (NN1): ZF

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. “Bags of Local Convolutional Features for Scalable Instance Search”, ICMR 2016.

Objects Recognition Retrieval

Image Database

Visual Query: “A dog”

Expected outcome

Objects Recognition Retrieval

Image Database

Visual Query: “This dog”

Expected outcome

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word / Image ID: 1 1 12 2 1 30 1023 10 124 23 6 10

Bag of Visual Words: N-dimensional feature space, high-dimensional, highly sparse
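Because a bag of visual words is highly sparse, an inverted file only needs to store, for each visual word, the images that contain it. The following is a minimal sketch under that assumption; the toy database, the image IDs and the vote-counting score are hypothetical and much simpler than a weighted scheme such as tf-idf.

```python
from collections import defaultdict


def build_inverted_file(database):
    """Map each visual word to the set of image IDs that contain it.

    `database` maps image_id -> iterable of visual word IDs.
    """
    index = defaultdict(set)
    for image_id, words in database.items():
        for word in words:
            index[word].add(image_id)
    return index


def search(index, query_words):
    """Rank images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Highest vote count first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical database: image ID -> visual words found in that image.
toy_database = {1: [1, 2], 10: [3, 6], 12: [1, 3], 23: [4]}
toy_index = build_inverted_file(toy_database)
ranking = search(toy_index, [1, 3])
```

Only the posting lists of the query's words are visited, so images that share no word with the query are never touched.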

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max pooled conv features as global representation
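As a rough illustration of sum or max pooling, the sketch below collapses an H x W x D convolutional feature map into a single D-dimensional vector. The nested-list layout, the function names and the final L2 normalization are assumptions made for the example; the cited papers add weighting and post-processing steps that are not reproduced here.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Collapse an H x W x D feature map (nested lists) into a D-dim vector."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reducer = sum if mode == "sum" else max
    # Flatten the spatial grid into a list of D-dimensional local features.
    locations = [vector for row in feature_map for vector in row]
    # zip(*locations) groups the values of each channel across all locations.
    return [reducer(channel) for channel in zip(*locations)]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy 2 x 2 feature map with D = 2 channels (hypothetical activations).
toy_map = [[[1.0, 0.0], [2.0, 3.0]],
           [[0.0, 1.0], [1.0, 0.0]]]

sum_descriptor = pool_conv_features(toy_map, "sum")
max_descriptor = pool_conv_features(toy_map, "max")
unit_descriptor = l2_normalize(max_descriptor)
```

The descriptor length depends only on the number of channels D, not on the image size, which is what makes it usable as a global representation.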

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation
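A minimal VLAD sketch, assuming the standard formulation: each local feature is assigned to its nearest centroid, the residuals (feature minus centroid) are accumulated per centroid, and the concatenation is L2-normalized. The centroids and descriptors below are toy values, and the intra-normalization variants used in practice are left out.

```python
import math


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def vlad(descriptors, centroids):
    """Encode local descriptors as a K*D vector of accumulated residuals."""
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest centroid.
        j = min(range(k), key=lambda c: squared_distance(x, centroids[c]))
        for t in range(d):
            residuals[j][t] += x[t] - centroids[j][t]
    flat = [value for row in residuals for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


# Toy setup: K = 2 centroids, D = 2 dimensions (hypothetical values).
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
code = vlad(toy_descriptors, toy_centroids)
```

Unlike a bag of words, which only counts assignments, this encoding keeps where the features fall relative to each centroid.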

Objects Recognition Retrieval

Resolution (336x256)

conv5_1 from VGG16 [1]

(42x32)

25K centroids, 25K-D vector
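The numbers above suggest why the resulting vector is sparse. The sketch below assumes that 42x32 is the grid of local convolutional features and that "25K" means 25,000 centroids; under those assumptions at most 42 x 32 = 1344 of the 25,000 dimensions can be non-zero. The nearest-centroid assignment and the toy values are illustrative only.

```python
from collections import Counter


def nearest(vector, centroids):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(vector, centroids[j])),
    )


def assignment_map(feature_map, centroids):
    """Replace every local feature by the index of its visual word."""
    return [[nearest(vector, centroids) for vector in row] for row in feature_map]


def bag_of_words(assignments):
    """Sparse histogram: visual word -> number of locations assigned to it."""
    return Counter(word for row in assignments for word in row)


# Toy 2 x 2 grid of 2-D features and 3 centroids (hypothetical values).
toy_features = [[[0.1, 0.0], [0.9, 1.0]],
                [[0.0, 0.2], [5.0, 5.0]]]
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_assignments = assignment_map(toy_features, toy_centroids)
toy_bow = bag_of_words(toy_assignments)

# Upper bound on the density of the real descriptor, under the stated assumptions.
locations = 42 * 32               # 1344 local features per image
max_density = locations / 25000   # about 5.4% of the dimensions at most
```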

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure labels: person, bag, me, my bag]

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

θ_i = θ_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation
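One way to picture this step: the feature maps computed at the coarser scales are upsampled to the finest resolution and the per-pixel feature vectors are concatenated. The sketch below assumes nearest-neighbour upsampling and scale factors of 2 and 4 between the three maps; both are illustrative assumptions and the toy values are hypothetical.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every location `factor` times along both spatial axes."""
    upsampled = []
    for row in feature_map:
        wide_row = [vector for vector in row for _ in range(factor)]
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def concatenate(feature_maps):
    """Concatenate per-pixel feature vectors of equally sized maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum((fmap[y][x] for fmap in feature_maps), []) for x in range(width)]
        for y in range(height)
    ]


# Toy maps at three scales with one channel each (hypothetical values).
fine = [[[float(4 * y + x)] for x in range(4)] for y in range(4)]   # 4 x 4
mid = [[[100.0], [101.0]], [[102.0], [103.0]]]                      # 2 x 2
coarse = [[[200.0]]]                                                # 1 x 1

dense = concatenate([fine, upsample_nearest(mid, 2), upsample_nearest(coarse, 4)])
```

Every pixel of `dense` then carries features from all three scales, so a classifier applied per pixel sees both local detail and wider context.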

Objects Segmentation Farabet

Pixel-wise soft-max classifier
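A pixel-wise soft-max classifier can be sketched as follows: each pixel has one score per class, the soft-max turns those scores into a probability distribution, and the pixel takes the most probable class. The score map, the number of classes and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixels(score_map):
    """Return per-pixel class probabilities and the arg-max label map."""
    probabilities = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [
        [max(range(len(p)), key=p.__getitem__) for p in row]
        for row in probabilities
    ]
    return probabilities, labels


# Toy 1 x 2 image with three class scores per pixel (hypothetical values).
toy_scores = [[[2.0, 1.0, 0.1], [0.0, 0.0, 5.0]]]
toy_probabilities, toy_labels = classify_pixels(toy_scores)
```

Each pixel is labelled independently of its neighbours here, so nothing in this step enforces spatially coherent labels.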

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels
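A hedged sketch of how superpixels can enforce local consistency: the per-pixel class distributions are pooled inside each superpixel and every pixel of the superpixel receives the winning class. The pooling rule (a sum, which has the same arg-max as the mean), the data layout and the toy values are assumptions made for illustration.

```python
def smooth_with_superpixels(probabilities, superpixels):
    """Give every pixel the dominant class of the superpixel it belongs to.

    `probabilities[y][x]` is a list of class probabilities and
    `superpixels[y][x]` is the ID of the superpixel containing that pixel.
    """
    pooled = {}
    for prob_row, sp_row in zip(probabilities, superpixels):
        for distribution, sp_id in zip(prob_row, sp_row):
            accumulator = pooled.setdefault(sp_id, [0.0] * len(distribution))
            for cls, value in enumerate(distribution):
                accumulator[cls] += value
    # The summed distribution has the same arg-max as the averaged one.
    winner = {
        sp_id: max(range(len(acc)), key=acc.__getitem__)
        for sp_id, acc in pooled.items()
    }
    return [[winner[sp_id] for sp_id in sp_row] for sp_row in superpixels]


# Toy 1 x 4 image, two classes, two superpixels (hypothetical values).
toy_probabilities = [[[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.1, 0.9]]]
toy_superpixels = [[0, 0, 0, 1]]
smoothed = smooth_with_superpixels(toy_probabilities, toy_superpixels)
```

In this toy case the second pixel would be labelled 1 on its own, but it takes label 0 because its superpixel as a whole favours class 0.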

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
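The selection rule can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost S. The tree, the costs and the class names below are hypothetical (they are not the figure from the slide); they are chosen so that leaf C2 ends up with the class of its ancestor C5.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from `leaf` to the root."""
    best = node = leaf
    while node is not None:
        if cost[node] < cost[best]:
            best = node
        node = parent.get(node)  # the root has no parent, which ends the walk
    return best


# Hypothetical tree: C7 is the root, C5 and C6 are its children,
# C1 and C2 are leaves under C5, C3 and C4 are leaves under C6.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.1, "C4": 0.5, "C5": 0.3, "C6": 0.7, "C7": 0.9}
component_class = {"C1": "sky", "C2": "tree", "C3": "road", "C4": "car",
                   "C5": "building", "C6": "road", "C7": "unknown"}

leaves = ["C1", "C2", "C3", "C4"]
chosen = {leaf: optimal_component(leaf, parent, cost) for leaf in leaves}
labels = {leaf: component_class[chosen[leaf]] for leaf in leaves}
```

Each leaf picks its own observation level, so two neighbouring leaves can be explained by components of different sizes.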

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. "Simultaneous Detection and Segmentation." ECCV 2014.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates:

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A

BBOX CNN: feature vector 1

BBOX CNN: feature vector 2

Concatenation [1 2]

Finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Region foreground: background masked out with the mean image
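Masking the background with the mean image can be sketched as follows: pixels inside the region keep their value, and pixels outside it are replaced by the value of the mean image at the same position. The grayscale nested-list layout and the toy values are assumptions for illustration; after mean subtraction the masked pixels become zero.

```python
def mask_background(crop, region_mask, mean_image):
    """Keep foreground pixels and replace background pixels by the mean image.

    All three arguments are H x W nested lists; `region_mask[y][x]` is True
    for pixels that belong to the region.
    """
    return [
        [pixel if inside else mean for pixel, inside, mean in zip(p_row, m_row, mu_row)]
        for p_row, m_row, mu_row in zip(crop, region_mask, mean_image)
    ]


# Toy 2 x 3 grayscale crop (hypothetical values).
toy_crop = [[10, 20, 30],
            [40, 50, 60]]
toy_mask = [[False, True, True],
            [False, False, True]]
toy_mean = [[100, 100, 100],
            [110, 110, 110]]

masked = mask_background(toy_crop, toy_mask, toy_mean)
centered = [[p - m for p, m in zip(p_row, m_row)] for p_row, m_row in zip(masked, toy_mean)]
```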

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B

Training: 2 networks trained in isolation

Testing: results are combined

BBOX CNN: feature vector 1

REGION CNN: feature vector 2

Concatenation [1 2]

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Penultimate fully connected layer

SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
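For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number of pixels labelled with it in either. The label maps below are toy values and the function name is hypothetical.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of one category between two label maps."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    # Undefined when the category is absent from both maps.
    return intersection / union if union else float("nan")


# Toy 2 x 3 label maps with categories 0 and 1 (hypothetical values).
toy_prediction = [[1, 1, 0],
                  [0, 1, 0]]
toy_ground_truth = [[1, 0, 0],
                    [0, 1, 1]]

iu_category_1 = pixel_iu(toy_prediction, toy_ground_truth, 1)
mean_iu = sum(pixel_iu(toy_prediction, toy_ground_truth, c) for c in (0, 1)) / 2
```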


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto



Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016)

74

Faces Recognition FaceNet

End-to-end learning of an embedding (distance metric learning)

Weinberger, Kilian Q., and Lawrence K. Saul. “Distance metric learning for large margin nearest neighbor classification”. The Journal of Machine Learning Research 10 (2009): 207-244.

75

Faces Recognition FaceNet

... by means of well-chosen triplets, using curriculum learning

Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum learning”. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48. ACM, 2009.
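The slides name triplets but do not spell out the loss. As a minimal sketch, assuming the usual triplet formulation with squared Euclidean distances between embeddings, the function below returns zero once the negative is farther from the anchor than the positive by at least a margin. The function names, the toy 2-D embeddings and the margin value of 0.2 are illustrative assumptions, not taken from the slides.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin value is an assumption made for this example.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D embeddings (real face embeddings have many more dimensions).
anchor, positive = [0.0, 1.0], [0.0, 0.9]
easy_negative = [1.0, 0.0]   # already far from the anchor
hard_negative = [0.1, 0.9]   # almost as close as the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

Only the hard negative yields a non-zero loss here, which is one way to read "well-chosen triplets": triplets that already satisfy the margin contribute nothing to training.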

76

Faces Recognition FaceNet

77

Faces Recognition FaceNet

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks”. In Computer Vision – ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 1 (NN1): ZF

78

Faces Recognition FaceNet

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going Deeper With Convolutions”. CVPR 2015. (Slides by Elisa Sayrol)

79

Faces Recognition FaceNet

80

Faces Recognition FaceNet Test

LFW: 99.63% (new record)

YouTube Faces DB: 95.12%

81

Faces Recognition FaceNet Software

Software implementation: OpenFace

82

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition”. Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

83

Objects Recognition Retrieval

E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

84

Objects Recognition Retrieval

Image Database

Visual Query: “A dog”

Expected outcome

85

Objects Recognition Retrieval

Image Database

Visual Query: “This dog”

Expected outcome

86

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

87

Objects Recognition Retrieval

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word | Image ID
1 | 1 12
2 | 1 30 102
3 | 10 12
4 | 23 6 10

Local hand-crafted features (e.g. SIFT)

Bag of Visual Words

N-dimensional feature space

High-dimensional, highly sparse
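Because bag-of-visual-words vectors are highly sparse, an inverted file only needs to store, for each visual word, the images in which it occurs. The sketch below is a minimal illustration under that assumption. The toy database, the function names and the vote-count ranking are hypothetical and much simpler than a production retrieval system (no tf-idf weighting, for instance).

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word ID to the sorted list of image IDs containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    # Most shared words first; ties broken by image ID for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy database: image ID -> visual word IDs found in that image.
image_words = {1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 4]}
index = build_inverted_file(image_words)
ranking = search(index, [1, 3])
```

Only the posting lists of the query's words are visited, so images that share no word with the query are never touched.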

88

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). “ImageNet classification with deep convolutional neural networks”. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

89

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). “Neural codes for image retrieval”. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). “CNN features off-the-shelf: an astounding baseline for recognition”. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

90

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). “Aggregating local deep features for image retrieval”. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). “Particular object retrieval with integral max-pooling of CNN activations”. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). “Cross-dimensional Weighting for Aggregated Deep Convolutional Features”. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks: sum/max pooled conv features as global representation
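As a rough illustration of sum and max pooling, the sketch below collapses a C x H x W activation volume into a single C-dimensional descriptor, one value per channel. The nested-list representation, the function names and the final L2 normalization are assumptions made for this example. The cited papers add their own weighting and post-processing, which are not reproduced here.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Collapse a C x H x W volume (nested lists) into one C-dimensional vector."""
    pooled = []
    for channel in feature_map:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) if mode == "sum" else max(values))
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm > 0 else vector


# Hypothetical activations: 2 channels over a 2x2 spatial grid.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 3.0]]]
sum_desc = pool_conv_features(fmap, "sum")
max_desc = pool_conv_features(fmap, "max")
global_desc = l2_normalize(max_desc)
```

The descriptor length depends only on the number of channels, not on the image size, which is what makes it usable as a global representation.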

91

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). “Exploiting local features from deep networks for image retrieval”. In DeepVision CVPRW 2015.

Convolutional Neural Networks: conv features encoded with VLAD as global representation

92

Objects Recognition Retrieval

93

Objects Recognition Retrieval

Resolution (336x256)

conv5_1 from VGG16 [1]

(42x32)

25K centroids, 25K-D vector
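One plausible reading of this pipeline is that every spatial position of the 42x32 feature map (1,344 positions per image) contributes one local feature, that each local feature is assigned to its nearest centroid, and that the resulting counts form the bag-of-words vector. The sketch below illustrates that assignment on a toy scale. The three 2-D centroids stand in for the 25K centroids, and all names are hypothetical.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda k: squared_distance(feature, centroids[k]))


def bag_of_words(local_features, centroids):
    """Histogram of visual words: one count per centroid."""
    histogram = [0] * len(centroids)
    for feature in local_features:
        histogram[nearest_centroid(feature, centroids)] += 1
    return histogram


# Toy stand-ins: 3 centroids instead of 25K, 2-D features instead of conv5_1 columns.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
local_features = [[0.1, 0.0], [0.9, 1.2], [4.0, 6.0], [1.1, 0.8]]
histogram = bag_of_words(local_features, centroids)
```

With 25K centroids and only 1,344 local features per image, most histogram bins are necessarily zero, so the vector is sparse.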

94

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

95

Objects Recognition Retrieval

96

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels “person”, “bag”, “me” and “my bag”]

97

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

98

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling”. TPAMI 2013.

99

Objects Segmentation Farabet

Pyramid of three spatial scales

100

Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

101

Objects Segmentation Farabet

Upsampling and concatenation

102

Objects Segmentation Farabet

Pixel-wise soft-max classifier
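A pixel-wise soft-max classifier turns the class scores at each pixel into a probability distribution and labels the pixel with the most probable class. The sketch below is a minimal illustration. The 2x2 score map with three classes and the function names are hypothetical, and the scores would in practice come from the concatenated multi-scale features.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def label_pixels(score_map):
    """score_map[y][x] is a list of class scores; return the arg-max label per pixel."""
    labels = []
    for row in score_map:
        label_row = []
        for scores in row:
            probs = softmax(scores)
            label_row.append(probs.index(max(probs)))
        labels.append(label_row)
    return labels


# Hypothetical 2x2 image with 3 candidate classes per pixel.
score_map = [[[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]],
             [[0.0, 1.0, 0.0], [1.5, 1.4, 0.3]]]
labels = label_pixels(score_map)
```

Each pixel is labelled from its own scores only, so nothing in this step forces neighbouring pixels to agree.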

103

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

104

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels

105

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

106

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

107

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

108

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
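The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the component with the lowest cost S seen on the way. The tree, the cost values and the function name below are hypothetical. How S is computed for each component is taken as given here and is not reproduced from the paper.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, pick the component on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent[node]
        best[leaf] = choice
    return best


# Hypothetical tree: C1, C2, C3 are leaves, C4 merges C1 and C2, C5 is the root.
parent = {"C1": "C4", "C2": "C4", "C3": "C5", "C4": "C5", "C5": None}
cost = {"C1": 0.9, "C2": 0.7, "C3": 0.2, "C4": 0.4, "C5": 0.6}
cover = optimal_cover(parent, cost, ["C1", "C2", "C3"])
```

In this toy tree the leaves C1 and C2 both take the label of their parent C4, while C3 keeps its own, so different pixels end up observed at different levels of the hierarchy.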

109

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. “Simultaneous Detection and Segmentation” (ECCV 2014)

110

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale combinatorial grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

111

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

[Figure labels: BBOX CNN; feature vector 1; feature vector 2; [1 2]; BBOX CNN; vector A]

Fine-tuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

Background masked out with the mean image
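Masking with the mean image can be sketched as follows: pixels inside the region mask keep their value, and pixels outside it are replaced by the corresponding mean-image value. The grayscale toy crop, the constant mean image and the function name are assumptions made for this example.

```python
def mask_background(image, mask, mean_image):
    """Keep foreground pixels; replace background pixels with the mean-image value."""
    return [
        [pixel if keep else mean_pixel
         for pixel, keep, mean_pixel in zip(image_row, mask_row, mean_row)]
        for image_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]


# Hypothetical 2x3 grayscale crop, binary region mask and constant mean image.
crop = [[200, 180, 30], [190, 20, 10]]
mask = [[True, True, False], [True, False, False]]
mean_image = [[118, 118, 118], [118, 118, 118]]
masked = mask_background(crop, mask, mean_image)
```

If the network subtracts the mean image from its input, the masked background becomes exactly zero after that step, so only the region foreground carries signal.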

112

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

[Figure labels: BBOX CNN; feature vector 1; feature vector 2; [1 2]; REGION CNN; vector B]

113

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

114

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

penultimate fully connected layer

SVM

115

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

116

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)

117

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

118

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

[Figure: example image with the labels “person”, “bag”, “me” and “my bag”]

119

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Page 76: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 76

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 77

Faces Recognition FaceNet

Zeiler Matthew D and Rob Fergus Visualizing and understanding convolutional networks In Computer

VisionndashECCV 2014 pp 818-833 Springer International Publishing 2014 (Slides by Xavier Giroacute-i-Nieto)

Architecture 1 (NN1) ZF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

Sum/max-pooled conv features as global representation
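As a rough sketch of what sum or max pooling of convolutional features means, the snippet below collapses a toy feature map into one value per channel and L2-normalises the result. The function names, the `[channels][height][width]` layout and the final normalisation are assumptions for illustration; the cited papers add further steps (for example weighting or whitening) that are not shown here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def pool_conv_features(feature_map, mode="sum"):
    """Pool a conv feature map [channels][height][width] into one value per channel."""
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    pooled = [reduce_fn(value for row in channel for value in row)
              for channel in feature_map]
    return l2_normalize(pooled)


# Toy feature map with 2 channels over a 2x2 spatial grid.
feature_map = [
    [[1.0, 2.0], [0.0, 1.0]],   # channel 0
    [[0.0, 3.0], [0.0, 0.0]],   # channel 1
]
sum_descriptor = pool_conv_features(feature_map, mode="sum")
max_descriptor = pool_conv_features(feature_map, mode="max")
```

The resulting descriptor has as many dimensions as the layer has channels, regardless of the image size, so two images can be compared with a single dot product.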

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation
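A minimal sketch of VLAD encoding, assuming the local conv features have already been extracted and a codebook of centroids is available: each descriptor is assigned to its nearest centroid, the residuals are accumulated per centroid, and the concatenation is L2-normalised. The names `vlad_encode` and `squared_distance` and the toy values are hypothetical; normalisation variants used in practice are omitted.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vlad_encode(descriptors, centroids):
    """Concatenate, per centroid, the summed residuals of its assigned descriptors."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for descriptor in descriptors:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(descriptor, centroids[k]))
        for d in range(dim):
            residuals[nearest][d] += descriptor[d] - centroids[nearest][d]
    flat = [value for residual in residuals for value in residual]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: 3 local conv descriptors (2-D) and a 2-word codebook.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
vlad = vlad_encode(descriptors, centroids)
```

The output length is the number of centroids times the descriptor dimension, so a small codebook already gives a fairly long global vector.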

Image resolution: 336x256

conv5_1 from VGG16 [1]: 42x32 feature map

25K centroids: 25K-D vector
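The pipeline on this slide can be sketched as follows: every spatial position of the conv feature map is assigned to its nearest centroid (its visual word), and the image is described by the histogram of those words. The sizes below are toy stand-ins for the 42x32 map and the 25K centroids, and the function names are assumptions for the example.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to the feature (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])))


def bag_of_words(feature_map, centroids):
    """Return (assignment map, histogram) for a [height][width][channels] feature map."""
    assignments = [[nearest_centroid(feature, centroids) for feature in row]
                   for row in feature_map]
    histogram = [0] * len(centroids)
    for row in assignments:
        for word in row:
            histogram[word] += 1
    return assignments, histogram


# Toy stand-in: a 2x3 map of 2-D features and 3 centroids
# (the slide uses a 42x32 map and 25K centroids).
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature_map = [
    [[0.1, 0.0], [0.9, 0.1], [0.0, 0.8]],
    [[1.1, 0.0], [0.0, 0.1], [0.2, 1.0]],
]
assignments, histogram = bag_of_words(feature_map, centroids)
```

Keeping the assignment map, and not only the histogram, preserves where each visual word occurs, which is what allows a histogram to be computed over a sub-window of the image as well as over the whole image.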

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Pyramid of three spatial scales

The same parameters are shared by the three convnets:

θ_i = θ_0 (filter weights H_l and biases b_l)

Non-linearity: tanh
Pooling: max
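To make the parameter sharing concrete, the sketch below applies one and the same filter and bias to three scales of an image, followed by tanh and max pooling. It is a simplified illustration under several assumptions: the pyramid is built by plain subsampling, there is a single layer with a single 3x3 filter, and all values and function names are hypothetical.

```python
import math


def subsample(image, factor):
    """Keep every `factor`-th pixel (a crude stand-in for a proper pyramid)."""
    return [row[::factor] for row in image[::factor]]


def conv_tanh(image, kernel, bias):
    """Valid 2-D correlation with one kernel, followed by tanh."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[math.tanh(bias + sum(kernel[i][j] * image[y + i][x + j]
                                  for i in range(k) for j in range(k)))
             for x in range(out_w)]
            for y in range(out_h)]


def max_pool(feature, size=2):
    """Non-overlapping max pooling."""
    return [[max(feature[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(feature[0]) - size + 1, size)]
            for y in range(0, len(feature) - size + 1, size)]


# One shared parameter set theta_0 = (kernel, bias), reused at every scale.
theta_0 = ([[0.0, 0.1, 0.0], [0.1, 0.2, 0.1], [0.0, 0.1, 0.0]], 0.0)

image = [[float((x + y) % 5) for x in range(20)] for y in range(20)]
pyramid = [subsample(image, factor) for factor in (1, 2, 4)]
features = [max_pool(conv_tanh(level, *theta_0)) for level in pyramid]
shapes = [(len(f), len(f[0])) for f in features]
```

Since `theta_0` is the only parameter set, the number of learned weights does not grow with the number of scales; only the size of the output maps differs.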

Upsampling and concatenation
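A small sketch of this step, assuming single-channel feature maps at three scales whose sizes differ by factors of 2 and 4: the coarser maps are upsampled to the finest resolution and the values are stacked per pixel. Nearest-neighbour upsampling and the function names are choices made for the example, not details taken from the slides.

```python
def upsample_nearest(feature, factor):
    """Repeat every value `factor` times along both axes."""
    return [[value for value in row for _ in range(factor)]
            for row in feature for _ in range(factor)]


def concatenate_scales(feature_maps, factors):
    """Upsample each single-channel map to a common size and stack per pixel."""
    upsampled = [upsample_nearest(f, k) for f, k in zip(feature_maps, factors)]
    height, width = len(upsampled[0]), len(upsampled[0][0])
    return [[[u[y][x] for u in upsampled] for x in range(width)]
            for y in range(height)]


# Toy maps at three scales: 4x4, 2x2 and 1x1.
fine = [[float(4 * y + x) for x in range(4)] for y in range(4)]
medium = [[10.0, 20.0], [30.0, 40.0]]
coarse = [[100.0]]
stacked = concatenate_scales([fine, medium, coarse], factors=[1, 2, 4])
```

After this step every pixel carries one feature vector that mixes fine detail with wider context from the coarser scales.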

Pixel-wise soft-max classifier
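A pixel-wise softmax classifier can be sketched as a softmax applied independently to the class scores of each pixel. The scores below are made up, and the linear layer that would produce them from the concatenated features is left out; `classify_pixels` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixels(score_map):
    """Apply softmax independently at every pixel and keep the most likely class."""
    probabilities = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probabilities]
    return probabilities, labels


# Toy 2x2 image with 3 class scores per pixel.
score_map = [
    [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]],
    [[0.5, 2.5, 0.5], [1.0, 1.0, 1.0]],
]
probabilities, labels = classify_pixels(score_map)
```

Each pixel is decided on its own, with no term that ties neighbouring labels together.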

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
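The selection rule stated above can be sketched on a small tree: for every leaf, walk up to the root and keep the node with the lowest cost S, then give the pixel the class of that node. The tree, the costs and the class names below are hypothetical and do not reproduce the figure on the slide.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from `leaf` to the root."""
    best = node = leaf
    while parent[node] is not None:
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


def optimal_cover(leaves, parent, cost):
    """Map every leaf (pixel) to its minimal-cost ancestor (or itself)."""
    return {leaf: optimal_component(leaf, parent, cost) for leaf in leaves}


# Hypothetical segmentation tree: leaves p1..p4, internal nodes A and B, root R.
parent = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "A": "R", "B": "R", "R": None}
cost = {"p1": 0.1, "p2": 0.2, "p3": 0.9, "p4": 0.7, "A": 0.4, "B": 0.3, "R": 0.6}
label = {"p1": "sky", "p2": "tree", "p3": "car", "p4": "road",
         "A": "sky", "B": "road", "R": "road"}

cover = optimal_cover(["p1", "p2", "p3", "p4"], parent, cost)
pixel_labels = {leaf: label[component] for leaf, component in cover.items()}
```

In this toy case p3 and p4 are both explained best by their parent B, so p3 takes the class of B even though its own prediction differs.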

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions

Vector A

Feature vector 1: BBOX CNN on the bounding box
Feature vector 2: BBOX CNN on the region foreground (background masked out with the mean image)
Combined feature vector: [1 2]

The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
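The masking and concatenation on this slide can be sketched as below. The CNN is replaced by a trivial stand-in, `toy_cnn_features`, which is purely hypothetical; only the data flow is meant to be illustrative: features from the full box, features from the box with its background replaced by the mean image, and their concatenation.

```python
def mask_background(crop, mask, mean_image):
    """Replace background pixels (mask == 0) with the corresponding mean-image pixels."""
    return [[pixel if keep else mean_pixel
             for pixel, keep, mean_pixel in zip(crop_row, mask_row, mean_row)]
            for crop_row, mask_row, mean_row in zip(crop, mask, mean_image)]


def toy_cnn_features(image):
    """Hypothetical stand-in for a CNN: returns (mean, max) of the pixel values."""
    values = [pixel for row in image for pixel in row]
    return [sum(values) / len(values), max(values)]


# Toy grayscale bounding-box crop, region mask and mean image.
crop = [[9.0, 8.0], [1.0, 2.0]]
mask = [[1, 1], [0, 0]]            # 1 = region foreground, 0 = background
mean_image = [[5.0, 5.0], [5.0, 5.0]]

masked_crop = mask_background(crop, mask, mean_image)
box_features = toy_cnn_features(crop)            # feature vector 1
region_features = toy_cnn_features(masked_crop)  # feature vector 2
combined = box_features + region_features        # [1 2]
```

When the mean image is subtracted before the network, as is common, the masked background becomes zero, so it contributes nothing to the first layer.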

Vector B

Feature vector 1: BBOX CNN on the bounding box
Feature vector 2: REGION CNN on the region foreground
Combined feature vector: [1 2]

Training: 2 networks trained in isolation

Testing: results are combined

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

penultimate fully connected layer

SVM

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
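For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. The sketch below computes it on toy label maps; the function names and the simple averaging over categories are assumptions for the example.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category` in both maps."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred, in_gt = pred == category, gt == category
            intersection += int(in_pred and in_gt)
            union += int(in_pred or in_gt)
    return intersection / union if union else float("nan")


def mean_pixel_iu(predicted, ground_truth, categories):
    """Average the per-category IU over the given categories."""
    return sum(pixel_iu(predicted, ground_truth, c) for c in categories) / len(categories)


# Toy 2x3 label maps with categories 0 (background) and 1 (object).
predicted = [[0, 1, 1], [0, 0, 1]]
ground_truth = [[0, 0, 1], [0, 1, 1]]
iu_object = pixel_iu(predicted, ground_truth, 1)
mean_iu = mean_pixel_iu(predicted, ground_truth, [0, 1])
```

Here two of the four pixels involved in the object category agree, so `iu_object` is 0.5, and the background category happens to give the same value.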

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Faces Recognition FaceNet

Architecture 1 (NN1): ZF

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." In Computer Vision–ECCV 2014, pp. 818-833. Springer International Publishing, 2014. (Slides by Xavier Giró-i-Nieto)

Architecture 2 (NN2): GoogLeNet

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going Deeper With Convolutions." CVPR 2015. (Slides by Elisa Sayrol)

Test

LFW: 99.63% (new record)
YouTube Faces DB: 95.12%

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

E. Mohedano, A. Salvador, K. McGuinness, X. Giró-i-Nieto, N. O'Connor, and F. Marqués. “Bags of Local Convolutional Features for Scalable Instance Search”. ICMR 2016.

Image Database

Visual Query

“A dog”

Expected outcome

Page 78: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 78

Faces Recognition FaceNetArchitecture 2 (NN2) GoogLeNet

Szegedy Christian Wei Liu Yangqing Jia Pierre Sermanet Scott Reed Dragomir Anguelov Dumitru Erhan Vincent

Vanhoucke and Andrew Rabinovich Going Deeper With Convolutions CVPR 2015 (Slides by Elisa Sayrol)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 79: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 79

Faces Recognition FaceNet

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 80

Faces Recognition FaceNet Test

LBW 9963 (new record)YouTubeFaces DB 9512

Faces Recognition FaceNet Software

Software implementation: OpenFace

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016.

Objects Recognition Retrieval

Image Database

Visual Query: "A dog"

Expected outcome

Objects Recognition Retrieval

Image Database

Visual Query: "This dog"

Expected outcome

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

N-dimensional feature space

v_1 = (v_11, …, v_1n)

v_k = (v_k1, …, v_kn)

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE (table with columns: word, image IDs)
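The inverted file named on this slide can be illustrated with a minimal sketch. The toy database, the function names and the vote-counting score are assumptions made for illustration only; they are not taken from the lecture or from any specific retrieval system.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word ID to the sorted list of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return {word: sorted(ids) for word, ids in index.items()}


def rank_candidates(index, query_words):
    """Count, per image, how many distinct query words it shares with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    # Highest vote count first; ties broken by image ID for determinism.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy database: image ID -> visual words of its local features.
toy_db = {1: [1, 2, 2], 10: [3, 4], 12: [1, 3]}
inverted = build_inverted_file(toy_db)
ranking = rank_candidates(inverted, query_words=[1, 3])
```

Because the bag-of-words vectors are highly sparse, only the images listed under the query's words are visited, which is what makes this structure attractive for large databases.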

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max-pooled conv features as global representation
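As a rough sketch of the idea, sum or max pooling collapses each channel of a convolutional feature map into a single value, giving one global vector per image. The nested-list layout (channels x height x width), the function names and the toy values are assumptions; the cited papers add further steps (weighting, region pooling, whitening) that are not reproduced here.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Collapse a C x H x W feature map (nested lists) into a C-dimensional vector."""
    reduce_fn = sum if mode == "sum" else max
    return [reduce_fn(v for row in channel for v in row) for channel in feature_map]


def l2_normalize(vec):
    """Scale the vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Hypothetical activation map with 2 channels of size 2 x 2.
toy_map = [[[1.0, 0.0], [2.0, 1.0]],
           [[0.0, 3.0], [0.0, 0.0]]]
sum_desc = pool_conv_features(toy_map, "sum")
max_desc = pool_conv_features(toy_map, "max")
global_desc = l2_normalize(sum_desc)
```

After normalization, two images can be compared with a dot product of their global descriptors.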

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation
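A minimal sketch of VLAD encoding follows, assuming each spatial location of the conv feature map is treated as one local feature and that a codebook of centroids is already available. The function name, the toy centroids and the single global L2 normalization are illustrative choices, not the exact pipeline of the cited paper.

```python
import math


def vlad_encode(local_features, centroids):
    """Accumulate, per centroid, the residuals of the features assigned to it."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for feat in local_features:
        # Hard-assign the feature to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(feat, centroids[k])),
        )
        for d in range(dim):
            residuals[nearest][d] += feat[d] - centroids[nearest][d]
    # Concatenate the per-centroid residual sums and L2-normalize the result.
    flat = [v for res in residuals for v in res]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical codebook of 2 centroids and 3 local features in 2-D.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
vlad_code = vlad_encode(toy_features, toy_centroids)
```

The output has (number of centroids) x (feature dimension) entries, so it is dense and much shorter than a bag-of-words vector built on a large vocabulary.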

Objects Recognition Retrieval

Resolution: 336x256

conv5_1 from VGG16 [1]: 42x32

25K centroids, 25K-D vector
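The step from a grid of conv features to a bag-of-words vector can be sketched as below, with a tiny codebook standing in for the 25K centroids. The function names, the 2 x 3 grid and the 3 centroids are hypothetical; in the setting of the slide the grid would be 42x32 and the resulting vector 25K-dimensional and sparse.

```python
from collections import Counter


def assignment_map(feature_grid, centroids):
    """Replace each local feature of an H x W grid by the ID of its nearest centroid."""
    def nearest(feat):
        return min(
            range(len(centroids)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(feat, centroids[k])),
        )
    return [[nearest(feat) for feat in row] for row in feature_grid]


def bow_vector(assignments):
    """Sparse bag of words: visual word ID -> number of grid locations assigned to it."""
    return Counter(word for row in assignments for word in row)


# Toy stand-ins: a 2 x 3 grid of 2-D features and a codebook of 3 centroids.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_grid = [[[0.1, 0.0], [0.9, 1.2], [5.0, 4.0]],
            [[0.0, 0.2], [4.8, 5.1], [1.1, 0.9]]]
words = assignment_map(toy_grid, toy_codebook)
bow = bow_vector(words)
```

Keeping the assignment map, rather than only the histogram, preserves where each visual word occurs in the image.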

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation
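The multiscale scheme (one network with shared parameters applied at three scales, followed by upsampling and concatenation) can be sketched as follows. The per-pixel `shared_convnet` is a deliberately trivial stand-in for the real convnet, plain subsampling replaces a proper image pyramid, and all names and the toy image are assumptions made for illustration.

```python
import math


def downsample(image, factor):
    """Keep every `factor`-th pixel in both directions (stand-in for an image pyramid)."""
    return [row[::factor] for row in image[::factor]]


def upsample(feature_map, factor):
    """Nearest-neighbour upsampling of an H x W grid of feature vectors."""
    return [[feat for feat in row for _ in range(factor)]
            for row in feature_map for _ in range(factor)]


def shared_convnet(image, theta):
    """Hypothetical stand-in for the convnet: per-pixel tanh(weight * pixel + bias)."""
    weight, bias = theta
    return [[[math.tanh(weight * px + bias)] for px in row] for row in image]


def multiscale_features(image, theta, factors=(1, 2, 4)):
    """Run the same parameters theta at every scale, upsample, and concatenate."""
    maps = [upsample(shared_convnet(downsample(image, f), theta), f) for f in factors]
    height, width = len(image), len(image[0])
    # Concatenate, for every pixel, the feature vectors coming from each scale.
    return [[sum((m[y][x] for m in maps), []) for x in range(width)]
            for y in range(height)]


toy_image = [[0.1 * (4 * y + x) for x in range(4)] for y in range(4)]
features = multiscale_features(toy_image, theta=(1.0, 0.0))
```

Each pixel ends up with one feature per scale, so coarse scales contribute context while the finest scale keeps local detail.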

Objects Segmentation Farabet

Pixel-wise soft-max classifier
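A pixel-wise soft-max classifier turns the class scores of every pixel into a probability distribution and, for labeling, picks the most probable class. The sketch below assumes the scores are already available as an H x W grid of per-class lists; the function names and toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_labels(score_map):
    """Return per-pixel class probabilities and the arg-max label of each pixel."""
    probs = [[softmax(px) for px in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    return probs, labels


# Hypothetical 1 x 2 image with scores for 3 classes at each pixel.
toy_scores = [[[2.0, 1.0, 0.1], [0.0, 0.0, 5.0]]]
toy_probs, toy_labels = pixelwise_labels(toy_scores)
```

Since every pixel is classified independently, neighbouring pixels can receive inconsistent labels.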

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels
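One simple way to use superpixels for spatial consistency is to pool the per-pixel class distributions inside each superpixel and give all its pixels the winning class. The sketch below assumes that strategy; the function name, the grid layout and the toy data are illustrative, and the superpixel IDs are assumed to come from an external over-segmentation.

```python
def smooth_with_superpixels(probs, superpixels):
    """probs: H x W grid of class distributions; superpixels: H x W grid of segment IDs."""
    sums = {}
    for prob_row, sp_row in zip(probs, superpixels):
        for p, sp in zip(prob_row, sp_row):
            acc = sums.setdefault(sp, [0.0] * len(p))
            for c, v in enumerate(p):
                acc[c] += v
    # The arg-max of the summed distribution equals the arg-max of the average.
    label_of = {sp: max(range(len(acc)), key=acc.__getitem__) for sp, acc in sums.items()}
    return [[label_of[sp] for sp in sp_row] for sp_row in superpixels]


# Hypothetical 1 x 3 image, 2 classes, two superpixels (IDs 0 and 1).
toy_pixel_probs = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
toy_superpixels = [[0, 0, 1]]
smoothed = smooth_with_superpixels(toy_pixel_probs, toy_superpixels)
```

In the toy example the middle pixel would be labeled 1 on its own, but it takes label 0 because its superpixel as a whole favours class 0.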

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

BPT [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
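The selection rule above can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the component with the lowest cost. The tree, the costs and the component names other than C2 and C5 are hypothetical values chosen so that the toy example reproduces the statement that C2 takes the class of C5.

```python
def optimal_cover(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C5 merges leaves C2 and C3; the root C7 merges C1 and C5.
toy_parent = {"C1": "C7", "C2": "C5", "C3": "C5", "C5": "C7"}
toy_cost = {"C1": 0.2, "C2": 0.6, "C3": 0.5, "C5": 0.1, "C7": 0.9}
cover = optimal_cover(toy_parent, toy_cost, leaves=["C1", "C2", "C3"])
```

Each leaf is then labeled with the class predicted for its selected component, so the observation level can differ from pixel to pixel.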

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: [1 2], the concatenation of feature vector 1 and feature vector 2, both extracted with the BBOX CNN

The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal

For the region features, the background is masked out with the mean image
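Masking the background with the mean image can be sketched as below: pixels outside the region are replaced by the mean value, so they become zero once the network's usual mean subtraction is applied. The single-channel layout, the function name and the toy values are assumptions for illustration.

```python
def mask_background(crop, region_mask, mean_value):
    """Keep pixels inside the region; replace all other pixels by the mean value."""
    return [[px if inside else mean_value for px, inside in zip(px_row, mask_row)]
            for px_row, mask_row in zip(crop, region_mask)]


# Hypothetical 2 x 3 single-channel crop and its binary region mask.
toy_crop = [[10.0, 200.0, 30.0],
            [40.0, 50.0, 220.0]]
toy_mask = [[False, True, False],
            [False, False, True]]
masked = mask_background(toy_crop, toy_mask, mean_value=120.0)
# After mean subtraction the masked-out background is exactly zero.
centered = [[px - 120.0 for px in row] for row in masked]
```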

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

Vector B: [1 2], the concatenation of feature vector 1 (BBOX CNN) and feature vector 2 (REGION CNN)

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

vector C

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

penultimate fully connected layer

SVM

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
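For one category, the pixel IU (Jaccard index) is the number of pixels labeled with that category in both the prediction and the ground truth, divided by the number labeled with it in either. A minimal sketch, with hypothetical label maps and function name:

```python
def pixel_iu(pred, truth, category):
    """Intersection over union of the pixels labeled `category` in pred and truth."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = p == category, t == category
            inter += in_pred and in_truth
            union += in_pred or in_truth
    # Undefined when the category appears in neither map.
    return inter / union if union else float("nan")


# Hypothetical 2 x 3 predicted and ground-truth label maps.
toy_pred = [[1, 1, 0],
            [0, 1, 0]]
toy_truth = [[1, 0, 0],
             [0, 1, 1]]
iu_class1 = pixel_iu(toy_pred, toy_truth, category=1)
iu_class0 = pixel_iu(toy_pred, toy_truth, category=0)
```

Averaging this value over categories gives a single score for a labeling.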

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 81: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 81

Faces Recognition FaceNet SoftwareSoftware implementation OpenFace

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 82

Faces Recognition VGG Face

Parkhi Omkar M Andrea Vedaldi and Andrew Zisserman Deep face recognition Proceedings of the British Machine Vision 1 no 3 (2015) 6 [software]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

E Mohedano Salvador A McGuinness K Giroacute-i-Nieto X OConnor N and Marqueacutes F ldquoBags of Local Convolutional Features for Scalable Instance Searchrdquo ICMR 2016

83

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 84

Objects Recognition Retrieval

Image Database

Visual Query

ldquoA dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 85

Objects Recognition Retrieval

Image Database

Visual Query

ldquoThis dogrdquo

Expected outcome

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 86

Instance Retrieval(Instance Object Building Person Placehellip)

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 87

Objects Recognition Retrieval

v1 = (v11 hellip v1n)

vk = (vk1 hellip vkn)

INVERTED FILE

word Image ID1 1 12 2 1 30 1023 10 124 23 6 10

Local hand-crafted features(eg SIFT)

Bag of Visual WordsN-Dimensional

feature space High-dimensionalHighly sparse

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 94

Objects Recognition RetrievalQuery Representation

Global Search(GS)

Local Search(LS)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 95

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

96

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation

97

Slide credit Eduard Fontdevila

Semantic segmentation assign a category label to all pixels in an image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

98

Farabet Clement Camille Couprie Laurent Najman and Yann LeCun Learning hierarchical features for scene labeling TPAMI 2013

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i=theta_0=filters weights (H_l) and biases b_l)

Non-linear tanhPooling max

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Objects Segmentation SDS

Slide credit Eduard Fontdevila

Results on pixel IU (intersection over union, i.e. the Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
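To make the pixel IU measure concrete, here is a minimal sketch that computes the Jaccard index of one category between a predicted labeling and a ground-truth labeling. The function name `pixel_iu` and the toy 2x3 label maps are assumptions for illustration only.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union (Jaccard index) of one category, in pixels."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred = p == category
            in_gt = g == category
            if in_pred and in_gt:
                intersection += 1
            if in_pred or in_gt:
                union += 1
    # By convention here, a category absent from both maps scores 0.
    return intersection / union if union else 0.0


# Hypothetical label maps.
predicted = [["dog", "dog", "bg"],
             ["dog", "bg", "bg"]]
ground_truth = [["dog", "dog", "dog"],
                ["bg", "bg", "bg"]]
iu_dog = pixel_iu(predicted, ground_truth, "dog")  # 2 shared pixels / 4 in the union
```

Averaging this value over all categories gives a mean IU, which is a common way to summarize semantic segmentation quality.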


One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Faces Recognition VGG Face

Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference 1, no. 3 (2015): 6. [software]

Objects Recognition Retrieval

Mohedano, E., Salvador, A., McGuinness, K., Giró-i-Nieto, X., O'Connor, N., and Marqués, F. "Bags of Local Convolutional Features for Scalable Instance Search". ICMR 2016

Objects Recognition Retrieval

Figure: a visual query ("A dog") is matched against an image database to produce the expected outcome.

Objects Recognition Retrieval

Figure: a visual query ("This dog") is matched against an image database to produce the expected outcome.

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT): v1 = (v11, …, v1n), …, vk = (vk1, …, vkn) in an N-dimensional feature space

Bag of Visual Words: high-dimensional, highly sparse

INVERTED FILE: table with the columns "word" and "Image ID"
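Because bag-of-visual-words vectors are high-dimensional and highly sparse, an inverted file only needs to store, for each visual word, the IDs of the images that contain it. The sketch below is one possible way to build and query such a structure. The names `build_inverted_file` and `candidates`, the vote-counting score, and the toy data are assumptions made for this example, not the method of any cited paper.

```python
from collections import defaultdict


def build_inverted_file(bags):
    """bags: dict image_id -> {visual_word: count}, i.e. sparse BoW vectors."""
    inverted = defaultdict(list)
    for image_id in sorted(bags):
        for word in bags[image_id]:
            inverted[word].append(image_id)
    return dict(inverted)


def candidates(inverted, query_words):
    """Rank images by the number of visual words they share with the query."""
    votes = defaultdict(int)
    for word in query_words:
        for image_id in inverted.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical database of three images.
bags = {
    1: {3: 2, 7: 1},
    2: {7: 4},
    3: {3: 1, 9: 5},
}
inverted = build_inverted_file(bags)
ranking = candidates(inverted, query_words=[3, 7])
```

Only the posting lists of the query's words are visited, so images that share no word with the query are never touched.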

Objects Recognition Retrieval

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Convolutional Neural Networks

Objects Recognition Retrieval

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Convolutional Neural Networks: FC layers as global feature representation

Objects Recognition Retrieval

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2015.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.

Convolutional Neural Networks

sum/max pooled conv features as global representation
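The idea of sum or max pooling convolutional features into a global representation can be sketched in a few lines: each channel of the last convolutional layer is reduced to a single number, so the image is described by one value per channel. The function name `pool_conv_features` and the two-channel toy activations are assumptions for illustration. The cited papers add further steps (such as weighting, regional pooling or normalization) that are not shown here.

```python
def pool_conv_features(feature_maps, mode="sum"):
    """Reduce each channel (an H x W grid of activations) to one value.

    feature_maps: list of channels, each a 2D list of activations.
    mode: "sum" or "max".
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    aggregate = sum if mode == "sum" else max
    return [aggregate(v for row in channel for v in row) for channel in feature_maps]


# Hypothetical activations: 2 channels of size 2x2.
feature_maps = [
    [[0.0, 1.0], [2.0, 3.0]],  # channel 0
    [[4.0, 0.0], [0.0, 0.5]],  # channel 1
]
sum_descriptor = pool_conv_features(feature_maps, mode="sum")
max_descriptor = pool_conv_features(feature_maps, mode="max")
```

The descriptor length equals the number of channels, regardless of the spatial size of the input image.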

Objects Recognition Retrieval

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.

Convolutional Neural Networks

conv features encoded with VLAD as global representation
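A minimal sketch of VLAD encoding, assuming a codebook of centroids is already available: each local descriptor is assigned to its nearest centroid, the residuals (descriptor minus centroid) are accumulated per centroid, and the concatenated residuals are L2-normalized. The function name `vlad` and the toy data are assumptions for this example. Practical pipelines often add steps such as power normalization or PCA that are omitted here.

```python
import math


def vlad(descriptors, centroids):
    """Encode local descriptors as concatenated, L2-normalized residual sums."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Nearest centroid by squared Euclidean distance.
        k = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(d, centroids[j])),
        )
        for i in range(dim):
            residuals[k][i] += d[i] - centroids[k][i]
    flat = [v for r in residuals for v in r]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical codebook with 2 centroids and 3 local descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [2.0, 0.0], [10.0, 14.0]]
encoding = vlad(descriptors, centroids)
```

The encoding has length (number of centroids) x (descriptor dimension), here 2 x 2 = 4.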


Objects Recognition Retrieval

Image resolution: (336x256)

Local features: conv5_1 from VGG16 [1], giving feature maps of (42x32)

Each local feature is assigned to one of 25K centroids, giving a 25K-D vector per image
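The step from a grid of local convolutional features to a bag-of-words vector can be sketched as follows: every spatial location is assigned to its nearest centroid, which gives an assignment map, and counting the assignments gives a histogram with one bin per centroid. The helper names `nearest` and `bag_of_words`, the three toy centroids (standing in for the 25K of the slide) and the 2x2 feature grid are assumptions for illustration.

```python
def nearest(vector, centroids):
    """Index of the centroid closest to vector (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(vector, centroids[j])),
    )


def bag_of_words(local_features, centroids):
    """local_features: H x W grid of feature vectors.

    Returns the H x W assignment map and the visual-word histogram.
    """
    assignment_map = [[nearest(f, centroids) for f in row] for row in local_features]
    histogram = [0] * len(centroids)
    for row in assignment_map:
        for word in row:
            histogram[word] += 1
    return assignment_map, histogram


# Hypothetical codebook and 2x2 grid of 2-D local features.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
local_features = [
    [[0.1, 0.0], [0.9, 1.2]],
    [[4.0, 6.0], [0.2, 0.1]],
]
assignment_map, histogram = bag_of_words(local_features, centroids)
```

Keeping the assignment map alongside the histogram makes it possible to build a histogram from only part of the image, for example the area inside a query bounding box.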

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)


One lecture organized in four parts. Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters in the three convnets

theta_i = theta_0: filter weights (H_l) and biases (b_l)

Non-linearity: tanh. Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation
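As a small sketch of upsampling and concatenation, the code below brings a coarse feature map to the resolution of the finest one and then concatenates the feature vectors of all scales at each pixel. Nearest-neighbour upsampling, the function names `upsample_nearest` and `concatenate_scales`, and the toy maps are assumptions for this example. The slide does not state which interpolation is used.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every cell of an H x W grid factor times along both axes."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def concatenate_scales(maps):
    """maps: list of H x W grids whose cells are lists of features.

    Returns one H x W grid where each cell is the concatenation over scales.
    """
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum((m[y][x] for m in maps), []) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical maps: a fine 2x2 map and a coarse 1x1 map, 1 feature per cell.
fine = [[[1], [2]],
        [[3], [4]]]
coarse = [[[9]]]
upsampled = upsample_nearest(coarse, factor=2)
dense = concatenate_scales([fine, upsampled])
```

Each pixel ends up with a descriptor that mixes fine detail with context from the coarser scales.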

Objects Segmentation Farabet

Pixel-wise soft-max classifier
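A pixel-wise soft-max classifier turns the class scores at each pixel into a probability distribution independently of the neighbouring pixels. The sketch below assumes the scores are already available as an H x W grid of lists. The names `softmax` and `classify_pixels` and the toy scores are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixels(score_map):
    """score_map: H x W grid of per-class scores.

    Returns per-pixel class probabilities and the most likely class index.
    """
    probabilities = [[softmax(scores) for scores in row] for row in score_map]
    labels = [
        [max(range(len(p)), key=p.__getitem__) for p in row]
        for row in probabilities
    ]
    return probabilities, labels


# Hypothetical 2x2 image with 3 classes.
score_map = [
    [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]],
]
probabilities, labels = classify_pixels(score_map)
```

Since every pixel is decided on its own, neighbouring pixels can receive different labels even inside the same object.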

Objects Segmentation Farabet

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 1: Superpixels
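One simple way to use superpixels for spatial consistency is to average the per-pixel class distributions inside each superpixel and give all of its pixels the winning class. The sketch below illustrates that idea. The function name `label_superpixels` and the toy data are assumptions, and the exact aggregation used in the paper may differ in its details.

```python
from collections import defaultdict


def label_superpixels(pixel_probs, superpixel_ids):
    """Assign to every pixel the class with the highest mean probability
    within its superpixel.

    pixel_probs: H x W grid of per-class probability lists.
    superpixel_ids: H x W grid of superpixel identifiers.
    """
    sums = {}
    counts = defaultdict(int)
    for prob_row, id_row in zip(pixel_probs, superpixel_ids):
        for probs, sp in zip(prob_row, id_row):
            if sp not in sums:
                sums[sp] = [0.0] * len(probs)
            for c, p in enumerate(probs):
                sums[sp][c] += p
            counts[sp] += 1
    sp_label = {}
    for sp, total in sums.items():
        mean = [t / counts[sp] for t in total]
        sp_label[sp] = max(range(len(mean)), key=mean.__getitem__)
    return [[sp_label[sp] for sp in id_row] for id_row in superpixel_ids]


# Hypothetical 2x3 image, 2 classes, 2 superpixels.
pixel_probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]],
]
superpixel_ids = [
    [0, 0, 1],
    [0, 0, 1],
]
smoothed = label_superpixels(pixel_probs, superpixel_ids)
```

In this toy case the pixel in the first row, second column would be labelled class 1 on its own, but it takes class 0 from the rest of its superpixel.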

Objects Segmentation Farabet

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: Observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: Observation level

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S

In the slide's example, C2 will be labelled with the class of C5
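The selection rule for the optimal cover can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost S. The tree, the node names, the cost values and the function name `optimal_cover` below are hypothetical. How the costs themselves are computed is not covered by this sketch.

```python
def optimal_cover(leaves, parent, cost):
    """For each leaf, pick the node with minimal cost on its path to the root.

    parent: dict node -> parent node (the root maps to None).
    cost: dict node -> cost S of using that node as the observation level.
    """
    best = {}
    for leaf in leaves:
        node, choice = leaf, leaf
        while node is not None:
            if cost[node] < cost[choice]:
                choice = node
            node = parent[node]
        best[leaf] = choice
    return best


# Hypothetical tree: leaves p1 and p2 under n1, leaf p3 under n2, root above both.
parent = {"p1": "n1", "p2": "n1", "p3": "n2", "n1": "root", "n2": "root", "root": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "n1": 0.3, "n2": 0.6, "root": 0.5}
cover = optimal_cover(["p1", "p2", "p3"], parent, cost)
```

Each pixel then takes the class predicted for its selected component, so different pixels can be explained at different levels of the tree.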




Page 85: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016)

Objects Recognition Retrieval

[Figure: a visual query (“This dog”) is matched against an image database to obtain the expected outcome]

Objects Recognition Retrieval

Instance Retrieval (Instance: Object, Building, Person, Place…)

Objects Recognition Retrieval

Local hand-crafted features (e.g. SIFT)

Bag of Visual Words: N-dimensional feature space, high-dimensional and highly sparse

v1 = (v11, …, v1n)

vk = (vk1, …, vkn)

INVERTED FILE

word    Image ID
1       1, 12
2       1, 30, 102
3       10, 12
4       23, 6, 10
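The inverted file above can be sketched in a few lines. This is a minimal illustration, assuming each image has already been quantized into a list of visual-word IDs; the function names and the toy IDs are hypothetical and do not come from the slides.

```python
from collections import defaultdict


def build_inverted_file(image_words):
    """Map each visual word ID to the set of image IDs that contain it."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            inverted[word].add(image_id)
    return inverted


def query(inverted, query_words):
    """Rank images by the number of distinct query words they share."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted.get(word, ()):
            votes[image_id] += 1
    # Sort by descending vote count, then by image ID for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy database (hypothetical IDs): image ID -> visual words found in it.
database = {1: [1, 2], 10: [2, 3], 12: [1, 3], 23: [3]}
index = build_inverted_file(database)
ranking = query(index, [1, 3])
```

Because the Bag of Visual Words vectors are highly sparse, a query only visits the images listed under its own words instead of scanning the whole database.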

Objects Recognition Retrieval

Convolutional Neural Networks

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Objects Recognition Retrieval

Convolutional Neural Networks: FC layers as global feature representation

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.

Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.

Objects Recognition Retrieval

Convolutional Neural Networks: sum/max pooled conv features as global representation

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.

Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional Weighting for Aggregated Deep Convolutional Features. arXiv preprint arXiv:1512.04065.
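Sum or max pooling collapses the spatial grid of a convolutional layer into a single vector with one value per channel. The sketch below is only an illustration of that idea, assuming the feature map is given as a nested list of shape H x W x C; the function name and the toy activations are hypothetical, and the cited methods add further steps (for example weighting or whitening) that are omitted here.

```python
import math


def pool_conv_features(feature_map, mode="sum"):
    """Collapse an H x W x C feature map into one L2-normalized C-D vector."""
    channels = len(feature_map[0][0])
    reduce_fn = sum if mode == "sum" else max
    pooled = [
        reduce_fn(cell[c] for row in feature_map for cell in row)
        for c in range(channels)
    ]
    # Guard against an all-zero vector before normalizing.
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


# Toy 2 x 2 spatial grid with 2 channels (hypothetical activations).
fmap = [[[1.0, 0.0], [2.0, 0.0]],
        [[0.0, 4.0], [0.0, 0.0]]]
sum_vec = pool_conv_features(fmap, "sum")
max_vec = pool_conv_features(fmap, "max")
```

The resulting vector has the same length for any input resolution, which is what makes it usable as a global image representation.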

Objects Recognition Retrieval

Convolutional Neural Networks: conv features encoded with VLAD as global representation

Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
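VLAD assigns each local feature to its nearest centroid and accumulates the residuals per centroid. A minimal sketch, assuming the centroids were learned beforehand (for example with k-means) and that each spatial position of the conv layer is treated as one local feature; the names and toy values are hypothetical.

```python
import math


def vlad(descriptors, centroids):
    """Encode local descriptors as concatenated residuals to their nearest centroid."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared Euclidean distance).
        k = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centroids[i])))
        for j in range(dim):
            residuals[k][j] += d[j] - centroids[k][j]
    flat = [v for r in residuals for v in r]
    # L2-normalize the concatenated residuals.
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


# Toy example: three 2-D local features and two centroids (hypothetical values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
features = [[1.0, 0.0], [0.0, 2.0], [10.0, 13.0]]
code = vlad(features, centroids)
```

The output length is the number of centroids times the feature dimension, regardless of how many local features the image produces.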

Objects Recognition Retrieval

Resolution: (336x256)

conv5_1 from VGG16 [1]

(42x32)

25K centroids, 25K-D vector

Objects Recognition Retrieval

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image.

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. “Learning hierarchical features for scene labeling.” TPAMI 2013.

Objects Segmentation Farabet

Pyramid of three spatial scales

Objects Segmentation Farabet

The same parameters are used in the three convnets: $\theta_i = \theta_0$, i.e. the filter weights ($H_l$) and biases ($b_l$).

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

Upsampling and concatenation

Objects Segmentation Farabet

Pixel-wise soft-max classifier
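A pixel-wise soft-max classifier turns the class scores at each pixel into a probability distribution, independently of every other pixel. The following is a minimal sketch, assuming the scores arrive as a nested list of shape H x W x classes; the function names and toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_pixels(score_map):
    """Apply softmax independently at every pixel and keep the most likely class."""
    probs = [[softmax(pixel) for pixel in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]
    return probs, labels


# Toy 1 x 2 image with 3 classes (hypothetical scores).
scores = [[[2.0, 1.0, 0.0], [0.0, 0.0, 5.0]]]
probs, labels = label_pixels(scores)
```

Since each pixel is classified on its own, nothing in this step forces neighbouring pixels to agree on a label.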

Objects Segmentation Farabet

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Objects Segmentation Farabet

Solution 1: Superpixels

Prediction with a 2-layer network
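One way to use superpixels for spatial consistency is to pool the per-pixel class distributions inside each superpixel and give all of its pixels the winning class. The sketch below assumes that aggregation rule (it is an assumption, not taken from the slide), and it assumes the superpixel map is already computed; all names and values are hypothetical.

```python
def smooth_with_superpixels(probs, superpixels):
    """Give every pixel the class with the highest summed probability in its superpixel."""
    sums = {}
    for prob_row, sp_row in zip(probs, superpixels):
        for p, sp in zip(prob_row, sp_row):
            acc = sums.setdefault(sp, [0.0] * len(p))
            for c, value in enumerate(p):
                acc[c] += value
    # The argmax of the summed distribution equals the argmax of the mean.
    best = {sp: max(range(len(acc)), key=acc.__getitem__) for sp, acc in sums.items()}
    return [[best[sp] for sp in sp_row] for sp_row in superpixels]


# Toy 1 x 3 image, 2 classes; the first two pixels share superpixel 0 (hypothetical).
pixel_probs = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
superpixel_ids = [[0, 0, 1]]
smoothed = smooth_with_superpixels(pixel_probs, superpixel_ids)
```

In this toy case the middle pixel would be labelled 1 on its own, but it takes label 0 because its superpixel as a whole favours class 0.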

Objects Segmentation Farabet

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Objects Segmentation Farabet

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

Binary Partition Tree (BPT) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image.

Example: C2 will be labelled with the class of C5.

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.
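The selection rule can be sketched as a walk from each leaf up to the root, keeping the node with the lowest cost. This is a minimal illustration, assuming the tree is given as parent pointers and that the cost S and a class label are already available for every node; the tree, costs and labels below are hypothetical.

```python
def optimal_cover(parent, cost, label, leaves):
    """For each leaf, pick the component on its path to the root with minimal cost."""
    result = {}
    for leaf in leaves:
        best = node = leaf
        while node is not None:
            if cost[node] < cost[best]:
                best = node
            node = parent[node]
        # The leaf inherits the class of its selected component.
        result[leaf] = (best, label[best])
    return result


# Hypothetical segmentation tree: leaves p1..p3, internal node A and root R.
parent = {"p1": "A", "p2": "A", "p3": "R", "A": "R", "R": None}
cost = {"p1": 0.9, "p2": 0.2, "p3": 0.6, "A": 0.4, "R": 0.7}
label = {"p1": "sky", "p2": "tree", "p3": "road", "A": "tree", "R": "road"}
cover = optimal_cover(parent, cost, label, ["p1", "p2", "p3"])
```

Here leaf p1 is labelled with the class of its parent A because A has a lower cost than p1 itself, while p2 and p3 keep their own labels.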

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014).

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes.

Multiscale Combinatorial Grouping (MCG) is used to generate object candidates. It combines a Cuts algorithm, a hierarchical segmenter, and a grouping strategy to combine multiscale regions.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector A: the BBOX CNN extracts feature vector 1 from the bounding box and feature vector 2 from the region, with the background masked out with the mean image. The two vectors are concatenated as [1 2].

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal.
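Masking out the background and concatenating the two feature vectors can be sketched as follows. This is only an illustration, assuming a single-channel crop, a binary region mask and a mean image of the same size; the CNN itself is not modelled, and all names and values are hypothetical.

```python
def mask_background(crop, mask, mean_image):
    """Replace background pixels (mask == 0) of a cropped box with the mean image."""
    return [
        [p if m else mu for p, m, mu in zip(crop_row, mask_row, mean_row)]
        for crop_row, mask_row, mean_row in zip(crop, mask, mean_image)
    ]


def concat_features(box_features, region_features):
    """Concatenate the box and region feature vectors into one descriptor [1 2]."""
    return list(box_features) + list(region_features)


# Toy 2 x 2 grayscale crop with its region mask and mean image (hypothetical values).
crop = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]
mean_image = [[100, 101], [102, 103]]
masked = mask_background(crop, mask, mean_image)

# Stand-ins for the two CNN outputs (hypothetical feature values).
vector = concat_features([0.1, 0.2], [0.3])
```

After mean subtraction the masked pixels become zero, so the network input carries information from the region foreground only.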

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector B: feature vector 1 comes from the BBOX CNN and feature vector 2 from a separate REGION CNN. The two vectors are concatenated as [1 2].

Training: the 2 networks are trained in isolation.

Testing: results are combined.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Vector C

Training: as a whole (using segmentation overlap).

Testing: results are combined (using the output of the penultimate layer).

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

The output of the penultimate fully connected layer is classified with an SVM.

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation.

The output of the final system (C+ref) is converted into a pixel-level category labeling (using the pasting scheme of Carreira et al.).
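Pixel IU for one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch, assuming both label maps are nested lists of the same size; the function name and toy maps are hypothetical.

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union (Jaccard index) of one category between two label maps."""
    intersection = union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == category, g == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    # Undefined when the category appears in neither map.
    return intersection / union if union else float("nan")


# Toy 2 x 3 label maps with categories 0 and 1 (hypothetical values).
pred = [[1, 1, 0],
        [0, 0, 0]]
truth = [[1, 0, 0],
         [1, 0, 0]]
iu_object = pixel_iu(pred, truth, 1)
iu_background = pixel_iu(pred, truth, 0)
```

Averaging this value over all categories gives a single score per method, which is a common way to report it.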

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 88: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 88

Objects Recognition Retrieval

Krizhevsky A Sutskever I amp Hinton G E (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems (pp 1097-1105)

Convolutional Neural Networks

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 89

Objects Recognition Retrieval

Babenko A Slesarev A Chigorin A amp Lempitsky V (2014) Neural codes for image retrieval In ECCV 2014Razavian A Azizpour H Sullivan J amp Carlsson S (2014) CNN features off-the-shelf an astounding baseline for recognition In DeepVision CVPRW 2014

Convolutional Neural Networks FC layers as global feature representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 90

Objects Recognition Retrieval

Babenko A amp Lempitsky V (2015) Aggregating local deep features for image retrieval ICCV 2015Tolias G Sicre R amp Jeacutegou H (2015) Particular object retrieval with integral max-pooling of CNN activations ICLR 2015Kalantidis Y Mellina C amp Osindero S (2015) Cross-dimensional Weighting for Aggregated Deep Convolutional Features arXiv preprint arXiv151204065

Convolutional Neural Networks

summax pooled conv features as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 91

Objects Recognition Retrieval

Ng J Yang F amp Davis L (2015) Exploiting local features from deep networks for image retrieval In DeepVision CVPRW 2015

Convolutional Neural Networks

conv features encoded with VLAD as global representation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 92

Objects Recognition Retrieval

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 93

Objects Recognition Retrieval

(336x256)Resolution

conv5_1 from VGG16[1]

(42x32)

25K centroids 25K-D vector

Query Representation

Global Search (GS)

Local Search (LS)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.


Pyramid of three spatial scales

The same parameters are shared by the three convnets: theta_i = theta_0 (filter weights H_l and biases b_l)

Non-linearity: tanh. Pooling: max.


Upsampling and concatenation
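The upsampling and concatenation step can be sketched as below: the coarser feature maps are brought back to the finest resolution and stacked so that every pixel gets one feature vector with one entry per scale. Nearest-neighbour upsampling and single-channel maps are simplifying assumptions of this sketch, not details confirmed by the slides.

```python
def upsample_nearest(fmap, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def concat_per_pixel(maps):
    """Stack same-sized single-channel maps into one feature vector per pixel."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[[m[y][x] for m in maps] for x in range(width)] for y in range(height)]

# Hypothetical outputs of the three scales: 4x4, 2x2 and 1x1.
scale1 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
scale2 = [[0.1, 0.2], [0.3, 0.4]]
scale3 = [[7.0]]
features = concat_per_pixel(
    [scale1, upsample_nearest(scale2, 2), upsample_nearest(scale3, 4)]
)
print(features[1][2])  # feature vector of the pixel at row 1, column 2
```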


Pixel-wise soft-max classifier
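A pixel-wise soft-max classifier applies the same soft-max independently at every pixel and picks the most probable class. The sketch below assumes the class scores per pixel are already computed; the 2x2 score map with 3 classes is made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pixels(score_map):
    """score_map[y][x] is a list of class scores; returns (probabilities, labels)."""
    probs = [[softmax(px) for px in row] for row in score_map]
    labels = [[p.index(max(p)) for p in row] for row in probs]
    return probs, labels

# Hypothetical 2x2 image with 3 class scores per pixel.
score_map = [
    [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]],
    [[0.0, 0.1, 4.0], [1.0, 1.0, 1.0]],
]
probs, labels = classify_pixels(score_map)
print(labels)
```

Because every pixel is decided on its own, neighbouring pixels can receive different labels even inside one object, which is the spatial consistency problem discussed next.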

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network
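A rough sketch of the superpixel idea, under the assumption that per-pixel class distributions and a superpixel id per pixel are given: the distributions are averaged inside each superpixel and every pixel takes the winning class of its superpixel. The data and the function name `label_superpixels` are hypothetical.

```python
def label_superpixels(pixel_probs, superpixel_ids):
    """pixel_probs[i]: class distribution of pixel i; superpixel_ids[i]: its superpixel."""
    sums, counts = {}, {}
    for probs, sp in zip(pixel_probs, superpixel_ids):
        acc = sums.setdefault(sp, [0.0] * len(probs))
        for c, p in enumerate(probs):
            acc[c] += p
        counts[sp] = counts.get(sp, 0) + 1
    labels = {}
    for sp, acc in sums.items():
        mean = [v / counts[sp] for v in acc]
        labels[sp] = mean.index(max(mean))
    # Every pixel inherits the label of its superpixel.
    return [labels[sp] for sp in superpixel_ids]

# Five pixels, two classes, two superpixels (toy values).
pixel_probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
superpixel_ids = [0, 0, 0, 1, 1]
smoothed = label_superpixels(pixel_probs, superpixel_ids)
print(smoothed)
```

Note how the second pixel, which on its own would be assigned class 1, is overruled by the other pixels of its superpixel.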

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

Binary Partition Tree (BPT) [Garrido, Salembier]

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S

Example: C2 will be labelled with the class of C5
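The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree, keeping the node with the lowest cost S. The tree shape and cost values below are invented for illustration (chosen so that C2 ends up covered by C5), and the leaves stand in for pixels.

```python
def optimal_component(leaf, parent, cost):
    """Return the node with minimal cost on the path from `leaf` to the root."""
    best = leaf
    node = leaf
    while node in parent:  # the root has no entry in `parent`
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best

# Hypothetical tree: C7 is the root, C5 and C6 its children, C1..C4 the leaves.
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6", "C5": "C7", "C6": "C7"}
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C4": 0.1, "C5": 0.4, "C6": 0.5, "C7": 0.9}
cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ["C1", "C2", "C3", "C4"]}
print(cover)
```

Each leaf then takes the class predicted for its selected component, so different regions of the image can be labelled at different levels of the tree.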

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

The BBOX CNN produces two feature vectors (1 and 2), one of them from the region with the background masked out with the mean image; their concatenation [1 2] is vector A

The BBOX CNN is fine-tuned to classify bounding boxes (with background), so extracting features from the region foreground is suboptimal
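Masking the background with the mean image can be sketched as a per-pixel selection: pixels inside the region keep their value and all others are replaced by the corresponding mean-image pixel. Single-channel toy images and the name `mask_background` are assumptions made for this example.

```python
def mask_background(image, mask, mean_image):
    """Keep pixels where mask is 1; replace the rest with the mean image."""
    return [
        [px if keep else mean_px for px, keep, mean_px in zip(img_row, mask_row, mean_row)]
        for img_row, mask_row, mean_row in zip(image, mask, mean_image)
    ]

# Toy 2x3 grayscale crop, its region mask and a constant mean image.
image = [[10, 20, 30], [40, 50, 60]]
mask = [[0, 1, 1], [0, 0, 1]]
mean_image = [[128, 128, 128], [128, 128, 128]]
masked = mask_background(image, mask, mean_image)
print(masked)
```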

Training: 2 networks trained in isolation

Testing: results are combined

A BBOX CNN and a REGION CNN each produce a feature vector (1 and 2); their concatenation [1 2] is vector B

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer) into vector C

Penultimate fully connected layer -> SVM


Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
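For reference, pixel IU (intersection over union, the Jaccard index) for one class is the number of pixels labelled with that class in both prediction and ground truth, divided by the number labelled with it in either. A minimal sketch on made-up 2x3 label maps:

```python
def pixel_iu(pred, gt, cls):
    """Intersection over union of class `cls` between two label maps."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            in_pred, in_gt = p == cls, g == cls
            inter += in_pred and in_gt
            union += in_pred or in_gt
    return inter / union if union else float("nan")

# Hypothetical predicted and ground-truth label maps.
pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
iu_class1 = pixel_iu(pred, gt, 1)
iu_class0 = pixel_iu(pred, gt, 0)
print(iu_class1, iu_class0)
```

Benchmarks usually report the mean of this value over all classes.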


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 92: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016)

Objects Recognition Retrieval

92

Objects Recognition Retrieval

93

Figure: image at 336x256 resolution -> conv5_1 from VGG16 [1] (42x32 feature map) -> 25K centroids -> 25K-D vector
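A minimal sketch of the encoding this slide suggests, assuming that each location of the conv5_1 feature map gives one local descriptor, that every descriptor is assigned to its nearest centroid, and that the image is described by the histogram of those assignments. The function names and the toy sizes (4 descriptors, 3 centroids) are hypothetical; the slide's setting would use 25K centroids.

```python
def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to descriptor (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bag_of_words(descriptors, centroids):
    """Histogram of nearest-centroid assignments: one bin per centroid."""
    histogram = [0] * len(centroids)
    for descriptor in descriptors:
        histogram[nearest_centroid(descriptor, centroids)] += 1
    return histogram


# Toy example: 4 local descriptors (one per feature-map location), 3 centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [1.1, 0.8]]
bow = bag_of_words(descriptors, centroids)
```

With K centroids the resulting vector has K dimensions, which matches the "25K centroids, 25K-D vector" pairing on the slide.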

Objects Recognition Retrieval

94

Query Representation

Global Search (GS)

Local Search (LS)

Objects Recognition Retrieval

95

One lecture organized in four parts

96

Local analysis for: Proposals, Detection, Recognition, Segmentation

Figure labels: person, bag, me, my bag

Objects Segmentation

97

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to every pixel in an image

Objects Segmentation Farabet

98

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. "Learning hierarchical features for scene labeling." TPAMI 2013.

Objects Segmentation Farabet

99

Pyramid of three spatial scales

Objects Segmentation Farabet

100

The same parameters in the three convnets

theta_i = theta_0 (filter weights (H_l) and biases (b_l))

Non-linearity: tanh

Pooling: max

Objects Segmentation Farabet

101

Upsampling and concatenation
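The three steps on these slides (a pyramid of three scales, the same parameters at every scale, then upsampling and concatenation) can be sketched as follows. This is only an illustration under simplifying assumptions: the pyramid is built by plain subsampling, the "convnet" is a hypothetical one-parameter stand-in (tanh of a scaled pixel), and the image size is assumed to be divisible by every scale factor.

```python
import math


def downsample(image, factor):
    """Keep every factor-th pixel in both directions (no smoothing)."""
    return [row[::factor] for row in image[::factor]]


def upsample(feature_map, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def shared_features(image, weight, bias):
    """Stand-in for the convnet: the same (weight, bias) is used at every scale."""
    return [[math.tanh(weight * pixel + bias) for pixel in row] for row in image]


def multiscale_features(image, weight, bias, factors=(1, 2, 4)):
    """Per-pixel feature vector: one value per scale, concatenated."""
    maps = []
    for factor in factors:
        scaled = downsample(image, factor)
        features = shared_features(scaled, weight, bias)
        maps.append(upsample(features, factor))  # back to the input resolution
    height, width = len(image), len(image[0])
    return [[[m[y][x] for m in maps] for x in range(width)] for y in range(height)]


image = [[float(x + y) for x in range(4)] for y in range(4)]
features = multiscale_features(image, weight=0.5, bias=0.0)
```

Each pixel ends up with one feature per scale, so coarser scales contribute context while the finest scale keeps the detail.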

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier
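As an illustration of a pixel-wise soft-max, the sketch below turns a map of per-pixel class scores into per-pixel class probabilities and then into a label map. The score values are made up, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_softmax(score_map):
    """Apply the soft-max independently at every pixel."""
    return [[softmax(scores) for scores in row] for row in score_map]


def label_map(prob_map):
    """Most probable class at every pixel (first index wins on ties)."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in prob_map]


# 2x2 image, 3 classes: scores[y][x] is the list of class scores at pixel (x, y).
scores = [[[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
          [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]
probs = pixelwise_softmax(scores)
labels = label_map(probs)
```

Because every pixel is classified independently, nothing forces neighbouring pixels to agree.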

Objects Segmentation Farabet

103

Problem: no spatial consistency among labels

3 explored solutions:

1) Superpixels

2) Conditional Random Fields

3) Parameter-free multilevel parsing

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1: Superpixels
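One way to read the superpixel solution is sketched below, assuming that the per-pixel class distributions are averaged inside each superpixel and that every pixel of the superpixel then takes the most probable class. The probability values, the superpixel map and the function name are hypothetical.

```python
def label_superpixels(prob_map, superpixel_map):
    """Give every pixel the class with the highest mean probability in its superpixel."""
    sums, counts = {}, {}
    for prob_row, sp_row in zip(prob_map, superpixel_map):
        for probs, sp in zip(prob_row, sp_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for k, p in enumerate(probs):
                acc[k] += p
            counts[sp] = counts.get(sp, 0) + 1
    sp_label = {}
    for sp, acc in sums.items():
        mean = [v / counts[sp] for v in acc]
        sp_label[sp] = max(range(len(mean)), key=mean.__getitem__)
    return [[sp_label[sp] for sp in row] for row in superpixel_map]


# 2x2 image, 2 classes. Pixel (x=1, y=0) alone would be labelled 1,
# but it belongs to superpixel 0, whose average favours class 0.
prob_map = [[[0.9, 0.1], [0.4, 0.6]],
            [[0.8, 0.2], [0.2, 0.8]]]
superpixel_map = [[0, 0],
                  [0, 1]]
smoothed = label_superpixels(prob_map, superpixel_map)
```

The isolated disagreement is removed, which is the kind of spatial consistency the pixel-wise classifier alone does not provide.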

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Objects Segmentation Farabet

106

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Binary Partition Tree (BPT) [Garrido, Salembier]

Objects Segmentation Farabet

107

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

Objects Segmentation Farabet

108

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: observation level

Contribution: automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
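The selection rule stated above can be sketched as a walk from each leaf to the root of the segmentation tree, keeping the node with the lowest cost. The tree, the node names and the cost values below are hypothetical, and the cost S is assumed to be already computed for every node.

```python
def optimal_component(leaf, parent, cost):
    """Walk from leaf to root; return the node with minimal cost on that path."""
    best = leaf
    node = leaf
    while node in parent:  # the root has no parent entry
        node = parent[node]
        if cost[node] < cost[best]:
            best = node
    return best


# Hypothetical tree: leaves p1, p2, p3; internal nodes C1, C2; root C3.
parent = {"p1": "C1", "p2": "C1", "p3": "C2", "C1": "C3", "C2": "C3"}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.2, "C1": 0.3, "C2": 0.6, "C3": 0.5}

cover = {leaf: optimal_component(leaf, parent, cost) for leaf in ("p1", "p2", "p3")}
```

Here p1 and p2 are best explained by their parent C1, while p3 keeps its own level, so different pixels end up observed at different levels of the tree.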

Objects Segmentation SDS

109

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Objects Segmentation SDS

110

Slide credit: Eduard Fontdevila

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Objects Segmentation SDS

111

Slide credit: Eduard Fontdevila

Figure: the BBOX CNN produces feature vector 1 and feature vector 2 (background masked out with the mean image); their concatenation [1 2] is vector A

The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground is suboptimal
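A rough sketch of how vector A could be assembled, assuming one feature vector comes from the bounding-box crop and the other from the same crop with the background replaced by the mean image. The `toy_cnn` function is a hypothetical stand-in for the BBOX CNN, and the crop is a single-channel toy image.

```python
def mask_background(crop, region_mask, mean_image):
    """Keep pixels inside the region; replace the rest with the mean image."""
    return [[pixel if inside else mean
             for pixel, inside, mean in zip(crop_row, mask_row, mean_row)]
            for crop_row, mask_row, mean_row in zip(crop, region_mask, mean_image)]


def toy_cnn(image):
    """Hypothetical stand-in for a CNN: returns a 2-D 'feature vector'."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]


def vector_a(crop, region_mask, mean_image, cnn=toy_cnn):
    """Concatenate box features (1) and masked-region features (2): [1 2]."""
    box_features = cnn(crop)
    region_features = cnn(mask_background(crop, region_mask, mean_image))
    return box_features + region_features


crop = [[10.0, 20.0], [30.0, 40.0]]
region_mask = [[True, False], [True, False]]
mean_image = [[5.0, 5.0], [5.0, 5.0]]
features = vector_a(crop, region_mask, mean_image)
```

Both halves come from the same network here, which is the limitation the slide points out: that network was finetuned on boxes with background, not on masked regions.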

Objects Segmentation SDS

112

Slide credit: Eduard Fontdevila

Training: 2 networks trained in isolation

Testing: results are combined

Figure: feature vector 1 from the BBOX CNN and feature vector 2 from the REGION CNN; their concatenation [1 2] is vector B

Objects Segmentation SDS

113

Slide credit: Eduard Fontdevila

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

Figure: vector C

Objects Segmentation SDS

114

Slide credit: Eduard Fontdevila

Figure: penultimate fully connected layer -> SVM

Objects Segmentation SDS

115

Slide credit: Eduard Fontdevila

Objects Segmentation SDS

116

Slide credit: Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
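For reference, the pixel IU (Jaccard index) of one category is the number of pixels labelled with that category in both the prediction and the ground truth, divided by the number labelled with it in either. A small sketch with made-up label maps (the function name is hypothetical):

```python
def pixel_iu(predicted, ground_truth, category):
    """Intersection over union of the pixels labelled `category`."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            in_pred = pred == category
            in_gt = gt == category
            intersection += in_pred and in_gt
            union += in_pred or in_gt
    # Undefined when the category appears in neither map.
    return intersection / union if union else float("nan")


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
iu = pixel_iu(predicted, ground_truth, category=1)
```

In this toy case two pixels agree on category 1 and four pixels carry it in at least one map, so the IU is 2 / 4.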

Objects Segmentation SDS

117

Slide credit: Eduard Fontdevila

One lecture organized in four parts

118

Local analysis for: Proposals, Detection, Recognition, Segmentation

Figure labels: person, bag, me, my bag

Thank you

119

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto


Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

117

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

One lecture organized in four parts

118

Detection Recognition

Local analysis for

Segmentation

person

bag

me

my bagperson

bag

Proposals

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016) 119

Thank you

httpsimatgeupceduwebpeoplexavier-giro

httpstwittercomDocXavi

httpswwwfacebookcomProfessorXavi

xaviergiroupcedu

Xavier Giroacute-i-Nieto

Page 97: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giró i Nieto, “Deep learning for vision: Objects”. Master in Multimedia, La Salle URL (May 2016)

Objects Segmentation

Slide credit: Eduard Fontdevila

Semantic segmentation: assign a category label to all pixels in an image

Objects Segmentation Farabet

Farabet, Clement, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. TPAMI 2013.

Pyramid of three spatial scales

The same parameters in the three convnets:

theta_i = theta_0 = filter weights (H_l) and biases (b_l)

Non-linearity: tanh
Pooling: max
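A minimal sketch of the parameter sharing described above, assuming a single-channel image, one 3x3 filter and a pyramid built by plain subsampling (all simplifications chosen for brevity, not taken from Farabet et al.). The function name `conv_tanh_pool` is hypothetical. The point is only that the same filter `H` and bias `b` are applied at every scale, followed by tanh and max pooling.

```python
import numpy as np

def conv_tanh_pool(image, H, b):
    """One stage: 'valid' 2-D filtering with H, bias b, tanh, then 2x2 max pooling."""
    kh, kw = H.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    fmap = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            fmap[y, x] = np.sum(image[y:y + kh, x:x + kw] * H) + b
    fmap = np.tanh(fmap)
    # Crop to an even size, then take the max over non-overlapping 2x2 blocks.
    h2, w2 = out_h // 2 * 2, out_w // 2 * 2
    blocks = fmap[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
H, b = rng.standard_normal((3, 3)), 0.1  # theta_0: one shared set of parameters

# Three scales by plain subsampling (an assumption made for brevity).
pyramid = [image, image[::2, ::2], image[::4, ::4]]
features = [conv_tanh_pool(scale, H, b) for scale in pyramid]  # same H, b at every scale
shapes = [f.shape for f in features]
print(shapes)
```

Because the parameters are shared, the number of weights does not grow with the number of scales; only the feature maps differ in size.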

Upsampling and concatenation
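As a rough illustration of this step, the sketch below resizes coarser feature maps to the finest resolution and stacks them along the channel axis. Nearest-neighbour interpolation, the map sizes and the helper name `upsample_nearest` are assumptions for the example; the slides do not specify the interpolation used.

```python
import numpy as np

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a (h, w, C) feature map to (out_h, out_w, C)."""
    rows = np.arange(out_h) * fmap.shape[0] // out_h
    cols = np.arange(out_w) * fmap.shape[1] // out_w
    return fmap[rows][:, cols]

# Hypothetical feature maps from three scales, 4 channels each.
f1 = np.ones((8, 8, 4))
f2 = 2 * np.ones((4, 4, 4))
f3 = 3 * np.ones((2, 2, 4))

h, w = f1.shape[:2]
stacked = np.concatenate(
    [f1, upsample_nearest(f2, h, w), upsample_nearest(f3, h, w)], axis=-1
)
print(stacked.shape)  # (8, 8, 12)
```

After concatenation every pixel carries one feature vector that mixes information from all three scales.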

Pixel-wise soft-max classifier
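A small sketch of a pixel-wise soft-max classifier, assuming a single linear layer on top of per-pixel feature vectors. The shapes, the random weights and the name `pixelwise_softmax` are illustrative assumptions. Each pixel is classified on its own, independently of its neighbours.

```python
import numpy as np

def pixelwise_softmax(features, W, b):
    """Linear classifier + softmax applied independently at every pixel.

    features: (H, W_img, D) per-pixel feature vectors.
    W:        (D, K) weights, b: (K,) biases.
    Returns (H, W_img, K) class probabilities.
    """
    logits = features @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 12))
W, b = rng.standard_normal((12, 3)), np.zeros(3)
probs = pixelwise_softmax(features, W, b)
labels = probs.argmax(axis=-1)  # one class index per pixel
print(probs.shape, labels.shape)
```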

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network

Solution 2: Superpixels + CRF

Prediction with a 2-layer network
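A minimal sketch of the superpixel idea in Solution 1, assuming per-pixel class scores are already available and superpixels are given as an integer label map. The function name `label_by_superpixel` and the toy arrays are hypothetical. Scores are averaged inside each superpixel and all its pixels receive the winning class, which enforces consistency within a superpixel. The CRF of Solution 2 is not shown.

```python
import numpy as np

def label_by_superpixel(probs, superpixels):
    """Assign one class per superpixel by averaging per-pixel class scores.

    probs:       (H, W, K) array of per-pixel class scores.
    superpixels: (H, W) integer array, one id per superpixel.
    Returns an (H, W) array of class indices.
    """
    labels = np.empty(superpixels.shape, dtype=np.int64)
    for sp_id in np.unique(superpixels):
        mask = superpixels == sp_id
        mean_scores = probs[mask].mean(axis=0)  # average over the superpixel
        labels[mask] = int(np.argmax(mean_scores))
    return labels

# Toy example: 2x4 image, 2 classes, 2 superpixels (left half, right half).
probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]],
    [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.1, 0.9]],
])
superpixels = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1]])
per_pixel = np.argmax(probs, axis=-1)            # noisy, pixel-by-pixel decision
smoothed = label_by_superpixel(probs, superpixels)
print(per_pixel)
print(smoothed)
```

In the toy data the per-pixel decision flips inside each half, while the superpixel average gives one label per half.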

Solution 3: Multi-level parsing

Problem with Solutions 1 & 2: the observation level

BPT (Binary Partition Tree) [Garrido, Salembier]

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S
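The selection rule above can be sketched as a walk from each leaf to the root that keeps the cheapest node seen. The tree encoding (a `parent` dictionary), the node names and the cost values below are hypothetical; how the cost S is computed for each component is not covered here.

```python
def optimal_components(parent, cost, leaves):
    """For each leaf, return the node on its leaf-to-root path with minimal cost.

    parent: dict mapping node -> parent node (the root maps to None).
    cost:   dict mapping node -> cost S of that component.
    leaves: iterable of leaf nodes (pixels).
    """
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent[node]
        best[leaf] = best_node
    return best

# Hypothetical tree: leaves p1..p3; A = {p1, p2}, B = {p3}, root = A + B.
parent = {"p1": "A", "p2": "A", "p3": "B", "A": "root", "B": "root", "root": None}
cost = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "A": 0.2, "B": 0.6, "root": 0.4}
cover = optimal_components(parent, cost, ["p1", "p2", "p3"])
print(cover)
```

Here p1 and p2 take the label of component A, while p3 skips its direct parent B and takes the label of the root, because the root has the lower cost on that path.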

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbelaez, Girshick, Malik. Simultaneous Detection and Segmentation (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions

Vector A: BBOX CNN only

Two feature vectors (1 and 2) are extracted with the same BBOX CNN and concatenated: [1 2]

For the region features, the background is masked out with the mean image

The BBOX CNN is finetuned to classify bboxes (with background), so extracting features from the region foreground with it is suboptimal
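A short sketch of masking the background with the mean image, assuming the crop, the region mask and the mean image share the same spatial size. The function name `mask_background` and the toy values are hypothetical.

```python
import numpy as np

def mask_background(crop, region_mask, mean_image):
    """Replace background pixels of a box crop with the mean image.

    crop:        (H, W, 3) float array, the bounding-box crop.
    region_mask: (H, W) boolean array, True on the region foreground.
    mean_image:  (H, W, 3) float array (or anything broadcastable to crop).
    """
    return np.where(region_mask[..., None], crop, mean_image)

crop = np.full((2, 2, 3), 200.0)
region_mask = np.array([[True, False],
                        [False, True]])
mean_image = np.full((2, 2, 3), 120.0)
masked = mask_background(crop, region_mask, mean_image)
print(masked[..., 0])
```

One consequence of this choice is that, if the network input is mean-subtracted, the masked pixels become exactly zero.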

Vector B: BBOX CNN + REGION CNN

Training: 2 networks trained in isolation

Testing: results are combined

Feature vector 1 (BBOX CNN) and feature vector 2 (REGION CNN) are concatenated: [1 2]

Vector C

Training: as a whole (using segmentation overlap)

Testing: results are combined (using the output of the penultimate layer)

The output of the penultimate fully connected layer is fed to an SVM
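To make the data flow concrete, the sketch below concatenates two feature vectors into the "[1 2]" vector and scores it with one linear SVM per class. The linear form, the tiny dimensions and the hand-picked weights are assumptions for illustration only; no training is shown, and `svm_scores` is a hypothetical name.

```python
import numpy as np

def svm_scores(box_feat, region_feat, weights, bias):
    """Score one candidate with per-class linear SVMs on concatenated features.

    box_feat, region_feat: 1-D feature vectors from the two penultimate layers.
    weights: (num_classes, D) matrix, D = len(box_feat) + len(region_feat).
    bias:    (num_classes,) vector.
    """
    feat = np.concatenate([box_feat, region_feat])  # the "[1 2]" vector
    return weights @ feat + bias

box_feat = np.array([1.0, 0.0])
region_feat = np.array([0.0, 2.0])
weights = np.array([[1.0, 0.0, 0.0, 1.0],    # class 0 (illustrative values)
                    [-1.0, 0.0, 0.0, 0.5]])  # class 1 (illustrative values)
bias = np.array([0.0, 0.5])
scores = svm_scores(box_feat, region_feat, weights, bias)
print(scores)
```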

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level category labeling (using the pasting scheme of Carreira et al.)
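For reference, pixel IU for one class is the number of pixels labelled with that class in both the prediction and the ground truth, divided by the number labelled with it in either. A minimal sketch follows, with hypothetical label maps and the assumed helper name `pixel_iu`.

```python
import numpy as np

def pixel_iu(pred, gt, num_classes):
    """Per-class intersection over union (Jaccard index) between two label maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious

pred = np.array([[0, 0, 1],
                 [0, 1, 1]])
gt = np.array([[0, 1, 1],
               [0, 1, 1]])
ious = pixel_iu(pred, gt, num_classes=2)
print(ious)
```

In this toy case class 0 has 2 shared pixels out of 3 in the union, and class 1 has 3 out of 4.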

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto




Page 101: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

101

Upsampling and concatenation

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

102

Pixel-wise soft-max classifier

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

103

Problem No spatial consistency among labels

3 explored solutions

1) Superpixels2) Conditional Random Fields3) Parameter-free multilevel parsing

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

104

Prediction with a 2-layer network

Solution 1 Superpixels

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

105

Prediction with a 2-layer network

Solution 2 Superpixels + CRF

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

106

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

BPT [Garrido Salembier]

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

107

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)

One lecture organized in four parts

Local analysis for: Proposals, Detection, Recognition, Segmentation

Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Objects Segmentation Farabet

Pixel-wise soft-max classifier
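A pixel-wise soft-max classifier turns the per-class scores at each pixel into a probability distribution, independently of the neighbouring pixels. The sketch below assumes the network output is available as a nested list `score_map[y][x]` of class scores; the names and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over one pixel's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixels(score_map):
    """Return the per-pixel argmax labels and the class probabilities."""
    probs = [[softmax(scores) for scores in row] for row in score_map]
    labels = [[max(range(len(p)), key=p.__getitem__) for p in row]
              for row in probs]
    return labels, probs


# 2 x 2 image, 2 classes.
score_map = [[[2.0, 0.0], [0.0, 0.0]],
             [[-1.0, 3.0], [0.5, 0.1]]]
labels, probs = classify_pixels(score_map)
```

Because every pixel is decided on its own, nothing in this step encourages neighbouring pixels to agree.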

Problem: No spatial consistency among labels

3 explored solutions:

1) Superpixels
2) Conditional Random Fields
3) Parameter-free multilevel parsing

Solution 1: Superpixels

Prediction with a 2-layer network
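One common way to use superpixels for spatial consistency is to pool the per-pixel class probabilities inside each superpixel and give all of its pixels the winning class. The sketch below assumes that pooling rule; the slide does not spell it out, and `label_superpixels` and its inputs are hypothetical.

```python
def label_superpixels(pixel_probs, superpixel_ids):
    """Give every pixel the class with the largest pooled probability
    inside its superpixel (the sum has the same argmax as the average)."""
    sums = {}
    for prob_row, id_row in zip(pixel_probs, superpixel_ids):
        for probs, sp in zip(prob_row, id_row):
            acc = sums.setdefault(sp, [0.0] * len(probs))
            for k, p in enumerate(probs):
                acc[k] += p
    best = {sp: max(range(len(acc)), key=acc.__getitem__)
            for sp, acc in sums.items()}
    return [[best[sp] for sp in id_row] for id_row in superpixel_ids]


# One row of three pixels, two classes; pixels 0 and 1 share superpixel 0.
pixel_probs = [[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]]
superpixel_ids = [[0, 0, 1]]
smoothed = label_superpixels(pixel_probs, superpixel_ids)
```

Taken alone, the middle pixel would be assigned class 1, but its superpixel as a whole favours class 0, so the pooled labeling overrides it.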

Solution 2: Superpixels + CRF

Prediction with a 2-layer network

Solution 3: Multi-level parsing

Problems with Solutions 1 & 2: Observation level

BPT [Garrido, Salembier]

Contribution: Automatically discover the best observation level (optimal cover) for each pixel in the image

For each pixel (leaf) i, the optimal component C_i is the one along the path between the leaf and the root with minimal cost S.

C2 will be labelled with the class of C5
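The selection rule can be sketched as a walk from each leaf up to the root of the segmentation tree. The tree shape, the component names and the cost values below are made up for illustration, chosen so that the result matches the example on the slide; how the cost S is computed is not covered here.

```python
def optimal_cover(parent, cost, leaves):
    """Map each leaf to the minimal-cost component on its leaf-to-root path."""
    best = {}
    for leaf in leaves:
        node, best_node = leaf, leaf
        while node is not None:
            if cost[node] < cost[best_node]:
                best_node = node
            node = parent.get(node)  # None once the root has been visited
        best[leaf] = best_node
    return best


# Hypothetical tree: C1, C2 -> C5; C3, C4 -> C6; C5, C6 -> C7 (root).
parent = {"C1": "C5", "C2": "C5", "C3": "C6", "C4": "C6",
          "C5": "C7", "C6": "C7"}
# Hypothetical costs S for every component.
cost = {"C1": 0.2, "C2": 0.6, "C3": 0.3, "C4": 0.5,
        "C5": 0.4, "C6": 0.7, "C7": 0.9}

cover = optimal_cover(parent, cost, ["C1", "C2", "C3", "C4"])
```

With these costs, C2 is covered by its parent C5 and therefore takes the class of C5, while the other leaves keep their own component.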

Objects Segmentation SDS

Slide credit: Eduard Fontdevila

Hariharan, Arbeláez, Girshick, Malik, “Simultaneous Detection and Segmentation” (ECCV 2014)

Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Cuts algorithm
Hierarchical segmenter
Grouping strategy to combine multiscale regions

Page 108: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation Farabet

108

Solution 3 Multi-level parsing

Problems with Solutions 1 amp 2 Observation level

Contribution Automatically discover the best observation level (optimal cover) for each pixel in the image

C2 will be labelled with the class of C5

For each pixel (leaf) i the optimal component is the C_i is the one along the path between the leaf and the root with minimal cost S

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

109

Slide credit Eduard Fontdevila

Hariharan Arbelaez Girshick Malik Simultaneous Detection and Segmentation (ECCV 2014)

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

110

Slide credit Eduard Fontdevila

Interest in obtaining segments not just bounding boxes

Multiscale combinational grouping (MCG) to generate object candidates

Cuts algorithm

Hierarchical segmenter

Grouping strategy to combine

multiscale regions

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

111

Slide credit Eduard Fontdevila

BBOX CNNfeature vector

1

feature vector

2

[1 2]

Finetuned to classify bboxes (with background) so extracting features from the region foreground is

suboptimal

BBOX CNN

vector A

background masked out with the mean image

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

112

Slide credit Eduard Fontdevila

Training 2 networks trained in isolation

Testing results are combined

BBOX CNNfeature vector

1

feature vector

2

[1 2]

REGION CNN

vector B

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

113

Slide credit Eduard Fontdevila

Training as a whole (using segmentation overlap)

Testing results are combined (using the output of the penultimate layer)

vector C

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

114

Slide credit Eduard Fontdevila

penultimate fully connected layer

SVM

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

115

Slide credit Eduard Fontdevila

Xavier Giroacute i Nieto ldquoDeep learning for vision Objectsrdquo Master in Multimedia La Salle URL (May 2016)

Objects Segmentation SDS

116

Slide credit Eduard Fontdevila

Results on pixel IU (Jaccard index) to evaluate semantic segmentation

Convert the output of the final system (C+ref) into a pixel-level

category labeling (using pasting scheme Carreira et al)


One lecture organized in four parts

Local analysis for: Detection, Recognition, Segmentation, Proposals

(Figure labels: person, bag, me, my bag)


Thank you

https://imatge.upc.edu/web/people/xavier-giro

https://twitter.com/DocXavi

https://www.facebook.com/ProfessorXavi

xavier.giro@upc.edu

Xavier Giró-i-Nieto

Page 109: Deep Learning for Computer Vision (2/4): Object Analytics @ laSalle 2016


Objects Segmentation SDS


Hariharan, Arbeláez, Girshick, Malik, “Simultaneous Detection and Segmentation” (ECCV 2014)


Interest in obtaining segments, not just bounding boxes

Multiscale Combinatorial Grouping (MCG) to generate object candidates:

Normalized cuts algorithm

Hierarchical segmenter

Grouping strategy to combine multiscale regions










