TRANSCRIPT
Deep learning and weak supervision for image classification
Matthieu Cord. Joint work with Thibaut Durand and Nicolas Thome.
Sorbonne Universités - Université Pierre et Marie Curie (UPMC)
Laboratoire d’Informatique de Paris 6 (LIP6) - MLIA Team
UMR CNRS
June 09, 2016
1/35
Outline
Context: visual classification
1. MANTRA: a latent variable model to boost classification performance
2. WELDON: extension to deep CNNs
2/35
Motivations
• Working on datasets with complex scenes (large and cluttered background), non-centered objects, variable sizes, ...

VOC07/12 | MIT67 | 15 Scene | COCO | VOC12 Action

• Selecting relevant regions → better prediction
• ImageNet: centered objects
  – Efficient transfer needs bounding boxes [Oquab, CVPR14]
• Full annotations are expensive ⇒ training with weak supervision
3/35
Motivations
How to learn without bounding boxes?
• Multiple-Instance Learning / latent variables for missing information [Felzenszwalb, PAMI10]
• Latent SVM and extensions ⇒ MANTRA

How to learn deep networks without bounding boxes?
• Learning invariance with input image transformations
  – Spatial Transformer Networks [Jaderberg, NIPS15]
• Attention models: select relevant regions
  – Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
• Part models
  – Automatic discovery and optimization of parts for image classification [Parizi, ICLR15]
• Deep MIL
  – Is object localization for free? [Oquab, CVPR15]
  – Deep extension of MANTRA: WELDON
4/35
Notations
  Variable   Notation   Space   Train        Test         Example
  Input      x          X       observed     observed     image
  Output     y          Y       observed     unobserved   label
  Latent     h          H       unobserved   unobserved   region

• Model missing information with latent variables h
• Most popular approach in computer vision: Latent SVM [Felzenszwalb, PAMI10] [Yu, ICML09]
5/35
Latent Structural SVM [Yu, ICML09]
• Prediction function:

  (ŷ, ĥ) = argmax_{(y,h)∈Y×H} 〈w, Ψ(x, y, h)〉   (1)

  – Ψ(x, y, h): joint feature map
  – Joint inference in the (Y×H) space
• Training: a set of N labeled training pairs (x_i, y_i)
  – Objective function: upper bound of Δ(y_i, ŷ_i)
  min_w ½‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h)∈Y×H} (Δ(y_i, y) + 〈w, Ψ(x_i, y, h)〉) − max_{h∈H} 〈w, Ψ(x_i, y_i, h)〉 ]

  – Each summand upper-bounds Δ(y_i, ŷ_i)
  – Difference of convex functions, solved with CCCP
  – Loss-augmented inference (LAI): max_{(y,h)∈Y×H} [Δ(y_i, y) + 〈w, Ψ(x_i, y, h)〉]
  – Challenge exacerbated in the latent case: search over the (Y×H) space
6/35
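As a sketch of the joint inference in Eq. (1), the argmax over (Y×H) can be run exhaustively when both spaces are small. The toy feature map and region scores below are illustrative assumptions, not the talk's actual instantiation:

```python
import numpy as np

def lssvm_predict(w, psi, Y, H):
    """Exhaustive joint inference: argmax over (y, h) of <w, Psi(x, y, h)>."""
    best_score, best_pair = -np.inf, None
    for y in Y:
        for h in H:
            score = float(np.dot(w, psi(y, h)))
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair, best_score

# Toy instantiation (hypothetical): 2 classes, 3 candidate regions.
# psi reduces to a 1-D feature so that <w, psi> is just a region score.
region_scores = np.array([[0.2, 1.5, -0.3],   # class 0
                          [0.9, -0.1, 0.4]])  # class 1
w = np.array([1.0])
psi = lambda y, h: np.array([region_scores[y, h]])

(y_hat, h_hat), s = lssvm_predict(w, psi, Y=[0, 1], H=[0, 1, 2])
print(y_hat, h_hat, s)  # best (class, region) pair and its score
```

For image classification, H would be the set of candidate regions, so exhaustive search stays tractable per class; for structured outputs such as rankings, the slide on instantiations relies on exact and efficient solvers instead.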
MANTRA: Minimum Maximum Latent Structural SVM
Classifying only with the max-scoring latent value is not always relevant.

MANTRA model:
• Pair of latent variables (h⁺_{i,y}, h⁻_{i,y})
  – max-scoring latent value: h⁺_{i,y} = argmax_{h∈H} 〈w, Ψ(x_i, y, h)〉
  – min-scoring latent value: h⁻_{i,y} = argmin_{h∈H} 〈w, Ψ(x_i, y, h)〉
• New scoring function:

  D_w(x_i, y) = 〈w, Ψ(x_i, y, h⁺_{i,y})〉 + 〈w, Ψ(x_i, y, h⁻_{i,y})〉   (2)

• Prediction function ⇒ find the output with the maximum score:

  ŷ = argmax_{y∈Y} D_w(x_i, y)   (3)

• MANTRA: max+min, vs. max for LSSVM ⇒ negative evidence
7/35
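The max+min scoring and prediction of Eqs. (2)-(3) can be sketched in a few lines of numpy; the region scores below are toy values, not outputs of a trained model:

```python
import numpy as np

def mantra_score(region_scores):
    """Eq. (2): D_w(x, y) = max-region score + min-region score, per class.
    region_scores: (num_classes, num_regions) array of <w, Psi(x, y, h)>."""
    return region_scores.max(axis=1) + region_scores.min(axis=1)

def mantra_predict(region_scores):
    """Eq. (3): predict the class with the highest max+min score."""
    return int(np.argmax(mantra_score(region_scores)))

# Toy region scores (assumed values) for 3 classes over 4 regions.
scores = np.array([[1.5,  0.8,  0.9,  0.5],    # strong evidence, no counter-evidence
                   [1.2, -0.5,  0.1,  0.0],    # some evidence, some counter-evidence
                   [0.2, -1.7, -0.3, -0.1]])   # mostly negative evidence
print(mantra_score(scores))   # [2.0, 0.7, -1.5]
print(mantra_predict(scores)) # 0
```

Note how the min term penalizes a class whenever some region carries strong evidence of its absence, which a pure max would ignore.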
MANTRA: Model & Training Rationale
Intuition of the max+min prediction function
• x: image, h: image region, y: image class
• 〈w, Ψ(x, y, h)〉: score of region h for class y
• D_w(x, y) = 〈w, Ψ(x, y, h⁺_y)〉 + 〈w, Ψ(x, y, h⁻_y)〉
  – h⁺_y: presence of class y ⇒ large for y_i
  – h⁻_y: localized evidence of the absence of class y
  – Not too low for y_i ⇒ latent space regularization
  – Low for y ≠ y_i ⇒ tracking negative evidence [Parizi, ICLR15]

street image x: D_w(x, street) = 2, D_w(x, highway) = 0.7, D_w(x, coast) = −1.5
8/35
MANTRA: Model Training
Learning formulation
• Loss function: ℓ_w(x_i, y_i) = max_{y∈Y} [Δ(y_i, y) + D_w(x_i, y)] − D_w(x_i, y_i)
  – (Margin rescaling) upper bound of Δ(y_i, ŷ); equivalent constraints:

  ∀y ≠ y_i:  D_w(x_i, y_i)  ≥  Δ(y_i, y) + D_w(x_i, y)
             (ground-truth score ≥ margin + score for any other output)
• Non-convex optimization problem:

  min_w ½‖w‖² + (C/N) Σ_{i=1}^N ℓ_w(x_i, y_i)   (4)

• Solver: non-convex one-slack cutting plane [Do, JMLR12]
  – Fast convergence
  – Direct optimization, unlike CCCP for LSSVM
  – Still needs to solve LAI: max_y [Δ(y_i, y) + D_w(x_i, y)]
9/35
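A minimal sketch of the per-example loss ℓ_w summed in Eq. (4), assuming a 0/1 task loss and precomputed per-class scores D_w (the score values are toy assumptions):

```python
import numpy as np

def mantra_loss(D, y_gt, delta):
    """ell_w(x_i, y_i) = max_y [Delta(y_gt, y) + D(y)] - D(y_gt).
    D: per-class scores D_w(x_i, .); delta: task loss."""
    augmented = np.array([delta(y_gt, y) + D[y] for y in range(len(D))])
    return float(augmented.max() - D[y_gt])

zero_one = lambda a, b: float(a != b)  # 0/1 task loss

D = np.array([2.0, 0.7, -1.5])                 # assumed per-class scores
print(mantra_loss(D, y_gt=0, delta=zero_one))  # 0.0: margin constraints satisfied
print(mantra_loss(D, y_gt=1, delta=zero_one))  # 2.3: ground truth scored too low
```

The inner maximization is exactly the LAI problem mentioned above; here it is exhaustive, which matches the binary/multi-class instantiations of the next slide.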
MANTRA: Optimization
• MANTRA instantiation: define (x, y, h), Ψ(x, y, h), Δ(y_i, y)
• Instantiations: binary & multi-class classification, AP ranking

              Binary                 Multi-class                          AP Ranking
  x           bag (set of regions)   bag (set of regions)                 set of bags (of regions)
  y           ±1                     1, …, K                              ranking matrix
  h           instance (region)      region                               regions
  Ψ(x, y, h)  y · Φ(x, h)            [1(y=1)Φ(x, h), …, 1(y=K)Φ(x, h)]    joint latent ranking feature map
  Δ(y_i, y)   0/1 loss               0/1 loss                             AP loss
  LAI         exhaustive             exhaustive                           exact and efficient

• Solve inference max_y D_w(x_i, y) & LAI max_y [Δ(y_i, y) + D_w(x_i, y)]
  – Exhaustive for binary/multi-class classification
  – Exact and efficient solutions for ranking
10/35
WELDON: Weakly supErvised Learning of Deep cOnvolutional Nets

• MANTRA extension for training deep CNNs
• Learning Ψ(x, y, h): end-to-end learning of deep CNNs with structured prediction and latent variables
  – Incorporating multiple positive & negative evidence
  – Training deep CNNs with a structured loss
11/35
Standard deep CNN architecture: VGG16
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015
12/35
MANTRA adaptation for deep CNN
Problem
• Fixed-size image as input

Adapt the architecture to weakly supervised learning:
1. Fully connected layers → convolution layers
   – sliding window approach
2. Spatial aggregation
   – Perform object localization prediction
13/35
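Step 1 (fully connected → convolution) amounts to reshaping the FC weight matrix into convolution kernels, so the same classifier slides over inputs larger than the training size. A minimal numpy sketch with illustrative (non-VGG16) shapes:

```python
import numpy as np

def fc_to_conv(W_fc, in_ch, kh, kw):
    """Reinterpret FC weights of shape (out, in_ch*kh*kw) as convolution
    kernels of shape (out, in_ch, kh, kw). Applied convolutionally, the
    layer scores every kh x kw window of a larger feature map."""
    return W_fc.reshape(W_fc.shape[0], in_ch, kh, kw)

# Illustrative shapes: an "FC" layer trained on 2x2 maps with 3 channels.
W_fc = np.arange(4 * 3 * 2 * 2, dtype=float).reshape(4, 12)
W_conv = fc_to_conv(W_fc, in_ch=3, kh=2, kw=2)
print(W_conv.shape)  # (4, 3, 2, 2)
```

In VGG16 the corresponding conversion would turn fc6 into a 7×7 convolution over 512 channels; the output is then a spatial map of class scores rather than a single vector.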
WELDON: deep architecture
• C: number of classes
14/35
Aggregation function
[Oquab, 2015]
• Region aggregation = max
• Select the highest-scoring window

original image | motorbike feature map | max prediction

Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015
15/35
WELDON: region aggregation
Aggregation strategy:
• max+min pooling (MANTRA prediction function)
• k instances
  – from a single region to multiple high-scoring regions:

    max → (1/k) Σ_{i=1}^k (i-th max)      min → (1/k) Σ_{i=1}^k (i-th min)

  – more robust region selection [Vasconcelos, CVPR15]

max | max+min | 3max+3min
16/35
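The k-instance aggregation above can be sketched directly with a sort; the region scores below are toy values:

```python
import numpy as np

def topk_agg(scores, k=3):
    """Mean of the k highest plus mean of the k lowest region scores,
    per class. scores: (num_classes, num_regions)."""
    s = np.sort(scores, axis=1)
    return s[:, -k:].mean(axis=1) + s[:, :k].mean(axis=1)

scores = np.array([[0.9, 0.1, 0.7, -0.2, 0.5, 0.3]])  # toy scores, 1 class
print(topk_agg(scores, k=1))  # max+min pooling: 0.9 + (-0.2) = 0.7
print(topk_agg(scores, k=3))  # 3max+3min pooling
```

With k = 1 this reduces exactly to MANTRA's max+min; larger k averages several high- and low-scoring windows, making the selection less sensitive to a single noisy region.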
WELDON: architecture
17/35
WELDON: learning
• Objective function for a multi-class task and k = 1:

  min_w R(w) + (1/N) Σ_{i=1}^N ℓ(f_w(x_i), y_i^gt)

  f_w(x_i) = argmax_y [ max_h L^w_conv(x_i, y, h) + min_{h′} L^w_conv(x_i, y, h′) ]

How to learn the deep architecture?
• Stochastic gradient descent training.
• Back-propagation of the error through the selected windows.
18/35
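Back-propagation through the selected windows works like max-pooling's backward pass: the gradient reaches only the argmax and argmin regions, all other windows get zero. A minimal numpy sketch with toy scores and k = 1:

```python
import numpy as np

def select_forward(scores):
    """Forward pass: per class, score = max window + min window,
    remembering which windows were selected."""
    rows = np.arange(scores.shape[0])
    h_max, h_min = scores.argmax(axis=1), scores.argmin(axis=1)
    out = scores[rows, h_max] + scores[rows, h_min]
    return out, (h_max, h_min)

def select_backward(grad_out, cache, shape):
    """Backward pass: the error reaches only the selected windows."""
    h_max, h_min = cache
    grad = np.zeros(shape)
    rows = np.arange(shape[0])
    grad[rows, h_max] += grad_out
    grad[rows, h_min] += grad_out
    return grad

scores = np.array([[0.2, 1.5, -0.3]])  # toy window scores, 1 class
out, cache = select_forward(scores)
grad = select_backward(np.array([1.0]), cache, scores.shape)
print(out)   # [1.2]
print(grad)  # [[0. 1. 1.]] -- only the max and min windows get gradient
```

This routing is what the next two slides illustrate: when the class is present the selected windows' scores are pushed up, when it is absent they are pushed down.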
WELDON: learning
Class is present
• Increase the score of the selected windows.
Figure: Car map
19/35
WELDON: learning
Class is absent
• Decrease the score of the selected windows.
Figure: Boat map
20/35
Experiments
• VGG16 pre-trained on ImageNet
• Torch7 implementation

Datasets
• Object recognition: Pascal VOC 2007, Pascal VOC 2012
• Scene recognition: MIT67, 15 Scene
• Visual recognition where context plays an important role: COCO, Pascal VOC 2012 Action
VOC07/12 MIT67 15 Scene COCO VOC12 Action
21/35
Experiments
  Dataset        Train     Test      Classes   Classification
  VOC07          ~5,000    ~5,000    20        multi-label
  VOC12          ~5,700    ~5,800    20        multi-label
  15 Scene       1,500     2,985     15        multi-class
  MIT67          5,360     1,340     67        multi-class
  VOC12 Action   ~2,000    ~2,000    10        multi-label
  COCO           ~80,000   ~40,000   80        multi-label
22/35
Experiments
• Multi-scale: 8 scales (combined with the Object Bank strategy)
23/35
Object recognition
                            VOC 2007   VOC 2012
  VGG16 (online code) [1]   84.5       82.8
  SPP net [2]               82.4       –
  Deep WSL MIL [3]          –          81.8
  WELDON                    90.2       88.5
Table: mAP results on object recognition datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015
24/35
Scene recognition
                            15 Scene   MIT67
  VGG16 (online code) [1]   91.2       69.9
  MOP CNN [2]               –          68.9
  Negative parts [3]        –          77.1
  WELDON                    94.3       78.0
Table: Multi-class accuracy results on scene categorization datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Gong et al. Multi-scale Orderless Pooling of Deep Convolutional Activation Features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015
25/35
Context datasets
                              VOC 2012 Action   COCO
  VGG16 (online code) [1]     67.1              59.7
  Deep WSL MIL [2]            –                 62.8
  WELDON (our WSL deep CNN)   75.0              68.8
Table: mAP results on context datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Oquab et al. Is object localization for free? CVPR 2015
26/35
Visual results
Aeroplane model (1.8) Bus model (-0.4)
27/35
Visual results
Motorbike model (1.1) Sofa model (-0.8)
28/35
Visual results
Sofa model (1.2) Horse model (-0.6)
29/35
Visual results (failing examples)
Buffet Restaurant kitchen
30/35
Visual results (failing examples)
Kindergarden Classroom
31/35
Analysis
Impact of the different improvements:

  a) max   b) +k=3   c) +min   d) +AP    VOC07   VOC12 Action
  ✓        –         –         –         83.6    53.5
  ✓        ✓         –         –         86.3    62.6
  ✓        –         ✓         –         87.5    68.4
  ✓        ✓         ✓         –         88.4    71.7
  ✓        ✓         –         ✓         87.8    69.8
  ✓        ✓         ✓         ✓         88.9    72.6

• WSL detection results on VOC 2012 Action:

                            IoU
  max (a) [Oquab, 2015]     25.6
  WELDON                    30.4
32/35
Analysis
• Impact of the number of regions k
k=1 k=3
33/35
Connections to other latent variable models

• Hidden CRF (HCRF) [Quattoni, PAMI07]

  ½‖w‖² + (C/N) Σ_{i=1}^N [ log Σ_{(y,h)∈Y×H} exp〈w, Ψ(x_i, y, h)〉 − log Σ_{h∈H} exp〈w, Ψ(x_i, y_i, h)〉 ]

• Latent Structural SVM (LSSVM) [Yu, ICML09]

  ½‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h)∈Y×H} (Δ(y_i, y) + 〈w, Ψ(x_i, y, h)〉) − max_{h∈H} 〈w, Ψ(x_i, y_i, h)〉 ]

• Marginal Structural SVM (MSSVM) [Ping, ICML14]

  ½‖w‖² + (C/N) Σ_{i=1}^N [ max_y (Δ(y_i, y) + log Σ_{h∈H} exp〈w, Ψ(x_i, y, h)〉) − log Σ_{h∈H} exp〈w, Ψ(x_i, y_i, h)〉 ]

• WELDON

  ½‖w‖² + (C/N) Σ_{i=1}^N [ max_y (Δ(y_i, y) + Σ_{h∈Ω⊆H} 〈w, Ψ(x_i, y, h)〉) − Σ_{h∈Ω⊆H} 〈w, Ψ(x_i, y_i, h)〉 ]
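The four objectives differ mainly in how they aggregate region scores: a hard max (LSSVM), a log-sum-exp soft max (HCRF, MSSVM), or a sum over a selected subset Ω (WELDON). A toy comparison for one class, with assumed scores:

```python
import numpy as np

s = np.array([2.0, 0.5, -1.0, 0.3])  # toy region scores for one class

hard_max = s.max()                    # LSSVM: single best region
soft_max = np.log(np.exp(s).sum())    # HCRF / MSSVM: log-sum-exp over h
s_sorted = np.sort(s)
weldon = s_sorted[-1] + s_sorted[0]   # WELDON with Omega = {max, min} regions

print(hard_max, round(soft_max, 3), weldon)
```

The log-sum-exp lies above the hard max and lets every region contribute, while WELDON's subset sum is the only aggregation that explicitly subtracts negative evidence from the class score.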
34/35
Thibaut Durand Nicolas Thome Matthieu Cord
MLIA Team (Patrick Gallinari)
Sorbonne Universités - UPMC Paris 6 - LIP6

MANTRA project page: http://webia.lip6.fr/~durandt/project/mantra.html
Thibaut Durand, Nicolas Thome, and Matthieu Cord. MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking. In IEEE International Conference on Computer Vision (ICCV), 2015.

Thibaut Durand, Nicolas Thome, and Matthieu Cord. WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
35/35