TRANSCRIPT
Deep learning and weak supervision for image classification
Matthieu Cord. Joint work with Thibaut Durand and Nicolas Thome.
Sorbonne Universités - Université Pierre et Marie Curie (UPMC)
Laboratoire d’Informatique de Paris 6 (LIP6) - MLIA Team
UMR CNRS
June 09, 2016
1/35
Outline
Context: visual classification
1. MANTRA: a latent variable model to boost classification performance
2. WELDON: extension to deep CNNs
2/35
Motivations
• Working on datasets with complex scenes (large and cluttered background), non-centered objects, variable sizes, ...

VOC07/12 | MIT67 | 15 Scene | COCO | VOC12 Action

• Selecting relevant regions → better prediction
• ImageNet: centered objects
  – Efficient transfer needs bounding boxes [Oquab, CVPR14]
• Full annotations are expensive ⇒ training with weak supervision
3/35
Motivations
How to learn without bounding boxes?
• Multiple-Instance Learning / latent variables for missing information [Felzenszwalb, PAMI10]
• Latent SVM and extensions ⇒ MANTRA

How to learn deep networks without bounding boxes?
• Learning invariance with input image transformations
  – Spatial Transformer Networks [Jaderberg, NIPS15]
• Attention models: select relevant regions
  – Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
• Part models
  – Automatic discovery and optimization of parts for image classification [Parizi, ICLR15]
• Deep MIL
  – Is object localization for free? [Oquab, CVPR15]
  – Deep extension of MANTRA: WELDON
4/35
Notations
  Variable   Notation   Space   Train        Test         Example
  Input      x          X       observed     observed     image
  Output     y          Y       observed     unobserved   label
  Latent     h          H       unobserved   unobserved   region

• Model missing information with latent variables h
• Most popular approach in computer vision: Latent SVM [Felzenszwalb, PAMI10] [Yu, ICML09]
5/35
Latent Structural SVM [Yu, ICML09]
• Prediction function:

  (ŷ, ĥ) = argmax_{(y,h)∈Y×H} 〈w, Ψ(x, y, h)〉   (1)

  – Ψ(x, y, h): joint feature map
  – Joint inference in the (Y×H) space
• Training: a set of N labeled training pairs (x_i, y_i)
  – Objective function: upper bound of Δ(y_i, ŷ_i)
  min_w ½‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h)∈Y×H} (Δ(y_i, y) + 〈w, Ψ(x_i, y, h)〉) − max_{h∈H} 〈w, Ψ(x_i, y_i, h)〉 ]

  – Each summand upper-bounds Δ(y_i, ŷ_i)
  – Difference of convex functions, solved with CCCP
  – Loss-augmented inference (LAI): max_{(y,h)∈Y×H} [Δ(y_i, y) + 〈w, Ψ(x_i, y, h)〉]
  – Challenge exacerbated in the latent case: search over the (Y×H) space
6/35
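As a sketch of the joint inference in Eq. (1), the argmax over (Y×H) can be run exhaustively when both spaces are small. The toy feature map and region scores below are illustrative assumptions, not the talk's actual instantiation:

```python
import numpy as np

def lssvm_predict(w, psi, Y, H):
    """Exhaustive joint inference: argmax over (y, h) of <w, Psi(x, y, h)>."""
    best_score, best_pair = -np.inf, None
    for y in Y:
        for h in H:
            score = float(np.dot(w, psi(y, h)))
            if score > best_score:
                best_score, best_pair = score, (y, h)
    return best_pair, best_score

# Toy instantiation (hypothetical): 2 classes, 3 candidate regions.
# psi reduces to a 1-D feature so that <w, psi> is just a region score.
region_scores = np.array([[0.2, 1.5, -0.3],   # class 0
                          [0.9, -0.1, 0.4]])  # class 1
w = np.array([1.0])
psi = lambda y, h: np.array([region_scores[y, h]])

(y_hat, h_hat), s = lssvm_predict(w, psi, Y=[0, 1], H=[0, 1, 2])
print(y_hat, h_hat, s)  # best (class, region) pair and its score
```

For image classification, H would be the set of candidate regions, so exhaustive search stays tractable per class; for structured outputs such as rankings, the slide on instantiations relies on exact and efficient solvers instead.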
MANTRA: Minimum Maximum Latent Structural SVM
Classifying only with the max-scoring latent value is not always relevant.

MANTRA model:
• Pair of latent variables (h⁺_{i,y}, h⁻_{i,y})
  – max-scoring latent value: h⁺_{i,y} = argmax_{h∈H} 〈w, Ψ(x_i, y, h)〉
  – min-scoring latent value: h⁻_{i,y} = argmin_{h∈H} 〈w, Ψ(x_i, y, h)〉
• New scoring function:

  D_w(x_i, y) = 〈w, Ψ(x_i, y, h⁺_{i,y})〉 + 〈w, Ψ(x_i, y, h⁻_{i,y})〉   (2)

• Prediction function ⇒ find the output with the maximum score:

  ŷ = argmax_{y∈Y} D_w(x_i, y)   (3)

• MANTRA: max+min, vs. max for LSSVM ⇒ negative evidence
7/35
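The max+min scoring and prediction of Eqs. (2)-(3) can be sketched in a few lines of numpy; the region scores below are toy values, not outputs of a trained model:

```python
import numpy as np

def mantra_score(region_scores):
    """Eq. (2): D_w(x, y) = max-region score + min-region score, per class.
    region_scores: (num_classes, num_regions) array of <w, Psi(x, y, h)>."""
    return region_scores.max(axis=1) + region_scores.min(axis=1)

def mantra_predict(region_scores):
    """Eq. (3): predict the class with the highest max+min score."""
    return int(np.argmax(mantra_score(region_scores)))

# Toy region scores (assumed values) for 3 classes over 4 regions.
scores = np.array([[1.5,  0.8,  0.9,  0.5],    # strong evidence, no counter-evidence
                   [1.2, -0.5,  0.1,  0.0],    # some evidence, some counter-evidence
                   [0.2, -1.7, -0.3, -0.1]])   # mostly negative evidence
print(mantra_score(scores))   # [2.0, 0.7, -1.5]
print(mantra_predict(scores)) # 0
```

Note how the min term penalizes a class whenever some region carries strong evidence of its absence, which a pure max would ignore.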
MANTRA: Model & Training Rationale
Intuition of the max+min prediction function
• x: image, h: image region, y: image class
• 〈w, Ψ(x, y, h)〉: score of region h for class y
• D_w(x, y) = 〈w, Ψ(x, y, h⁺_y)〉 + 〈w, Ψ(x, y, h⁻_y)〉
  – h⁺_y: presence of class y ⇒ large for y_i
  – h⁻_y: localized evidence of the absence of class y
  – Not too low for y_i ⇒ latent space regularization
  – Low for y ≠ y_i ⇒ tracking negative evidence [Parizi, ICLR15]

street image x: D_w(x, street) = 2, D_w(x, highway) = 0.7, D_w(x, coast) = −1.5
8/35
MANTRA: Model Training
Learning formulation
• Loss function: ℓ_w(x_i, y_i) = max_{y∈Y} [Δ(y_i, y) + D_w(x_i, y)] − D_w(x_i, y_i)
  – (Margin rescaling) upper bound of Δ(y_i, ŷ); equivalent constraints:

  ∀y ≠ y_i:  D_w(x_i, y_i)  ≥  Δ(y_i, y) + D_w(x_i, y)
             (ground-truth score ≥ margin + score for any other output)
• Non-convex optimization problem:

  min_w ½‖w‖² + (C/N) Σ_{i=1}^N ℓ_w(x_i, y_i)   (4)

• Solver: non-convex one-slack cutting plane [Do, JMLR12]
  – Fast convergence
  – Direct optimization, unlike CCCP for LSSVM
  – Still needs to solve LAI: max_y [Δ(y_i, y) + D_w(x_i, y)]
9/35
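A minimal sketch of the per-example loss ℓ_w summed in Eq. (4), assuming a 0/1 task loss and precomputed per-class scores D_w (the score values are toy assumptions):

```python
import numpy as np

def mantra_loss(D, y_gt, delta):
    """ell_w(x_i, y_i) = max_y [Delta(y_gt, y) + D(y)] - D(y_gt).
    D: per-class scores D_w(x_i, .); delta: task loss."""
    augmented = np.array([delta(y_gt, y) + D[y] for y in range(len(D))])
    return float(augmented.max() - D[y_gt])

zero_one = lambda a, b: float(a != b)  # 0/1 task loss

D = np.array([2.0, 0.7, -1.5])                 # assumed per-class scores
print(mantra_loss(D, y_gt=0, delta=zero_one))  # 0.0: margin constraints satisfied
print(mantra_loss(D, y_gt=1, delta=zero_one))  # 2.3: ground truth scored too low
```

The inner maximization is exactly the LAI problem mentioned above; here it is exhaustive, which matches the binary/multi-class instantiations of the next slide.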
MANTRA: Optimization
• MANTRA instantiation: define (x, y, h), Ψ(x, y, h), Δ(y_i, y)
• Instantiations: binary & multi-class classification, AP ranking

              Binary                 Multi-class                          AP Ranking
  x           bag (set of regions)   bag (set of regions)                 set of bags (of regions)
  y           ±1                     1, …, K                              ranking matrix
  h           instance (region)      region                               regions
  Ψ(x, y, h)  y · Φ(x, h)            [1(y=1)Φ(x, h), …, 1(y=K)Φ(x, h)]    joint latent ranking feature map
  Δ(y_i, y)   0/1 loss               0/1 loss                             AP loss
  LAI         exhaustive             exhaustive                           exact and efficient

• Solve inference max_y D_w(x_i, y) & LAI max_y [Δ(y_i, y) + D_w(x_i, y)]
  – Exhaustive for binary/multi-class classification
  – Exact and efficient solutions for ranking
10/35
WELDON: Weakly supErvised Learning of Deep cOnvolutional Nets

• MANTRA extension for training deep CNNs
• Learning Ψ(x, y, h): end-to-end learning of deep CNNs with structured prediction and latent variables
  – Incorporating multiple positive & negative evidence
  – Training deep CNNs with a structured loss
11/35
Standard deep CNN architecture: VGG16
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015
12/35
MANTRA adaptation for deep CNN
Problem
• Fixed-size image as input

Adapt the architecture to weakly supervised learning:
1. Fully connected layers → convolution layers
   – sliding window approach
2. Spatial aggregation
   – Perform object localization prediction
13/35
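Step 1 (fully connected → convolution) amounts to reshaping the FC weight matrix into convolution kernels, so the same classifier slides over inputs larger than the training size. A minimal numpy sketch with illustrative (non-VGG16) shapes:

```python
import numpy as np

def fc_to_conv(W_fc, in_ch, kh, kw):
    """Reinterpret FC weights of shape (out, in_ch*kh*kw) as convolution
    kernels of shape (out, in_ch, kh, kw). Applied convolutionally, the
    layer scores every kh x kw window of a larger feature map."""
    return W_fc.reshape(W_fc.shape[0], in_ch, kh, kw)

# Illustrative shapes: an "FC" layer trained on 2x2 maps with 3 channels.
W_fc = np.arange(4 * 3 * 2 * 2, dtype=float).reshape(4, 12)
W_conv = fc_to_conv(W_fc, in_ch=3, kh=2, kw=2)
print(W_conv.shape)  # (4, 3, 2, 2)
```

In VGG16 the corresponding conversion would turn fc6 into a 7×7 convolution over 512 channels; the output is then a spatial map of class scores rather than a single vector.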
WELDON: deep architecture
• C: number of classes
14/35
Aggregation function
[Oquab, 2015]
• Region aggregation = max
• Select the highest-scoring window

original image | motorbike feature map | max prediction

Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015
15/35
WELDON: region aggregation
Aggregation strategy:
• max+min pooling (MANTRA prediction function)
• k instances
  – from a single region to multiple high-scoring regions:

    max → (1/k) Σ_{i=1}^k (i-th max)      min → (1/k) Σ_{i=1}^k (i-th min)

  – more robust region selection [Vasconcelos, CVPR15]

max | max+min | 3max+3min
16/35
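The k-instance aggregation above can be sketched directly with a sort; the region scores below are toy values:

```python
import numpy as np

def topk_agg(scores, k=3):
    """Mean of the k highest plus mean of the k lowest region scores,
    per class. scores: (num_classes, num_regions)."""
    s = np.sort(scores, axis=1)
    return s[:, -k:].mean(axis=1) + s[:, :k].mean(axis=1)

scores = np.array([[0.9, 0.1, 0.7, -0.2, 0.5, 0.3]])  # toy scores, 1 class
print(topk_agg(scores, k=1))  # max+min pooling: 0.9 + (-0.2) = 0.7
print(topk_agg(scores, k=3))  # 3max+3min pooling
```

With k = 1 this reduces exactly to MANTRA's max+min; larger k averages several high- and low-scoring windows, making the selection less sensitive to a single noisy region.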
WELDON: architecture
17/35
WELDON: learning
• Objective function for a multi-class task and k = 1:

  min_w R(w) + (1/N) Σ_{i=1}^N ℓ(f_w(x_i), y_i^gt)

  f_w(x_i) = argmax_y [ max_h L^w_conv(x_i, y, h) + min_{h′} L^w_conv(x_i, y, h′) ]

How to learn the deep architecture?
• Stochastic gradient descent training.
• Back-propagation of the error through the selected windows.
18/35
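Back-propagation through the selected windows works like max-pooling's backward pass: the gradient reaches only the argmax and argmin regions, all other windows get zero. A minimal numpy sketch with toy scores and k = 1:

```python
import numpy as np

def select_forward(scores):
    """Forward pass: per class, score = max window + min window,
    remembering which windows were selected."""
    rows = np.arange(scores.shape[0])
    h_max, h_min = scores.argmax(axis=1), scores.argmin(axis=1)
    out = scores[rows, h_max] + scores[rows, h_min]
    return out, (h_max, h_min)

def select_backward(grad_out, cache, shape):
    """Backward pass: the error reaches only the selected windows."""
    h_max, h_min = cache
    grad = np.zeros(shape)
    rows = np.arange(shape[0])
    grad[rows, h_max] += grad_out
    grad[rows, h_min] += grad_out
    return grad

scores = np.array([[0.2, 1.5, -0.3]])  # toy window scores, 1 class
out, cache = select_forward(scores)
grad = select_backward(np.array([1.0]), cache, scores.shape)
print(out)   # [1.2]
print(grad)  # [[0. 1. 1.]] -- only the max and min windows get gradient
```

This routing is what the next two slides illustrate: when the class is present the selected windows' scores are pushed up, when it is absent they are pushed down.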
WELDON: learning
Class is present
• Increase the score of the selected windows.
Figure: Car map
19/35
WELDON: learning
Class is absent
• Decrease the score of the selected windows.
Figure: Boat map
20/35
Experiments
• VGG16 pre-trained on ImageNet
• Torch7 implementation

Datasets
• Object recognition: Pascal VOC 2007, Pascal VOC 2012
• Scene recognition: MIT67, 15 Scene
• Visual recognition where context plays an important role: COCO, Pascal VOC 2012 Action
VOC07/12 MIT67 15 Scene COCO VOC12 Action
21/35
Experiments
  Dataset        Train     Test      Classes   Classification
  VOC07          ~5,000    ~5,000    20        multi-label
  VOC12          ~5,700    ~5,800    20        multi-label
  15 Scene       1,500     2,985     15        multi-class
  MIT67          5,360     1,340     67        multi-class
  VOC12 Action   ~2,000    ~2,000    10        multi-label
  COCO           ~80,000   ~40,000   80        multi-label
22/35
Experiments
• Multi-scale: 8 scales (combined with the Object Bank strategy)
23/35
Object recognition
                            VOC 2007   VOC 2012
  VGG16 (online code) [1]   84.5       82.8
  SPP net [2]               82.4       –
  Deep WSL MIL [3]          –          81.8
  WELDON                    90.2       88.5
Table: mAP results on object recognition datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015
24/35
Scene recognition
                            15 Scene   MIT67
  VGG16 (online code) [1]   91.2       69.9
  MOP CNN [2]               –          68.9
  Negative parts [3]        –          77.1
  WELDON                    94.3       78.0
Table: Multi-class accuracy results on scene categorization datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Gong et al. Multi-scale Orderless Pooling of Deep Convolutional Activation Features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015
25/35
Context datasets
                              VOC 2012 Action   COCO
  VGG16 (online code) [1]     67.1              59.7
  Deep WSL MIL [2]            –                 62.8
  WELDON (our WSL deep CNN)   75.0              68.8
Table: mAP results on context datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Oquab et al. Is object localization for free? CVPR 2015
26/35
Visual results
Aeroplane model (1.8) Bus model (-0.4)
27/35
Visual results
Motorbike model (1.1) Sofa model (-0.8)
28/35
Visual results
Sofa model (1.2) Horse model (-0.6)
29/35
Visual results (failing examples)
Buffet Restaurant kitchen
30/35
Visual results (failing examples)
Kindergarden Classroom
31/35
Analysis
Impact of the different improvements:

  a) max   b) +k=3   c) +min   d) +AP    VOC07   VOC12 Action
  ✓        –         –         –         83.6    53.5
  ✓        ✓         –         –         86.3    62.6
  ✓        –         ✓         –         87.5    68.4
  ✓        ✓         ✓         –         88.4    71.7
  ✓        ✓         –         ✓         87.8    69.8
  ✓        ✓         ✓         ✓         88.9    72.6

• WSL detection results on VOC 2012 Action:

                            IoU
  max (a) [Oquab, 2015]     25.6
  WELDON                    30.4
32/35
Analysis
• Impact of the number of regions k
k=1 k=3
33/35
Connections to other latent variable models

• Hidden CRF (HCRF) [Quattoni, PAMI07]

  ½‖w‖² + (C/N) Σ_{i=1}^N [ log Σ_{(y,h)∈Y×H} exp〈w, Ψ(x_i, y, h)〉 − log Σ_{h∈H} exp〈w, Ψ(x_i, y_i, h)〉 ]

• Latent Structural SVM (LSSVM) [Yu, ICML09]

  ½‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h)∈Y×H} (Δ(y_i, y) + 〈w, Ψ(x_i, y, h)〉) − max_{h∈H} 〈w, Ψ(x_i, y_i, h)〉 ]

• Marginal Structural SVM (MSSVM) [Ping, ICML14]

  ½‖w‖² + (C/N) Σ_{i=1}^N [ max_y (Δ(y_i, y) + log Σ_{h∈H} exp〈w, Ψ(x_i, y, h)〉) − log Σ_{h∈H} exp〈w, Ψ(x_i, y_i, h)〉 ]

• WELDON

  ½‖w‖² + (C/N) Σ_{i=1}^N [ max_y (Δ(y_i, y) + Σ_{h∈Ω⊆H} 〈w, Ψ(x_i, y, h)〉) − Σ_{h∈Ω⊆H} 〈w, Ψ(x_i, y_i, h)〉 ]
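The four objectives differ mainly in how they aggregate region scores: a hard max (LSSVM), a log-sum-exp soft max (HCRF, MSSVM), or a sum over a selected subset Ω (WELDON). A toy comparison for one class, with assumed scores:

```python
import numpy as np

s = np.array([2.0, 0.5, -1.0, 0.3])  # toy region scores for one class

hard_max = s.max()                    # LSSVM: single best region
soft_max = np.log(np.exp(s).sum())    # HCRF / MSSVM: log-sum-exp over h
s_sorted = np.sort(s)
weldon = s_sorted[-1] + s_sorted[0]   # WELDON with Omega = {max, min} regions

print(hard_max, round(soft_max, 3), weldon)
```

The log-sum-exp lies above the hard max and lets every region contribute, while WELDON's subset sum is the only aggregation that explicitly subtracts negative evidence from the class score.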
34/35
Thibaut Durand Nicolas Thome Matthieu Cord
MLIA Team (Patrick Gallinari)
Sorbonne Universités - UPMC Paris 6 - LIP6

MANTRA project page: http://webia.lip6.fr/~durandt/project/mantra.html
Thibaut Durand, Nicolas Thome, and Matthieu Cord. MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking. In IEEE International Conference on Computer Vision (ICCV), 2015.

Thibaut Durand, Nicolas Thome, and Matthieu Cord. WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
35/35