Future directions in computer vision (gibis.unifesp.br/sibgrapi16/eproceedings/upload/is/2.pdf)
TRANSCRIPT
Future directions in computer vision
Larry Davis
Computer Vision Laboratory
University of Maryland
College Park MD USA
Presentation overview
Future Directions Workshop on Computer
Vision
Object detection using CNNs without object
proposals
Incorporating context into detection
Scale dependent pooling to detect small
object instances
Resolving referring expressions using context
Summary
Strategic Directions Workshop on
“Visual Commonsense,” Nov 12-13 in D.C.
• Sponsored by OSTP in the US
• Poggio, Malik, Zhu, Berg (Alex), Kohli, Hoiem,
Grauman, Zitnick, Gupta, Fox, Tellex, Oliva,
Scholl (absent), Domingos, Daumé.
• Organized by me, Fei-Fei Li and Devi Parikh
The computer vision landscape
• Breakthroughs in CV (and AI generally) would
clearly be “disruptive.” This has been known
“forever.”
• Our field has more than doubled in size in less
than a decade, and there are currently more than
175 startups in computer vision worldwide
according to Crunchbase.
• Feeding frenzy in self-driving cars
• So, has the field finally progressed to the point
where real vision problems can be solved?
So, what has changed?
Deep learning
SFM and stereo
Human pose estimation and tracking
Computing infrastructure
Big Data
Crowdsourcing
GPUs
Cloud computing and “free” storage
Open source software
Commercial indicators
Driving aids and autonomous driving – Mobileye
Face recognition under the hood at social media companies
Image search – Tineye, Clarifai
Google self-driving cars – 1.5M miles and counting
And what about the next 10?
So what do you think the future of the field is?
Here are some of the workshop recommendations.
Workshop recommendations
Develop the field of “social perception”
Understand the “internal state” of people as they interact with each other and with the world
Crucial for human robot interaction
Perceptual Robotics – and testbeds for measurement of progress in situated vision research.
Visual Search – intelligent sampling of the visual world
Acquisition and Representation of Visual Commonsense from Observation and Interaction
Vision and Language
Language and vision – how to test the ability to accumulate and integrate knowledge?
VQA Dataset
• Many useful challenges
– Where to look to answer a question?
– How to relate existing detectors, pose estimators, attribute classifiers, etc. to this task?
– How to combine general knowledge with vision?
Workshop
recommendations
Structured prediction
Relationship between parts, objects and
scenes
The hierarchical structure of human
behavior – movement, goals, actions, and
events
“Explainable” perception. Don’t just classify, explain your answer
Workshop recommendations
Deep learning
Why/when does it work?
Why are all local minima created equal?
Visual learning with minimal (no) supervision
Developmental learning (NEIL)
Are object proposals
necessarily the answer?
G-CNN – an iterative grid-based object detector
Mahyar Najibi and Mohammad Rastegari
CVPR 2016
Object detection
Localization – bounding box, segmentation
masks
Classification
In your camera – sliding window
detection
[Figure: sliding window → extracted boxes → multi-class classifier; example per-box scores – horse: 0.6, 0.0, 0.0, 0.5, 0.9; person: 0.3, 0.5, 0.8, 0.9, 0.0.]
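A minimal sketch of the sliding-window pipeline above; `classify` is a hypothetical stand-in for whatever multi-class classifier the camera runs on each extracted box:

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride):
    """Enumerate (x1, y1, x2, y2) for every window position that fits."""
    return [(x, y, x + win_w, y + win_h)
            for y in range(0, img_h - win_h + 1, stride)
            for x in range(0, img_w - win_w + 1, stride)]

def detect(img_h, img_w, classify, win_h, win_w, stride, threshold=0.5):
    """Score every window with a multi-class classifier; keep confident hits."""
    detections = []
    for box in sliding_windows(img_h, img_w, win_h, win_w, stride):
        for cls, score in classify(box).items():  # e.g. {"horse": 0.6, ...}
            if score >= threshold:
                detections.append((cls, score, box))
    return detections
```

Even this toy version makes the cost visible: the window count grows with image size and stride, and a full detector must additionally sweep scales and aspect ratios.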
Object proposals
Sliding windows are slow – scale, orientation, …
Object proposals are (learning-based) multi-segmentation algorithms that generate fewer regions for classification (typically boxes).
Consensus is that region proposals are crucial to state-of-the-art detection systems, whether they are given to the network or constructed by the network.
However, localization is poor, so class-dependent post-processing (e.g. a bounding-box regressor) is typically employed.
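Alongside the regressor, detection pipelines of this kind conventionally finish with non-maximum suppression (NMS); a minimal sketch (the 0.5 IoU threshold is a typical choice, not one stated on the slide):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```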
Object proposals and CNNs
R-CNN – push each proposal through the CNN; slow because the network is run multiple times.
SPP-Net [1] computes filter responses only once for each image and pools from them to form features for the proposals.
Fast R-CNN [2] builds on this and packs all stages of the system except the region proposal into one CNN.
Fast R-CNN
1. He, Kaiming, et al. “Spatial pyramid pooling in deep convolutional networks for visual recognition.” ECCV 2014. Springer, 2014, pp. 346–361.
2. Girshick, Ross. “Fast R-CNN.” ICCV 2015.
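The shared-computation idea behind SPP-Net and Fast R-CNN can be sketched as follows: compute the feature map once, then max-pool a fixed grid inside each proposal (a toy single-channel version; real systems pool many channels and use spatial pyramids or RoI grids):

```python
def roi_pool(feature_map, box, out_size=2):
    """RoI-pooling sketch: the conv feature map is computed once per image;
    each proposal is divided into a fixed out_size x out_size grid and
    max-pooled, so every proposal yields an equal-length feature vector
    regardless of its size."""
    x1, y1, x2, y2 = box          # proposal in feature-map coordinates
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Cell bounds; max() guards against empty cells for tiny boxes.
            r0 = y1 + i * h // out_size
            r1 = y1 + max((i + 1) * h // out_size, i * h // out_size + 1)
            c0 = x1 + j * w // out_size
            c1 = x1 + max((j + 1) * w // out_size, j * w // out_size + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```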
Region Proposal Stage
These methods use an external object proposal stage (e.g. selective search, with ~2K proposals/image).
In Fast R-CNN, computing object proposals is the bottleneck, taking around 2 sec/image.
Faster R-CNN [3] increases efficiency by reducing the number of proposed bounding boxes.
Jointly learns proposal generator and features
Fast and accurate
3. Ren, Shaoqing, et al. “Faster R-CNN: Towards real-time object detection with region proposal networks.” NIPS 2015.
G-CNN Training Network
Structure
G-CNN: Training
Training set for step 1
G-CNN: Training
Added samples for step 2
G-CNN Detection
Iteratively update the position of the initial bounding boxes with the
regressor corresponding to the class with the highest score.
G-CNN structure in detection
time
To reduce detection time, the G-CNN network is
divided into two parts:
• The global part is called only once for each image.
• The regression part is called S_test times, once per step.
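The detection-time loop can be sketched as below; `features_fn`, `classify_fn`, and `regress_fn` are hypothetical stand-ins for the global and regression parts of the trained network:

```python
def g_cnn_detect(image, features_fn, classify_fn, regress_fn, boxes, steps=3):
    """G-CNN-style iterative detection sketch.

    features_fn(image)            -> global conv features, computed once.
    classify_fn(feats, box)       -> per-class scores for the pooled box.
    regress_fn(feats, box, cls)   -> box updated by the class-specific regressor.
    """
    feats = features_fn(image)          # global part: run once per image
    for _ in range(steps):              # regression part: run S_test times
        new_boxes = []
        for box in boxes:
            scores = classify_fn(feats, box)
            cls = max(scores, key=scores.get)       # highest-scoring class
            new_boxes.append(regress_fn(feats, box, cls))
        boxes = new_boxes
    return boxes
```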
Experimental Setup
• Experiments are performed on VOC 2007 and VOC
2012 datasets.
• G-CNN is trained with S=3 steps over an initial grid
with three scales [2,5,10] and overlaps [0.9,0.8,0.7] at
each scale.
• At test time, a coarser grid with overlaps
[0.7, 0.5, 0.0] is used (around 180 initial boxes).
• After 5 iterations, G-CNN achieves the same mAP as Fast
R-CNN with around 2K bounding boxes.
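A sketch of how such an initial grid might be generated for one (scale, overlap) setting; the exact G-CNN grid construction may differ:

```python
def grid_boxes(img_w, img_h, scale, overlap):
    """Tile the image with boxes of size (img_w/scale, img_h/scale) whose
    neighbors overlap by the given fraction."""
    bw, bh = img_w / scale, img_h / scale
    step_x, step_y = bw * (1 - overlap), bh * (1 - overlap)
    boxes = []
    y = 0.0
    while y + bh <= img_h + 1e-6:       # epsilon guards float round-off
        x = 0.0
        while x + bw <= img_w + 1e-6:
            boxes.append((x, y, x + bw, y + bh))
            x += step_x
        y += step_y
    return boxes
```

With scale 2 and overlap 0, this tiles a non-overlapping 2x2 grid; raising the overlap densifies the grid, which is how a higher overlap at the same scale yields more initial boxes.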
VOC2012 using VGG16
How effective are the regressors?
IoU histogram of the best-overlapping boxes to ground-truth boxes at each iteration.
How can a neural network
learn and utilize context?
Mahyar Najibi, Mohammad Rastegari,
Abhinav Gupta, Ali Farhadi – Deep
Saccadic Detectors
Top choices of FRCNN are very accurate
Detection with GTs

| Method | Aeroplane | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow |
|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN SS | 66.4 | 71.6 | 53.8 | 43.3 | 24.7 | 69.2 | 69.7 | 71.5 | 31.1 | 63.4 |
| FRCNN SS+GT | 68.2 | 74.1 | 56 | 50.6 | 31.5 | 72.6 | 72.8 | 73.1 | 34.8 | 63.8 |
| FRCNN GT | 83 | 84.1 | 78.7 | 81.5 | 73.7 | 85.5 | 88 | 83.5 | 69.9 | 75.4 |

| Method | Dining table | Dog | Horse | Motorbike | Person | Potted plant | Sheep | Sofa | Train | TV monitor | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN SS | 59.8 | 62.2 | 73.1 | 65.9 | 57 | 26 | 52 | 56.4 | 67.8 | 57.7 | 57.1 |
| FRCNN SS+GT | 61.1 | 63.5 | 76.8 | 68.5 | 63.7 | 29.7 | 54.4 | 57.8 | 70.5 | 61 | 60.2 |
| FRCNN GT | 80.2 | 78.1 | 81.9 | 85.1 | 87.7 | 83.2 | 71.7 | 78.5 | 88.8 | 88.5 | 81.3 |

Methods are trained on VOC2007 trainval; AlexNet is the CNN backbone.
FRCNN SS: Fast R-CNN using selective search proposals.
FRCNN SS+GT: GT boxes are added to the SS boxes.
FRCNN GT: only GT boxes are used.
Sequential detection
This suggests a simple strategy for detection:
Commit to the most confident detection.
Use it as context for determining the next most confident detection.
And so on.
All integrated into one CNN architecture
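The strategy above can be sketched as a greedy loop; `score_fn` is a hypothetical stand-in for the network's context-conditioned scoring:

```python
def sequential_detect(candidates, score_fn, max_dets=5):
    """Greedy sequential detection sketch: commit to the most confident
    detection, then rescore the remaining candidates conditioned on
    everything already committed."""
    committed = []
    remaining = list(candidates)
    while remaining and len(committed) < max_dets:
        # Rescore every remaining candidate given the committed context.
        scored = [(score_fn(c, committed), c) for c in remaining]
        best_score, best = max(scored, key=lambda t: t[0])
        committed.append(best)
        remaining.remove(best)
    return committed
```

The point of the loop is that each committed detection can raise (or lower) the score of related candidates, e.g. a confident horse making a rider more plausible.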
Deep Sequential Detection
[Network diagram: input → convolutional layers → ROI pooling (with ROI info) → Linear (fc6) → ReLU → Linear (fc7) → ReLU → Regressor; an Active Selector and a Hidden State Selector route features through Linear (h1) → ReLU → Linear (h2) → ReLU → Classifier; concatenation yields the class-based classification output, followed by MAX and NMS over the active/hidden select inputs.]
Datasets
• Pascal VOC2007
  • 20 classes, ~10K images
• Pascal VOC2012
  • 20 classes, ~15K images
• MSCOCO (2015)
  • 80 classes, ~300K images
VOC 2012
MSCOCO
Precision and Recall
Methods are trained on the train-set and evaluated on the validation-set.
Top 2K selective search proposals are used for the methods.
Class-based Relative Improvement
Scale-dependent pooling – Fan Yang (CVPR 2016)
Goal – detect (even small) objects effectively and efficiently using CNNs + object proposals.
Challenges: scale variance; huge number of proposals.
Scale-dependent pooling
Pool proposals of different scales from different
conv layers: n-branch structure
Small instances of objects are well represented using
features pooled from lower conv layers
Scale-dependent pooling
Divide proposals into groups based on their size.
Pool small proposals at lower conv layers and larger ones at higher conv layers.
Train the entire system end-to-end.
[Figure: small proposals pool from a lower conv layer; large proposals pool from a higher conv layer.]
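The grouping step can be sketched as routing each proposal to a branch by its size; the cutoff values here are illustrative, not those of the paper:

```python
def route_proposals(proposals, branch_cutoffs):
    """Assign each (x1, y1, x2, y2) proposal to a conv-layer branch by its
    height: smaller boxes go to branches that pool from lower conv layers.
    branch_cutoffs: ascending upper bounds on box height, one per branch."""
    groups = [[] for _ in range(len(branch_cutoffs) + 1)]
    for (x1, y1, x2, y2) in proposals:
        h = y2 - y1
        for b, cutoff in enumerate(branch_cutoffs):
            if h <= cutoff:
                groups[b].append((x1, y1, x2, y2))
                break
        else:
            groups[-1].append((x1, y1, x2, y2))   # taller than every cutoff
    return groups
```

Routing by size means a small object's features come from a high-resolution layer where it still occupies several cells, instead of being crushed to under a cell at the top of the network.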
Experiments
KITTI (mAP)
Inner-city (mAP)
Detection as a function of size – KITTI

| Method | Input | Car S1 | S2 | S3 | S4 | S | Ped. S1 | S2 | S3 | S4 | S | Cyc. S1 | S2 | S3 | S4 | S | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FRCNN+AlexNet | 4 | 52.8 | 60.7 | 75.8 | 55.5 | 61.6 | 19.7 | 47.5 | 88.4 | 24.1 | 61.4 | 42 | 51.6 | 44.9 | 0 | 46.5 | 56.5 |
| FRCNN+VGG16 | 1 (400) | 33.9 | 68.3 | 82.8 | 68.8 | 57.3 | 7.9 | 50.4 | 95.3 | 55.8 | 64.6 | 19 | 63.8 | 66.6 | 0 | 42.3 | 54.7 |
| FRCNN+VGG16 | 1 (500) | 42.2 | 70 | 85.1 | 65.9 | 62.3 | 12.6 | 55.9 | 94.6 | 44.9 | 66.8 | 29.1 | 63.8 | 68.7 | 0 | 48.8 | 59.3 |
| FRCNN+VGG16 | 1 (800) | 47.6 | 70 | 84.8 | 60.5 | 64.5 | 14.7 | 54.5 | 94.5 | 47.2 | 66.4 | 34.9 | 61.2 | 67.4 | 0 | 50.4 | 60.4 |
| FRCNN+VGG16 | 2 | 47.4 | 70.2 | 83.1 | 54.5 | 64.1 | 14.9 | 55.2 | 94.5 | 63.1 | 66.5 | 35.8 | 61.2 | 65.9 | 0 | 50.4 | 60.3 |
| SDP | 1 (400) | 59.1 | 73.8 | 84.7 | 73.6 | 70.7 | 12.6 | 54.8 | 94.9 | 70.7 | 65.7 | 29.3 | 65.6 | 71.7 | 0 | 49.4 | 61.9 |
| SDP | 1 (500) | 64.2 | 74.4 | 86 | 68.4 | 73.7 | 17.3 | 58.4 | 94.9 | 44.8 | 66.9 | 37.5 | 67.3 | 68.6 | 0 | 54.6 | 65.1 |
| SDP | 1 (800) | 65.2 | 73.5 | 86 | 61 | 73.8 | 16.9 | 57.1 | 94.3 | 44.1 | 65.5 | 36.5 | 61.5 | 61.9 | 0 | 49.9 | 63.1 |
| SDP+CRC | 1 (500) | 63.9 | 74.3 | 85.8 | 68.2 | 73.5 | 17.5 | 52 | 93.7 | 45.9 | 65.5 | 35.1 | 65.7 | 69.2 | 0 | 52.9 | 64 |
| SDP+CRC ft | 1 (500) | 63.9 | 74.2 | 85.5 | 62.9 | 73.7 | 17.6 | 50 | 93.4 | 61 | 65.9 | 35.8 | 66.5 | 67.6 | 0 | 53.1 | 64.2 |
Modeling Context between Objects for
Understanding Referring Expressions
Varun Nagaraja, Vlad Morariu, Larry
Davis
ECCV 2016
Man sitting on the left holding a game controller
Woman in the middle sitting on the bed
Man wearing a red jacket and blue jeans sitting on the right
Descriptions that identify a particular object instance
Referring Expressions
Referring expressions rely on attributes and context
Blonde fluffy dog
Tan colored sofa
Giraffe bending down
Person riding a blue motorcycle
Plant on the right side of the TV
Problem Formulation
Input: image I and the sentence “Girl wearing a red jacket”
Output: the region the sentence refers to
Solution Framework
Generation and Comprehension of Unambiguous Object Descriptions
J. Mao et al., CVPR 2016
Hypothesize a set of region candidates
Solution Framework
Pick the region candidate with the highest probability
of generating the query referring expression
Baseline Method
[Figure: an LSTM unrolled over “&lt;BOS&gt; Girl wearing a red jacket &lt;EOS&gt;”; each LSTM unit receives the current word's embedding together with region CNN features, image CNN features, and bounding-box features, and predicts the next word.]
Modeling referring expression probability using an LSTM
Max-margin Method
The baseline method can be improved by training the
model to have lower probability for negative regions
Referred region Negative regions
Girl wearing a red jacket
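One common way to realize this is a hinge on the log-probability gap between the referred region and each negative region; a sketch (the exact loss in Mao et al. may differ, and the margin value is illustrative):

```python
def max_margin_loss(log_p_pos, log_p_negs, margin=1.0):
    """Max-margin refexp loss sketch: the referred region's log-probability
    of generating the expression should exceed every negative region's
    log-probability by at least `margin`."""
    loss = -log_p_pos                     # maximum-likelihood term
    for log_p_neg in log_p_negs:
        # Hinge: penalize only when the negative is within the margin.
        loss += max(0.0, margin - (log_p_pos - log_p_neg))
    return loss
```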
Modeling Context
The plant on the right side of the TV
Previous methods do not model locations of contextual
objects
Modeling Context
LSTM
Word Embedding
Region CNN features
Image features
Baseline and Max-margin architecture
Region BBox
Modeling Context
Context model architecture
LSTM
Word Embedding
Region CNN features
Context region features
Region BBox
Context region BBox
Modeling Context
[Figure: the LSTM scores the expression for pairs of regions – Region1 paired in turn with Region2, Region3, and Region4 – using each region's CNN features and bounding box along with the word embedding.]
Modeling Context
Pooling context from multiple pairs of regions
Modeling Context
We can also use noisy-or pooling, which is more robust.
[Figure: noisy-or pooling applied over the per-pair probabilities.]
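Noisy-or pooling combines the per-pair probabilities so that a single well-explained pair is enough to give the region a high score:

```python
def noisy_or(probs):
    """Noisy-or pooling: probability that at least one context pair explains
    the expression, assuming the pairs contribute independently."""
    result = 1.0
    for p in probs:
        result *= (1.0 - p)   # probability that no pair explains it
    return 1.0 - result
```

Unlike max pooling, noisy-or accumulates evidence: two pairs each scoring 0.5 combine to 0.75 rather than 0.5.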
Training the Context Model
The challenge is that there are no annotations available for
context objects
The plant on the right side of the TV
Multiple Instance Learning
So we use a MIL-based technique, with the annotation
of the referred object as weak supervision.
The plant on the right side of the TV
Experiments
Implemented in Caffe
Region and Image features
• VGG16 fc8 layer - fine-tuned.
Bounding box features
• scaled <xmin, ymin, xmax, ymax, area>
Word embedding size – 1024
LSTM hidden dimension – 1024
Region candidates – MCG technique
Region filtering process
• Obtain scores from Fast-RCNN and select regions above a
threshold
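The scaled bounding-box features might be computed as below; normalizing by the image width/height and area is an assumption about the exact scaling used:

```python
def bbox_features(x1, y1, x2, y2, img_w, img_h):
    """Bounding-box feature sketch: corners and area scaled to [0, 1] by the
    image dimensions, yielding <xmin, ymin, xmax, ymax, area>."""
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
            (x2 - x1) * (y2 - y1) / (img_w * img_h)]
```

Scaling makes the geometry features comparable across images of different sizes before they are concatenated with the CNN features.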
Google RefExp Results

| Method \ Proposals | GT | MCG |
|---|---|---|
| Max Likelihood [Mao et al.] | 57.5 | 42.4 |
| Max margin [Mao et al.] | 65.7 | 47.8 |
| Ours, Neg. Bag margin | 68.4 | 49.5 |
| Ours, Pos. & Neg. Bag margin | 68.4 | 50.0 |

All results are from noisy-or pooling.
A detection is considered a true positive if the IoU score is greater than 0.5.
Google RefExp validation partition.
Google RefExp Results
The chair closest to the lady
Groundtruth Image context only Noisy-or pooling
A white truck in front of a yellow truck
UNC RefExp Results

| Method \ Proposals | GT | MCG |
|---|---|---|
| Max Likelihood [Mao et al.] | 70.6 | 50.0 |
| Max margin [Mao et al.] | 76.3 | 55.1 |
| Ours, Neg. Bag margin | 78.0 | 56.4 |
| Ours, Pos. & Neg. Bag margin | 76.1 | 56.3 |

TestB Partition (Object centric)
UNC RefExp Results
Groundtruth Image context only Noisy-or pooling
Elephant towards the back
Food on the far back on the plate
TestB Partition (Object centric)
A few closing observations
Success depends on the region proposal algorithm including candidates for the correct referred and context objects.
This is much more demanding than just requiring a candidate for the referred object.
It is ameliorated somewhat by having the entire image as a candidate context object.
The straightforward extension to include additional context objects (language can be deeply nested) is intractable.
(Methodological) We would like to evaluate performance restricted to “relevant” referring expressions, but it is difficult to specify correct criteria for selection.
Summary
The intellectual landscape of computer vision has changed dramatically over the past decade.
Many of the “future research directions” identified by the workshop are already well underway.
And there are still huge performance shortfalls on basic problems like detection and recognition (compare MSCOCO vs. VOC).
My favorite future research directions:
Context – sooner or later it has to make a difference
Visual search
Tasking visual surveillance systems – compositional models and video analysis (structured prediction)