grounding of textual phrases in images by reconstruction ...lsigal/532l/pp_groundingof... · rec...

29
Introduction Methodology Discussions Grounding of Textual Phrases in Images by Reconstruction (ECCV, 2016) Anna Rohrbach 1 Marcus Rohrbach 2,3 Ronghang Hu 2 Trevor Darrell 2 Bernt Schiele 1 1 Max Planck Institute for Informatics, Saarbr¨ ucken, Germany 2 UC Berkeley EECS, CA, United States 3 ICSI, Berkeley, CA, United States Presenters: Jiaxuan Chen, Meng Li March 1, 2018 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt Schiele Grounding of Textual Phrases in Images by Reconstruction

Upload: others

Post on 09-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

Grounding of Textual Phrases in Images byReconstruction

(ECCV, 2016)

Anna Rohrbach1 Marcus Rohrbach2,3 Ronghang Hu2

Trevor Darrell2 Bernt Schiele1

1Max Planck Institute for Informatics, Saarbrucken, Germany

2UC Berkeley EECS, CA, United States

3ICSI, Berkeley, CA, United States

Presenters: Jiaxuan Chen, Meng LiMarch 1, 2018

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 2: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 3: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

Motivation & Problem

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 4: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

Motivation & Problem

Motivation

I Language grounding (i.e.localizing) in visual data is achallenging problem.

I Prior studies focus on constrained settings with a smallnumber of nouns to ground.

I Few datasets provide the ground truth localization of phrases

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 5: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

Motivation & Problem

Problem of this paper

I Problem:Given images paired with natural language phrases, groundingarbitrary textual phrases in images, with no, a few, or allbounding box annotations available.

I Relation to other works:This paper uses CNN, LSTM, soft-attention mechanism and abi-directional mapping (Phrase ⇒ Attended Image Region ⇒Reconstructed Phrase).

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 6: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 7: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Training time

1. Given a phrase, attend to the mostrelevant region (or potentially alsomultiple regions) based on thephrase

2. Reconstruct the same phrase pfrom region(s) it attended to in thefirst phase

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 8: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Test time

1. Given a phrase, attend to the mostrelevant region based on the phrase

2. Calculate the IOU (intersectionover union) value between theselected box and the ground truthbox.

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 9: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Model Overview

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 10: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to groundI Generate a set of image region

proposals {ri}i=1,...,N

I Encode phrase and proposal boxesh = fLSTM(p), υi = fCNN(ri )

I Apply Attention

1. Attention weight:αi = fATT (p, ri ) =W2 ∗Relu(Whh+Wυυi +b1) +b2

2. Normalized attention weightαi = softmax(αi )

3. E.g. α1 = 0.2, α2 = 0.7, α3 = 0.1

I Supervised training lossLATT = − 1

B

∑Bb=1 log(αj), where

rj is the correct proposal box

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 11: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to groundI Generate a set of image region

proposals {ri}i=1,...,N

I Encode phrase and proposal boxesh = fLSTM(p), υi = fCNN(ri )

I Apply Attention

1. Attention weight:αi = fATT (p, ri ) =W2 ∗Relu(Whh+Wυυi +b1) +b2

2. Normalized attention weightαi = softmax(αi )

3. E.g. α1 = 0.2, α2 = 0.7, α3 = 0.1

I Supervised training lossLATT = − 1

B

∑Bb=1 log(αj), where

rj is the correct proposal box

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 12: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to groundI Generate a set of image region

proposals {ri}i=1,...,N

I Encode phrase and proposal boxesh = fLSTM(p), υi = fCNN(ri )

I Apply Attention

1. Attention weight:αi = fATT (p, ri ) =W2 ∗Relu(Whh+Wυυi +b1) +b2

2. Normalized attention weightαi = softmax(αi )

3. E.g. α1 = 0.2, α2 = 0.7, α3 = 0.1

I Supervised training lossLATT = − 1

B

∑Bb=1 log(αj), where

rj is the correct proposal box

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 13: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to reconstruct

I Encode visual features

1. υATT =∑N

i=1 αiυi= 0.2v1 + 0.7v2 + 0.1v3

2. υ′ATT = fREC (υATT ) =Relu(WaυATT + ba)

I Generated distribution over pP(p|υ′ATT ) = fLSTM(υ′ATT )

I Unsupervised training lossLREC = − 1

B

∑Bb=1 log(P(p|υ′ATT )),

where p is the ground-truth phrase

I Semi-supervised training lossL = λLATT + LREC

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 14: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to reconstruct

I Encode visual features

1. υATT =∑N

i=1 αiυi= 0.2v1 + 0.7v2 + 0.1v3

2. υ′ATT = fREC (υATT ) =Relu(WaυATT + ba)

I Generated distribution over pP(p|υ′ATT ) = fLSTM(υ′ATT )

I Unsupervised training lossLREC = − 1

B

∑Bb=1 log(P(p|υ′ATT )),

where p is the ground-truth phrase

I Semi-supervised training lossL = λLATT + LREC

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 15: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to reconstruct

I Encode visual features

1. υATT =∑N

i=1 αiυi= 0.2v1 + 0.7v2 + 0.1v3

2. υ′ATT = fREC (υATT ) =Relu(WaυATT + ba)

I Generated distribution over pP(p|υ′ATT ) = fLSTM(υ′ATT )

I Unsupervised training lossLREC = − 1

B

∑Bb=1 log(P(p|υ′ATT )),

where p is the ground-truth phrase

I Semi-supervised training lossL = λLATT + LREC

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 16: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Learning to reconstruct

I Encode visual features

1. υATT =∑N

i=1 αiυi= 0.2v1 + 0.7v2 + 0.1v3

2. υ′ATT = fREC (υATT ) =Relu(WaυATT + ba)

I Generated distribution over pP(p|υ′ATT ) = fLSTM(υ′ATT )

I Unsupervised training lossLREC = − 1

B

∑Bb=1 log(P(p|υ′ATT )),

where p is the ground-truth phrase

I Semi-supervised training lossL = λLATT + LREC

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Jiaxuan Chen
Page 17: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 18: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Experimental Setup I

Datasets: Flickr 30k Entities and ReferItGame

I Flickr 30k Entities: over 275K bounding boxes from 31Kimages with natural language phrases

I ReferItGame: over 99K regions from 20K images with naturallanguage expressions

Generate 100 bounding box proposals for each image using

I Selective Search for Flickr 30k Entities

I Edge Boxes for ReferItGame

For semi- and fully supervised models, the ground-truth attentionis obtained by selecting the proposal box that overlaps most withthe ground-truth box.

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 19: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Experimental Setup II

Flickr 30k Entities

I VGG-CLS: VGG16 network trained on ImageNet

I VGG-DET: VGG16 network fine-tuned for object detection onPASCAL, trained using Fast R-CNN

ReferItGame

I VGG+SPAT: VGG-CLS features and additional spatialfeatures provided by Hu et al (2016).

Test time evaluation: the ratio of phrases for which the attendedbox overlaps with the ground-truth box by more than 0.5 IOU.

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 20: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 21: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Experiments on Flickr 30k Entities dataset

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 22: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Experiments on ReferItGame dataset

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 23: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Qualitative results on Flickr 30k Entities

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 24: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

GroundeRExperimental SetupResults

Qualitative results on ReferItGame

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 25: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

CommentsPotential Extensions

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 26: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

CommentsPotential Extensions

Comments

Contribution:

I Applicability to un-, semi-, and supervised training regimes

Limitations:

I Ignore the object relationships

I Select a single proposal box that overlaps the most withground-truth box but not the union

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 27: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

CommentsPotential Extensions

Outline

IntroductionMotivation & Problem

MethodologyGroundeRExperimental SetupResults

DiscussionsCommentsPotential Extensions

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 28: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

CommentsPotential Extensions

Potential Extensions

I Fine tune VGG

I Apply attention on input phrases

I Different language model

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction

Page 29: Grounding of Textual Phrases in Images by Reconstruction ...lsigal/532L/PP_GroundingOf... · REC Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding

IntroductionMethodologyDiscussions

CommentsPotential Extensions

Thanks for your attention!

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu , Trevor Darrell, Bernt SchieleGrounding of Textual Phrases in Images by Reconstruction