

Resolving Vision and Language Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes

Gordon Christie1,∗, Ankit Laddha2,∗, Aishwarya Agrawal1, Stanislaw Antol1, Yash Goyal1, Kevin Kochersberger1, Dhruv Batra1

1Virginia Tech    2Carnegie Mellon University
{gordonac,aish,santol,ygoyal,kbk,dbatra}@vt.edu

[email protected]

Abstract

We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language simply cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence "I shot an elephant in my pajamas", looking at language alone (and not reasoning about common sense), it is unclear if it is the person or the elephant that is wearing the pajamas or both. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution that are then jointly re-ranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are also shown to be crucial to improved multiple-module reasoning. Our vision and language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) in one experiment, and by 12.83% (25.28% relative) in another. We also make small improvements over a vision system (Chen et al., 2015).

1 Introduction

Perception and intelligence problems are hard.

* Denotes equal contribution

[Figure 1 graphic: an image-caption pair from the PASCAL Sentence Dataset ("A dog is standing next to a woman on a couch"), the vision module's M diverse semantic segmentation hypotheses (labels such as person, dog, couch), the NLP module's M parse-tree hypotheses, and the consistent pair selected across modules. The ambiguity shown is "(dog next to woman) on couch" vs "dog next to (woman on couch)".]

Figure 1: Overview of our approach. We propose a model for simultaneous 2D semantic segmentation and prepositional phrase attachment resolution by reasoning about sentence parses. The language and vision modules each produce M diverse hypotheses, and the goal is to select a pair of consistent hypotheses. In this example the ambiguity to be resolved from the image caption is whether the dog is standing on or next to the couch. Both modules benefit by selecting a pair of compatible hypotheses.

Whether we are interested in understanding an image or a sentence, our algorithms must operate under tremendous levels of ambiguity. When a human reads the sentence "I eat sushi with tuna", it is clear that the prepositional phrase "with tuna" modifies "sushi" and not the act of eating, but this may be ambiguous to a machine. This problem of determining whether a prepositional phrase ("with tuna") modifies a noun phrase ("sushi") or verb phrase ("eating") is formally known as Prepositional Phrase Attachment Resolution (PPAR) (Ratnaparkhi et al.,



1994). Consider the captioned scene shown in Figure 1. The caption "A dog is standing next to a woman on a couch" exhibits a PP attachment ambiguity – "(dog next to woman) on couch" vs "dog next to (woman on couch)". It is clear that having access to image segmentations can help resolve this ambiguity, and having access to the correct PP attachment can help image segmentation.

There are two main roadblocks that keep us from writing a single unified model (say a graphical model) to perform both tasks: (1) Inaccurate Models – empirical studies (Meltzer et al., 2005, Szeliski et al., 2008, Kappes et al., 2013) have repeatedly found that models are often inaccurate and miscalibrated – their "most-likely" beliefs are placed on solutions far from the ground-truth. (2) Search Space Explosion – jointly reasoning about multiple modalities is difficult due to the combinatorial explosion of the search space ({exponentially-many segmentations} × {exponentially-many sentence-parses}).

Proposed Approach and Contributions. In this paper, we address the problem of simultaneous object segmentation (also called semantic segmentation) and PPAR in captioned scenes. To the best of our knowledge this is the first paper to do so.

Our main thesis is that a set of diverse plausible hypotheses can serve as a concise interpretable summary of uncertainty in vision and language 'modules' (What does the semantic segmentation module see in the world? What does the PPAR module describe?) and form the basis for tractable joint reasoning (How do we reconcile what the semantic segmentation module sees in the world with how the PPAR module describes it?).

Given our two modules with M hypotheses each, how can we integrate beliefs across the segmentation and sentence parse modules to pick the best pair of hypotheses? Our key focus is consistency – correct hypotheses from different modules will be correct in a consistent way, but incorrect hypotheses will be incorrect in incompatible ways. Specifically, we develop a MEDIATOR model that scores pairs for consistency and searches over all M^2 pairs to pick the highest-scoring one. We demonstrate our approach on three datasets – ABSTRACT-50S (Vedantam et al., 2014), PASCAL-50S, and PASCAL-Context-50S (Mottaghi et al., 2014). We show that our vision+language approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 20.66% (36.42% relative) for ABSTRACT-50S, 17.91% (28.69% relative) for PASCAL-50S, and by 12.83% (25.28% relative) for PASCAL-Context-50S. We also make small but consistent improvements over DeepLab-CRF (Chen et al., 2015).

2 Related Work

Most works at the intersection of vision and NLP tend to be 'pipeline' systems, where vision tasks take 1-best inputs from NLP (e.g., sentence parsings) without trying to improve NLP performance and vice-versa. For instance, Fidler et al. (2013) use prepositions to improve object segmentation and scene classification, but only consider the most-likely parse of the sentence and do not resolve ambiguities in text. Analogously, Yatskar et al. (2014) investigate the role of object, attribute, and action classification annotations for generating human-like descriptions. While they achieve impressive results at generating descriptions, they assume perfect vision modules to generate sentences. Our work uses current (still imperfect) vision and NLP modules to reason about images and provided captions, and simultaneously improve both vision and language modules. Perhaps the work closest to ours is due to Kong et al. (2014), who leverage information from an RGBD image and its sentential description to improve 3D semantic parsing and resolve ambiguities related to co-reference resolution in the sentences (e.g., what "it" refers to). We focus on a different kind of ambiguity – Prepositional Phrase (PP) attachment resolution. In the classification of parsing ambiguities, co-reference resolution is considered a discourse ambiguity (Poesio and Artstein, 2005) (arising out of two different words across sentences for the same object), while PP attachment is considered a syntactic ambiguity (arising out of multiple valid sentence structures) and is typically considered much more difficult to resolve (Bach; Davis).

A number of recent works have studied problems at the intersection of vision and language, such as Visual Question Answering (Antol et al., 2015, Geman et al., 2014, Malinowski et al., 2015), Visual Madlibs (Yu et al., 2015), and image captioning (Vinyals et al., 2015, Fang et al., 2015).


Our work falls in this domain, with the key difference that we produce both vision and NLP outputs. Our work also has similarities with works on 'spatial relation learning' (Malinowski and Fritz, 2014a, Lan et al., 2012a), i.e. learning a visual representation for noun-preposition-noun triplets ("car on road"). While our approach can certainly utilize such spatial relation classifiers if available, the focus of our work is different. Our goal is to improve semantic segmentation and PPAR by jointly reranking segmentation-parsing solution pairs. Our approach implicitly learns spatial relationships for prepositions ("on", "above") but these are simply emergent latent representations that help our reranker pick out the most consistent pair of solutions.

3 Approach

In order to emphasize the generality of our approach, and to show that our approach is compatible with a wide class of implementations of the semantic segmentation and PPAR modules, we present our approach with the modules abstracted as "black boxes" that satisfy a few general requirements and minimal assumptions. In Section 4, we describe each of the modules in detail, making concrete their respective features and other details.

3.1 What is a Module?

The goal of a module is to take input variables x ∈ X (images or sentences), and predict output variables y ∈ Y (semantic segmentation) and z ∈ Z (prepositional attachment expressed in a sentence parse). The two requirements on a module are that it needs to be able to produce scores S(y|x) for potential solutions and a list of plausible hypotheses Y = {y1, y2, . . . , yM}.

Multiple Hypotheses. In order to be useful, the set Y of hypotheses must provide an accurate summary of the score landscape. Thus, the hypotheses should be plausible (i.e., high-scoring) and mutually non-redundant (i.e., diverse). Our approach (described next) is applicable to any choice of diverse hypothesis generators. In our experiments, we use the k-best algorithm of (Huang and Chiang, 2005) for the sentence parsing module and the DivMBest algorithm (Batra et al., 2012) for the semantic segmentation module. Once we instantiate the modules in the Experiments section, we describe the diverse solution generation in more detail.
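To make this abstraction concrete, here is a minimal sketch (ours, not the authors' released code) of the interface a module is assumed to expose: a score for each candidate solution and a list of M plausible, mutually diverse hypotheses. The class and attribute names are illustrative.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")  # solution type: a pixel-wise label map or a sentence parse


@dataclass
class Hypothesis(Generic[T]):
    solution: T    # e.g., a segmentation map or a parse tree with dependencies
    score: float   # the module's own score/likelihood S(.) for this solution


class Module(Generic[T]):
    """Abstract 'black box' module: all it must provide is M scored,
    mutually diverse hypotheses for a given input (image or sentence),
    e.g., via DivMBest for segmentation or k-best parsing for PPAR."""

    def hypotheses(self, x, M: int) -> List[Hypothesis[T]]:
        raise NotImplementedError
```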

3.2 Joint Reasoning Across Multiple Modules

We now show how to integrate information from both the segmentation and PPAR modules. Recall that our key focus is consistency – correct hypotheses from different modules will be correct in a consistent way, but incorrect hypotheses will be incorrect in incompatible ways. Thus, our goal is to search for a pair (semantic segmentation, sentence parsing) that is mutually consistent.

Let Y = {y1, . . . , yM} denote the M semantic segmentation hypotheses and Z = {z1, . . . , zM} denote the M PPAR hypotheses.

MEDIATOR Model. We develop a "mediator" model that identifies high-scoring hypotheses across modules in agreement with each other. Concretely, we can express the MEDIATOR model as a factor graph where each node corresponds to a module (semantic segmentation and PPAR). Working with such a factor graph is typically completely intractable because each node y, z has exponentially-many states (image segmentations, sentence parsings). As illustrated in Figure 2, in this factor-graph view, the hypothesis sets Y, Z can be considered 'delta-approximations' for reducing the size of the output spaces.

Unary factors S(·) capture the score/likelihood of each hypothesis provided by the corresponding module for the image/sentence at hand. Pairwise factors C(·, ·) represent consistency factors. Importantly, since we have restricted each module's variables to just M states, we are free to capture arbitrary domain-specific high-order relationships for consistency, without any optimization concerns. In fact, as we describe in our experiments, these consistency factors may be designed to exploit domain knowledge in fairly sophisticated ways.

Consistency Inference. We perform exhaustive inference over all possible tuples:

$$\operatorname*{argmax}_{i,j \in \{1,\ldots,M\}} \Big\{ M(\mathbf{y}_i, \mathbf{z}_j) = S(\mathbf{y}_i) + S(\mathbf{z}_j) + C(\mathbf{y}_i, \mathbf{z}_j) \Big\}. \tag{1}$$

Notice that the search space with M hypotheses each is M^2. In our experiments, we allow each module to take a different value for M, and typically use around 10 solutions for each module.


[Figure 2 graphic: a factor graph with one node per module (semantic segmentation y and sentence parsing z), unary score factors S(yi) and S(zj), a pairwise consistency factor C(yi, zj), and a 'delta approximation' of each module's score landscape.]

Figure 2: Illustrative inter-module factor graph. Each node takes exponentially-many or infinitely-many states and we use a 'delta approximation' to limit support.

This leads to a mere 100 pairs, which is easily enumerable. We found that even with such a small set, at least one of the solutions in the set tends to be highly accurate, meaning that the hypothesis sets have relatively high recall. This shows the power of using a small set of diverse hypotheses. For a large M, we can exploit a number of standard ideas from the graphical models literature (e.g. dual decomposition or belief propagation). In fact, this is one reason we show the factor graph in Figure 2; there is a natural decomposition of the problem into modules.
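A minimal sketch of this exhaustive inference under the interface assumed earlier. For brevity, the single-module score features are reduced to the raw module scores (the paper uses richer rank/score features), and `consistency_features` is a placeholder for φ_C(y, z):

```python
import itertools

import numpy as np


def mediator_infer(seg_hyps, parse_hyps, w, consistency_features):
    """Enumerate all My x Mz pairs and return the pair maximizing
    M(y, z) = w . [phi_S(y); phi_S(z); phi_C(y, z)]  (cf. Eq. 1)."""
    best_pair, best_val = None, -np.inf
    for y, z in itertools.product(seg_hyps, parse_hyps):
        phi = np.concatenate(([y.score, z.score], consistency_features(y, z)))
        val = float(w @ phi)
        if val > best_val:
            best_pair, best_val = (y, z), val
    return best_pair, best_val
```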

Training MEDIATOR. We can express the MEDIATOR score as $M(\mathbf{y}_i, \mathbf{z}_j) = \mathbf{w}^\intercal \phi(\mathbf{x}, \mathbf{y}_i, \mathbf{z}_j)$, a linear function of score and consistency features $\phi(\mathbf{x}, \mathbf{y}_i, \mathbf{z}_j) = [\phi_S(\mathbf{y}_i); \phi_S(\mathbf{z}_j); \phi_C(\mathbf{y}_i, \mathbf{z}_j)]$, where $\phi_S(\cdot)$ are the single-module (semantic segmentation and PPAR module) score features, and $\phi_C(\cdot, \cdot)$ are the inter-module consistency features. We describe these features in detail in the experiments. We learn these consistency weights $\mathbf{w}$ from a dataset annotated with ground-truth for the two modules y, z. Let $\{\mathbf{y}^*, \mathbf{z}^*\}$ denote the oracle pair, composed of the most accurate solutions in the hypothesis sets. We learn the MEDIATOR parameters in a discriminative learning fashion by solving the following Structured SVM problem:

$$\min_{\mathbf{w}, \xi_{ij}} \;\; \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} + C \sum_{ij} \xi_{ij} \tag{2a}$$

$$\text{s.t.} \;\; \underbrace{\mathbf{w}^\intercal \phi(\mathbf{x}, \mathbf{y}^*, \mathbf{z}^*)}_{\text{Score of oracle tuple}} \;-\; \underbrace{\mathbf{w}^\intercal \phi(\mathbf{x}, \mathbf{y}_i, \mathbf{z}_j)}_{\text{Score of any other tuple}} \;\geq\; \underbrace{1}_{\text{Margin}} \;-\; \underbrace{\frac{\xi_{ij}}{L(\mathbf{y}_i, \mathbf{z}_j)}}_{\text{Slack scaled by loss}} \qquad \forall i, j \in \{1, \ldots, M\}. \tag{2b}$$

Intuitively, we can see that the constraint (2b) tries to maximize the (soft) margin between the score of the oracle pair and all other pairs in the hypothesis sets. Importantly, the slack (or violation in the margin) is scaled by the loss of the tuple. Thus, if there are other good pairs not too much worse than the oracle, the margin for such tuples will not be tightly enforced. On the other hand, the margin between the oracle and bad tuples will be very strictly enforced.

This learning procedure requires us to define the loss function L(yi, zj), i.e., the cost of predicting a tuple (semantic segmentation, sentence parsing). We use a weighted average of individual losses:

$$L(\mathbf{y}_i, \mathbf{z}_j) = \alpha\, \ell(\mathbf{y}_{gt}, \mathbf{y}_i) + (1 - \alpha)\, \ell(\mathbf{z}_{gt}, \mathbf{z}_j) \tag{3}$$

The standard measure for evaluating semantic segmentation is the average Jaccard Index (or Intersection-over-Union) (Everingham et al., 2010), while for evaluating sentence parses w.r.t. their prepositional phrase attachments, we use the fraction of prepositions correctly attached. In our experiments, we report results with such a convex combination of module loss functions (for different values of α).
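As a concrete illustration of Eq. (3), the sketch below combines a Jaccard-index loss over a label map with a preposition-attachment error over a set of attachment triples. The data representations (integer label arrays, sets of (head, preposition, object) triples) are our own assumptions rather than the paper's code.

```python
import numpy as np


def jaccard_loss(y_gt, y_pred, labels):
    """1 - mean Jaccard index over the given category labels."""
    ious = []
    for c in labels:
        inter = np.logical_and(y_gt == c, y_pred == c).sum()
        union = np.logical_or(y_gt == c, y_pred == c).sum()
        if union > 0:
            ious.append(inter / union)
    return 1.0 - (np.mean(ious) if ious else 0.0)


def ppar_loss(z_gt, z_pred):
    """Fraction of ground-truth prepositional attachments missed by the parse.
    Attachments are represented as (head, preposition, object) triples."""
    gt = set(z_gt)
    return 1.0 - len(gt & set(z_pred)) / max(len(gt), 1)


def combined_loss(y_gt, y_pred, z_gt, z_pred, labels, alpha=0.5):
    """L(y, z) = alpha * l(y_gt, y) + (1 - alpha) * l(z_gt, z)  (Eq. 3)."""
    return alpha * jaccard_loss(y_gt, y_pred, labels) + \
        (1 - alpha) * ppar_loss(z_gt, z_pred)
```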

4 Experiments

We now describe the setup of our experiments, provide implementation details of the modules, and describe the consistency features.

Dataset. Access to richly annotated image + caption datasets is crucial for performing quantitative evaluations. Since this is the first paper to study the problem of joint segmentation and PPAR, we had to curate our own annotations for PPAR on three image caption datasets – ABSTRACT-50S (Vedantam et al., 2014), PASCAL-50S (Vedantam et al., 2014) (which expands the UIUC PASCAL sentence dataset (Rashtchian et al., 2010) from 5 captions per image to 50), and PASCAL-Context-50S (Mottaghi et al., 2014) (which uses the PASCAL Context image annotations and the same sentences as PASCAL-50S). Our new annotations will be made publicly available. To curate the PASCAL PPAR annotations, we first select all sentences that have prepositional phrase attachment ambiguities. We then plotted the distribution of prepositions in these sentences. The top 7 prepositions are used, as there is a large drop in the frequencies beyond these. The 7 prepositions are: "on", "with", "next to", "in front of", "by", "near", and "down". We then further sampled sentences to ensure a uniform distribution across prepositions. We do a similar filtering for the ABSTRACT-50S dataset (we use the top-6 prepositions). Details about the filtering can be found in the appendix. To summarize the statistics of all three datasets:

1. ABSTRACT-50S (Vedantam et al., 2014): 25,000 sentences (50 per image) with 500 images from abstract scenes made from clipart. Filtering for captions containing the top-6 prepositions resulted in 399 sentences describing 201 unique images. These 6 prepositions are: "with", "next to", "on top of", "in front of", "behind", and "under".

2. PASCAL-50S (Vedantam et al., 2014): 50,000 sentences (50 per image) for the images in the UIUC PASCAL sentence dataset (Rashtchian et al., 2010). Filtering for the top-7 prepositions resulted in a total of 30 unique images and 100 image-caption pairs, where ground-truth PPAR were carefully annotated by two vision + NLP graduate students.

3. PASCAL-Context-50S (Mottaghi et al., 2014): We use images and captions from PASCAL-50S, but with PASCAL Context segmentation annotations (60 categories instead of 21). This makes the vision task more challenging. Filtering this dataset for the top-7 prepositions resulted in a total of 966 unique images and 1822 image-caption pairs. Ground truth annotations for the PPAR were collected using Amazon Mechanical Turk. Workers were shown an image and a prepositional attachment (extracted from the corresponding parsing of the caption) as a phrase ("woman on couch"), and asked if it was correct. A screenshot of our interface is available in the supplement. Of the 1,822 sentences, there are 2,540 total prepositions, 2,147 ambiguous prepositions, an 84.53% ambiguity rate, and 283 sentences with multiple ambiguous prepositions.

Setup. Single Module: We first show that visual features help PPAR using the ABSTRACT-50S dataset, which contains clipart scenes where the extent and position of all the objects in the scene are known. This allows us to consider a scenario with a perfect vision system.

Multiple Modules: In this experiment we use imperfect language and vision modules, and show improvements on the PASCAL-50S and PASCAL-Context-50S datasets.

Module 1: Semantic Segmentation (SS) y. We use DeepLab-CRF (Chen et al., 2015) and DivMBest (Batra et al., 2012) to produce M diverse segmentations of the images. To evaluate, we use the image-level class-averaged Jaccard Index.

Module 2: PP Attachment Resolution (PPAR) z. We use a recent version (v3.3.1; released 2014) of the PCFG Stanford parser module (De Marneffe et al., 2006, Huang and Chiang, 2005) to produce M parsings of the sentence. In addition to the parse trees, the module can also output dependencies, which make syntactic relationships more explicit. Dependencies come in the form dependency_type(word1, word2), such as the preposition dependency prep_on(woman-8, couch-11) (the number indicates word position in the sentence). To evaluate, we count the percentage of preposition attachments that the parse gets correct.
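For concreteness, a small sketch of pulling prepositional attachments out of dependency strings in the format described above (e.g. "prep_on(woman-8, couch-11)"); the exact string format and the regular expression are assumptions on our part, not the parser's documented API.

```python
import re

# Matches dependencies like "prep_on(woman-8, couch-11)" and captures the
# preposition name and the two attached words (dropping word positions).
PREP_DEP = re.compile(r"prep_([a-z_]+)\((\w+)-\d+,\s*(\w+)-\d+\)")


def extract_attachments(dependency_lines):
    """Return (head_word, preposition, object_word) triples from the list of
    dependency strings produced for one parse hypothesis."""
    triples = []
    for line in dependency_lines:
        m = PREP_DEP.match(line.strip())
        if m:
            prep, head, obj = m.group(1), m.group(2), m.group(3)
            triples.append((head, prep.replace("_", " "), obj))
    return triples


# Example: the ambiguous caption from Figure 1.
deps = ["prep_on(woman-8, couch-11)", "nsubj(standing-4, dog-2)"]
print(extract_attachments(deps))  # [('woman', 'on', 'couch')]
```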

Baselines:
• INDEP. In our experiments, we compare our proposed approach (MEDIATOR) to the highest-scoring solution predicted independently by each module. For the semantic segmentation this is the output of DeepLab-CRF (Chen et al., 2015) and for the PPAR module this is the 1-best output of the Stanford Parser (De Marneffe et al., 2006, Huang and Chiang, 2005). Since our hypothesis lists are generated by greedy M-Best algorithms, this corresponds to predicting the (y1, z1) tuple. This comparison establishes the importance of joint reasoning. To the best of our knowledge, there is no existing (or even natural) joint model to compare to.


• DOMAIN ADAPTATION. We learn a re-ranker on the parses. Note that domain adaptation is only needed for PPAR since the Stanford Parser is trained on the Penn Treebank (Wall Street Journal text) and not on text about images (such as image captions). Such domain adaptation is not necessary for segmentation. This is a competitive single-module baseline. Specifically, we use the same parse-based features as our approach, and learn a reranker over the Mz parse trees (Mz = 10).

Our approach (MEDIATOR) significantly outperforms both baselines. The improvements over INDEP show that joint reasoning produces more accurate results than any module (vision or language) operating in isolation. The improvements over DOMAIN ADAPTATION establish that the source of improvements is indeed vision, and not the reranking step. Simply adapting the parser from its original training domain (Wall Street Journal) to our domain (image captions) is not enough.

Ablative Study: Ours-CASCADE: This ablation studies the importance of multiple hypotheses. For each module (say y), we feed the single-best output of the other module z1 as input. Each module learns its own weights w using exactly the same consistency features and learning algorithm as MEDIATOR and predicts one of the plausible hypotheses $\mathbf{y}_{\text{CASCADE}} = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} \mathbf{w}^\intercal \phi(\mathbf{x}, \mathbf{y}, \mathbf{z}_1)$. This ablation of our system is similar to (Heitz et al., 2008) and helps us in disentangling the benefits of multiple hypotheses and joint reasoning.

Finally, we note that Ours-CASCADE can be viewed as a special case of MEDIATOR. Let MEDIATOR-(My, Mz) denote our approach run with My hypotheses for the first module and Mz for the second. Then INDEP corresponds to MEDIATOR-(1, 1), and CASCADE corresponds to predicting the y solution from MEDIATOR-(My, 1) and the z solution from MEDIATOR-(1, Mz). To get an upper bound on our approach, we report oracle, the accuracy of the most accurate tuple among the 10 × 10 tuples.
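Under the same assumed interface (hypothesis objects with a .score, a weight vector w, and a φ_C callable), INDEP and CASCADE fall out as restrictions of the hypothesis lists; a sketch:

```python
import numpy as np


def cascade_predict_segmentation(seg_hyps, z1, w, consistency_features):
    """Ours-CASCADE for the segmentation module, i.e. MEDIATOR-(My, 1):
    condition on the other module's single-best output z1 and rerank only
    the segmentation hypotheses with the same features as MEDIATOR."""
    scores = [float(w @ np.concatenate(([y.score, z1.score],
                                        consistency_features(y, z1))))
              for y in seg_hyps]
    return seg_hyps[int(np.argmax(scores))]


def indep_predict(seg_hyps, parse_hyps):
    """INDEP baseline: the (y1, z1) tuple, i.e. MEDIATOR-(1, 1),
    with no consistency reasoning at all."""
    return seg_hyps[0], parse_hyps[0]
```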

In the main paper, our results are presented where MEDIATOR was trained with an equally weighted loss (α = 0.5), but we provide additional results for varying values of α in the supplement.

MEDIATOR and Consistency Features. Recall that we have two types of features – (1) score features φS(yi) and φS(zj), which try to capture how likely solutions yi and zj are respectively, and (2) consistency features φC(yi, zj), which capture how consistent the PP attachments in zj are with the segmentation in yi. For each (object1, preposition, object2) in zj, we compute 6 features between the object1 and object2 segmentations in yi. Since the humans writing the captions may use multiple synonymous words (e.g. dog, puppy) for the same visual entity, we use Word2Vec (Mikolov et al., 2013) similarities to map the nouns in the sentences to the corresponding dataset categories.

• Semantic Segmentation Score Features (φS(yi)) (2-dim): We use the ranks and the solution scores from DeepLab-CRF (Chen et al., 2015).

• PPAR Score Features (φS(zi)) (9-dim): We use the ranks and the log probability of the parses from (De Marneffe et al., 2006), and 7 binary indicators for PASCAL-50S (6 for ABSTRACT-50S) denoting which prepositions are present in the parse.

• Inter-Module Consistency Features (56-dim): For each of the 7 prepositions, 8 features are calculated:

– One feature is the Euclidean distance between the centers of the segmentation masks of the two objects in the segmentation that are connected by the preposition. These two objects in the segmentation correspond to the categories with which the soft similarity of the two objects in the sentence is highest among all PASCAL categories.

– Four features capture max{0, (normalized directional distance)}, where the directional distance measures above/below/left/right displacements between the two objects in the segmentation, and normalization involves dividing by the image height/width.

– One feature is the ratio of the sizes of object1 and object2 in the segmentation.

– Two features capture the Word2Vec similarity between the two objects in the PPAR (say 'puppy' and 'kitty') and their most similar PASCAL category (say 'dog' and 'cat'), where these features are 0 if the categories are not present in the segmentation.

A visual illustration of some of these features for PASCAL can be seen in Figure 3.

Figure 3: Example on PASCAL-50S ("A dog is standing next to a woman on a couch."). The ambiguity in this sentence is "(dog next to woman) on couch" vs "dog next to (woman on couch)". We calculate the horizontal and vertical distances between the segmentation centers of "person" and "couch" and between the segmentation centers of "dog" and "couch". We see that the "dog" is much further below the couch (53.91) than the woman (2.65). So, if the MEDIATOR model learned that "on" means the first object is above the second object, we would expect it to choose the "person on couch" preposition parsing.

In the case where an object parsed from zj is not present in the segmentation yi, the distance features are set to 0. The ratio-of-areas feature (area of smaller object / area of larger object) is also set to 0, assuming that the smaller object is missing. In the case where an object has two or more connected components in the segmentation, the distances are computed w.r.t. the centroid of the segmentation and the area is computed as the number of pixels in the union of the instance segmentation masks. We also calculate 20 features for PASCAL-50S and 59 features for PASCAL-Context-50S that capture the consistency between yi and zj in terms of the presence/absence of PASCAL categories. For each noun in the PPAR we compute its Word2Vec similarity with all PASCAL categories. For each PASCAL category, the feature is the sum of similarities (with that PASCAL category) over all nouns if the category is present in the segmentation; otherwise, the feature is -1 times the sum of similarities over all nouns. This feature set was not used for ABSTRACT-50S, since these features were intended to help improve the accuracy of the semantic segmentation module. For ABSTRACT-50S, we only use the 5 distance features, resulting in a 30-dim feature vector.
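A sketch of the geometric portion of these consistency features, computed from two binary masks taken from a segmentation hypothesis. The missing-object and normalization conventions follow the description above, but the exact implementation details are our assumptions.

```python
import numpy as np


def geometric_features(mask1, mask2):
    """Distance/size features between the object1 and object2 masks connected
    by a preposition: Euclidean centroid distance, four rectified directional
    displacements (above/below/left/right, normalized by image size), and the
    area ratio (smaller / larger). Returns zeros if either object is missing."""
    if mask1.sum() == 0 or mask2.sum() == 0:
        return np.zeros(6)
    h, w = mask1.shape
    (r1, c1), (r2, c2) = [np.argwhere(m).mean(axis=0) for m in (mask1, mask2)]
    euclid = np.hypot(r1 - r2, c1 - c2)
    above = max(0.0, (r2 - r1) / h)   # object1 above object2
    below = max(0.0, (r1 - r2) / h)   # object1 below object2
    left  = max(0.0, (c2 - c1) / w)   # object1 left of object2
    right = max(0.0, (c1 - c2) / w)   # object1 right of object2
    ratio = min(mask1.sum(), mask2.sum()) / max(mask1.sum(), mask2.sum())
    return np.array([euclid, above, below, left, right, ratio])
```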

4.1 Single-Module Results

We performed 10-fold cross-validation on the ABSTRACT-50S dataset to pick M (=10) and the weight on the hinge loss for MEDIATOR (C). The results are presented in Table 1. Our approach significantly outperforms the 1-best outputs of the Stanford Parser (De Marneffe et al., 2006) by 20.66% (36.42% relative). This shows a need for diverse hypotheses and for reasoning about visual features when picking a sentence parse. oracle denotes the best achievable performance using these 10 hypotheses.

4.2 Multiple-Module Results

We performed 10-fold cross-validation for our results on PASCAL-50S and PASCAL-Context-50S, with 8 train folds, 1 val fold, and 1 test fold, where the val fold was used to pick My, Mz, and C. Figure 4 shows the average combined accuracy on the val set, which was found to be maximal at My = 5, Mz = 3 for PASCAL-50S, and My = 1, Mz = 10 for PASCAL-Context-50S; these values are used at test time.

Module   Stanford Parser   Domain Adaptation   Ours    oracle
PPAR     56.73             57.23               77.39   97.53

Table 1: Results on our subset of ABSTRACT-50S.

We present our results in Table 2. Our approach significantly outperforms the Stanford Parser (De Marneffe et al., 2006) by 17.91% (28.69% relative) for PASCAL-50S, and 12.83% (25.28% relative) for PASCAL-Context-50S. We also make small improvements over DeepLab-CRF (Chen et al., 2015) in the case of PASCAL-50S. To measure the statistical significance of our results, we performed paired t-tests between MEDIATOR and INDEP. For both modules (and the average), the null hypothesis (that the accuracies of our approach and the baseline come from the same distribution) can be successfully rejected at a p-value of 0.05. For the sake of completeness, we also compared MEDIATOR with our ablated system (CASCADE) and found statistically significant differences only in PPAR.


Figure 4: (a) Validation accuracies for different values of M on ABSTRACT-50S, (b) for different values of My, Mz on PASCAL-50S, (c) for different values of My, Mz on PASCAL-Context-50S.

                    PASCAL-50S                                          PASCAL-Context-50S
                    Instance-Level Jaccard Index  PPAR Acc.  Average    Instance-Level Jaccard Index  PPAR Acc.  Average
DeepLab-CRF         66.83                         -          -          43.94                         -          -
Stanford Parser     -                             62.42      -          -                             50.75      -
Average             -                             -          64.63      -                             -          47.345
Domain Adaptation   -                             72.08      -          -                             58.32      -
Ours CASCADE        67.56                         75.00      71.28      43.94                         63.58      53.76
Ours MEDIATOR       67.58                         80.33      73.96      43.94                         63.58      53.76
oracle              69.96                         96.50      83.23      49.21                         75.75      62.48

Table 2: Results on our subset of the PASCAL-50S and PASCAL-Context-50S datasets. We are able to significantly outperform the Stanford Parser and make small improvements over DeepLab-CRF for PASCAL-50S.

These results demonstrate a need for each module to produce a diverse set of plausible hypotheses for our MEDIATOR model to reason about. In the case of PASCAL-Context-50S, MEDIATOR performs identically to CASCADE since My is chosen as 1 (which is the CASCADE setting) in cross-validation. Recall that MEDIATOR is a larger model class than CASCADE (in fact, CASCADE is a special case of MEDIATOR with My = 1). It is interesting to see that the larger model class does not hurt, and MEDIATOR gracefully reduces to a smaller-capacity model (CASCADE) if the amount of data is not enough to warrant the extra capacity. We hypothesize that in the presence of more training data, cross-validation may pick a different setting of My and Mz, resulting in full utilization of the model capacity. Also note that our domain adaptation baseline achieved an accuracy higher than MAP/Stanford-Parser, but significantly lower than our approach for both PASCAL-50S and PASCAL-Context-50S. We also performed this for our single-module experiment and picked Mz (=10) with cross-validation, which resulted in an accuracy of 57.23%. Again, this is higher than MAP/Stanford-Parser (56.73%), but significantly lower than our approach (77.39%). Clearly, this shows that domain adaptation alone is not sufficient.

The table also shows that the oracle performance is fairly high (96.50% for PPAR at 10 parses), suggesting that when there is ambiguity and room for improvement, our mediator is able to rerank effectively. We also provide per-preposition performances and gains over the independent baseline on PASCAL-Context-50S in Table 4. We see that vision helps all prepositions.

Ablation Study for Features. Table 3 displays the results of an ablation study on PASCAL-50S and PASCAL-Context-50S to show the importance of the different features. In each row, we retain the module score features and drop a single set of consistency features. We can see that all of the consistency features contribute to the performance of MEDIATOR.

Visualizing Prepositions.


Figure 5: Visualizations for three different prepositions ("on", "by", "with"). Here red and blue indicate high and low scores, respectively.

                            PASCAL-50S                               PASCAL-Context-50S
Feature set                 Instance-Level Jaccard Index  PPAR Acc.  PPAR Acc.
All features                67.58                         80.33      63.58
Drop all consistency        66.96                         66.67      61.47
Drop Euclidean distance     67.27                         77.33      63.77
Drop directional distance   67.12                         78.67      63.63
Drop Word2Vec               67.58                         78.33      62.72
Drop category presence      67.48                         79.25      61.19

Table 3: Ablation study for different combinations of features on PASCAL-50S and PASCAL-Context-50S. We only show PPAR Acc. for PASCAL-Context-50S because My = 1.

        "on"    "with"   "next to"   "in front of"   "by"    "near"   "down"
Acc.    64.26   63.30    60.98       56.86           62.81   67.83    67.23
Gain    12.47   15.41    11.18       13.27           14.13   12.82    17.16

Table 4: Performances and gains over the independent baseline on PASCAL-Context-50S for each preposition.

Figure 5 shows a visualization of what our MEDIATOR model has implicitly learned about 3 prepositions ("on", "by", "with"). These visualizations show the score obtained by taking the dot product of the distance features (Euclidean and directional) between object1 and object2 connected by the preposition with the corresponding learned weights of the model, considering object2 to be at the center of the visualization. Notice that these were learned without explicit training for spatial learning as in spatial relation learning (SRL) works (Malinowski and Fritz, 2014b, Lan et al., 2012b). These were simply recovered as an intermediate step towards reranking SS + PPAR hypotheses. Also note that SRL cannot handle multiple segmentation hypotheses, which our work shows are important (Table 2, CASCADE). In addition, our approach is more general.
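A sketch of how such a visualization can be produced from the learned weights: sweep object1's centroid over a grid with object2 fixed at the center, build the same kind of distance features, and take the dot product with that preposition's weight sub-vector. The particular feature layout of w_prep is an assumption here.

```python
import numpy as np


def preposition_heatmap(w_prep, size=101):
    """Score a grid of relative object1 positions (object2 at the center)
    using learned weights over [euclidean, above, below, left, right]."""
    heat = np.zeros((size, size))
    cr = cc = size // 2
    for r in range(size):
        for c in range(size):
            dr, dc = r - cr, c - cc
            feats = np.array([
                np.hypot(dr, dc) / size,   # normalized Euclidean distance
                max(0.0, -dr / size),      # object1 above object2
                max(0.0,  dr / size),      # object1 below object2
                max(0.0, -dc / size),      # object1 left of object2
                max(0.0,  dc / size),      # object1 right of object2
            ])
            heat[r, c] = float(w_prep @ feats)
    return heat  # visualize with, e.g., matplotlib's imshow
```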

5 Discussions and Conclusion

We presented an approach for simultaneous reasoning about prepositional phrase attachment resolution of captions and semantic segmentation of images that integrates beliefs across the modules to pick the best pair from a diverse set of hypotheses. Our full model (MEDIATOR) significantly improves the accuracy of PPAR over the Stanford Parser by 17.91% for PASCAL-50S and by 12.83% for PASCAL-Context-50S, and achieves a small improvement on semantic segmentation over DeepLab-CRF for PASCAL-50S. These results demonstrate a need for information exchange between the modules, as well as a need for a diverse set of hypotheses to concisely capture the uncertainties of each module. The large gains in PPAR validate our intuition that vision is very helpful for dealing with ambiguity in language. Furthermore, we see from the oracle accuracies that even larger gains are possible.

While we have demonstrated our approach on a task involving simultaneous reasoning about language and vision, our approach is general and can be used for other applications. Overall, we hope our approach will be useful in a number of settings.

Acknowledgements. We thank Larry Zitnick, Mohit Bansal, Kevin Gimpel, and Devi Parikh for helpful discussions, suggestions, and feedback that were included in this work. A majority of this work was done while AL was an intern at Virginia Tech. This work was partially supported by a National Science Foundation CAREER award, an Army Research Office YIP Award, an Office of Naval Research grant, and GPU donations by NVIDIA, all awarded to DB. GC was supported by DTRA Basic Research provided by KK. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.


Appendix Overview

In this appendix we provide the following:

Appendix I: Additional motivation for our MEDIATOR model.
Appendix II: Background on ABSTRACT-50S.
Appendix III: Details of the dataset curation process for the ABSTRACT-50S, PASCAL-50S, and PASCAL-Context-50S datasets.
Appendix IV: Qualitative examples from our approach.
Appendix V: Results where we study the effect of varying the weighting of each module in our approach.

Appendix I Additional Motivation for MEDIATOR

An example is shown in Figure 6, where the ambiguous sentence that describes the image is "A dog is standing next to a woman on a couch", where the ambiguity is "(dog next to woman) on couch" vs "dog next to (woman on couch)", which is reflected in the parse trees' uncertainty. Parse tree #1 (Figure 6g) shows "standing" (the verb phrase of the noun "dog") connected with "couch" via the "on" preposition, whereas parse trees #2 (Figure 6h) and #3 (Figure 6i) show "woman" connected with "couch" via the "on" preposition. This ambiguity can be resolved if we look at an accurate semantic segmentation such as Hypothesis #3 (Figure 6f) of the associated image (Figure 6b). Likewise, we might be able to do better at semantic segmentation if we choose a segmentation that is consistent with the sentence, such as Segmentation Hypothesis #3 (Figure 6f), which contains a person on a couch with a dog next to them, unlike the other two hypotheses (Figure 6d and Figure 6e).

Appendix II Background About ABSTRACT-50S

The Abstract Scenes dataset (Zitnick and Parikh, 2013) contains synthetic images generated by human subjects via a drag-and-drop clipart interface. The subjects are given access to a (random) subset of 56 clipart objects that can be found in park scenes, as well as two characters, Mike and Jenny, with a variety of poses and expressions. Example scenes can be found in Figure 7. The motivation is to allow researchers to focus on higher-level semantic understanding without having to deal with noisy information extraction from real images, since the entire contents of the scene are known exactly, while also providing a dense semantic space to study (due to the heavily constrained world). We used the dataset in precisely this way to first test the PPAR module in isolation and demonstrate that this problem can be helped by a sentence's corresponding image.

Figure 7: We show some example scenes from (Zitnick and Parikh, 2013). Each column shows two semantically similar scenes, while the different columns show the diversity of scene types.

Appendix III Dataset Curation and Annotation

The subsets of the PASCAL-50S and ABSTRACT-50S datasets used in the main paper were carefully curated by two vision + NLP graduate students. The subset of the PASCAL-Context-50S dataset used in the main paper was curated by Mechanical Turk workers. The following describes the dataset curation process for each dataset.

PASCAL-50S: For PASCAL-50S we first obtained sentences that contain one or more of 7 prepositions (i.e., "with", "next to", "on top of", "in front of", "behind", "by", and "on") that intuitively would typically depend on the relative distance between objects. Then we look for sentences that have prepositional phrase attachment ambiguities, i.e., sentences where the parser output has different sets of prepositions for different parsings. (Note that, due to our focus on PP attachment, we do not pay attention to other parts of the sentence parse, so the parses can change while the PP attachments remain the same, as in Figure 6h and Figure 6i.)


[Figure 6 panels: (a) Caption, (b) Input Image, (c) Segmentation GT, (d) Segmentation Hypothesis #1, (e) Segmentation Hypothesis #2, (f) Segmentation Hypothesis #3, (g) Parse Hypothesis #1, (h) Parse Hypothesis #2, (i) Parse Hypothesis #3]

Figure 6: In this figure, we illustrate why the MEDIATOR model makes sense for the task of captioned scene understanding. For the caption-image pair (Figure 6a-Figure 6b), we see that parse tree #1 (Figure 6g) shows "standing" (the verb phrase of the noun "dog") connected with "couch" via the "on" preposition, whereas parse trees #2 (Figure 6h) and #3 (Figure 6i) show "woman" connected with "couch" via the "on" preposition. This ambiguity can be resolved if we look at an accurate semantic segmentation such as Hypothesis #3 (Figure 6f) of the associated image (Figure 6b). Likewise, we might be able to do better at semantic segmentation if we choose a segmentation that is consistent with the sentence, such as Segmentation Hypothesis #3 (Figure 6f), which contains a person on a couch with a dog next to them, unlike the other two hypotheses (Figure 6d and Figure 6e).

The sentences thus obtained are further filtered to obtain sentences in which the objects that are connected by the preposition belong to one of the 20 PASCAL object categories. Since our Module 1 is semantic segmentation and not instance-level segmentation, we restrict the dataset to sentences involving prepositions connecting two different PASCAL categories. Thus, our final dataset contains 100 sentences describing 30 unique images and contains 16 of the 20 PASCAL categories, as described in the paper. We then manually annotated the ground truth PP attachments. Note that such manual labeling by annotators (students with expertise in NLP) takes a lot of time, but results in annotations that are linguistically high-quality, with any inter-human disagreement resolved by strict adherence to rules of grammar.

ABSTRACT-50S: We first obtained sentences that contain one or more of 6 prepositions (i.e., "with", "next to", "on top of", "in front of", "behind", "under"). Note, due to the semantic differences between the datasets, not all prepositions found in one were present in the other.


Figure 8: We show the percentage of ambiguous sentences in the PASCAL-Context-50S dataset before filtering for prepositions. We found that there was a drop in the percentage of sentences for prepositions that appear in the sorted list after "down". So, for the PASCAL-Context-50S dataset we only keep sentences that have one or more visual prepositions in the list of prepositions up to "down".

Further filtering of sentences was done to ensure that the sentences contain at least one prepositional phrase attachment ambiguity that is between the clipart noun categories (i.e., each clipart piece has a name, like "snake", that we search the sentence parsing for). This filtering reduced the original dataset of 25,000 sentences and 500 scenes to our final experiment dataset of 399 sentences and 201 scenes. We then manually annotated the ground truth PP attachments.

PASCAL-Context-50S: For PASCAL-Context-50S, we first selected all sentences that have prepositional phrase attachment ambiguities. We then plotted the distribution of prepositions in these sentences (see Figure 8). We found that there was a drop in the percentage of sentences for prepositions that appear in the sorted list after "down". Therefore, we only kept sentences that have one or more 2-D visual prepositions in the list of prepositions up to "down". Thus we ended up with the following 7 prepositions: "on", "with", "next to", "in front of", "by", "near", and "down". We then further sampled sentences to ensure a uniform distribution across prepositions. Note that unlike PASCAL-50S, we did not filter sentences based on whether the objects connected by the prepositions belong to one of the 60 PASCAL Context categories or not. Instead, we used the Word2Vec (Mikolov et al., 2013) similarity between the objects in the sentence and the PASCAL Context categories as one of the features. Thus, our final dataset contains 1822 sentences describing 966 unique images.

The ground truth PP attachments for these 1822 sentences were annotated by Amazon Mechanical Turk workers. For each unique prepositional relation in a sentence, we showed the workers the prepositional relation of the form "primary object preposition secondary object" and its associated image and sentence, and asked them to specify whether the prepositional relation is correct or not correct. We also asked them to choose a third option - "Primary object / secondary object is not a noun in the caption" - in case that happened. The user interface used to collect these annotations is shown in Figure 9. We collected five answers for each prepositional relation. For evaluation, we used the majority response. We found that 87.11% of human responses agree with the majority response, indicating that even though AMT workers were not explicitly trained in rules of grammar by us, there is relatively high inter-human agreement.

Appendix IV Qualitative Examples

Figures 10-15 show qualitative examples from our experiments. Figures 10-12 show examples for the multiple-module experiment (semantic segmentation and PPAR), and Figures 13-15 show examples for the single-module experiment. In each figure, the top row shows the image and the associated sentence. For the multiple-module figures, the second and third rows show the diverse segmentations of the image, and the bottom two rows show different parsings of the sentence (the last two rows for the single-module examples, as well). In these examples our approach uses 10 diverse solutions for the semantic segmentation module and 10 different solutions for the PPAR module. The highlighted pairs of solutions show the solutions picked by the MEDIATOR model. Examining the results gives a sense of how the parsings can help the semantic segmentation module pick the best solution and vice versa.


Figure 9: The AMT interface to collect ground truth annotations for prepositional relations.


A young couple sit with their puppy on the couch.

Figure 10: Example 1 – multiple modules (SS and PPAR)

A man is in a car next to a man with a bicycle.

Figure 11: Example 2 – multiple modules (SS and PPAR)


Boy lying on couch with a sleeping puppy curled up next to him.

Figure 12: Example 3 – multiple modules (SS and PPAR)

Figure 13: Example 1 – single module (PPAR)


Figure 14: Example 2 – single module (PPAR)

Figure 15: Example 3 – single module (PPAR)


Appendix V Effect of Different Weighting of Modules

[Figure 16 graphics: two plots, (a) and (b), of PPAR accuracy and semantic segmentation accuracy as a function of α.]

Figure 16: Accuracies of MEDIATOR (both modules) vs. α, where α is the coefficient for the semantic segmentation module and 1-α is the coefficient for the PPAR resolution module in the loss function. Our approach is fairly robust to the setting of α, as long as it is not set to either extreme, since that limits the synergy between the modules. (a) shows the results for PASCAL-50S, and (b) shows the results for PASCAL-Context-50S.

So far we have used the "natural" setting of α = 0.5, which gives equal weight to both modules. Note that α is not a parameter of our approach; it is a design choice that the user/experiment-designer makes. To see the effect of weighting the modules differently, we tested our approach for various values of α. Figure 16 shows how the accuracies of each module vary depending on α for the MEDIATOR model for PASCAL-50S and PASCAL-Context-50S. Recall that α is the coefficient for the semantic segmentation module and 1-α is the coefficient for the PPAR resolution module in the loss function. We see that, as expected, putting no or little weight on the PPAR module drastically hurts performance for that module; any non-trivial weight performs fairly similarly, with the peak lying somewhere between the extremes. The segmentation module shows similar behavior, though it is not as sensitive to the choice of α. We believe this is because of the small "dynamic range" of this module – the gap between the 1-best and oracle segmentation is smaller, and thus the MEDIATOR can always default to the 1-best as a safe choice.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV.
Kent Bach. http://online.sfsu.edu/kbach/ambguity.html. Routledge Encyclopedia of Philosophy entry.
Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. 2012. Diverse M-Best Solutions in Markov Random Fields. In ECCV.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2015. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR.
Ernest Davis. http://cs.nyu.edu/faculty/davise/ai/ambiguity.html. Notes on Ambiguity.
Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 88(2):303–338, June.
Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2015. From Captions to Visual Concepts and Back. In CVPR.
Sanja Fidler, Abhishek Sharma, and Raquel Urtasun. 2013. A sentence is worth a thousand pixels. In CVPR.
Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. 2014. A Visual Turing Test for Computer Vision Systems. In PNAS.
Geremy Heitz, Stephen Gould, Ashutosh Saxena, and Daphne Koller. 2008. Cascaded classification models: Combining models for holistic scene understanding. In NIPS.
Liang Huang and David Chiang. 2005. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology (IWPT), pages 53–64.
Jorg H. Kappes, Bjoern Andres, Fred A. Hamprecht, Christoph Schnorr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X. Kausler, Jan Lellmann, Nikos Komodakis, and Carsten Rother. 2013. A comparative study of modern inference techniques for discrete energy minimization problems. In CVPR.
Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? Text-to-image coreference. In CVPR.
Tian Lan, Weilong Yang, Yang Wang, and Greg Mori. 2012a. Image retrieval with structured object queries using latent ranking SVM. In Computer Vision–ECCV 2012, pages 129–142. Springer.
Tian Lan, Weilong Yang, Yang Wang, and Greg Mori. 2012b. Image retrieval with structured object queries using latent ranking SVM. In Proceedings of the 12th European Conference on Computer Vision - Volume Part VI, ECCV'12.
Mateusz Malinowski and Mario Fritz. 2014a. A pooling approach to modelling spatial relations for image retrieval and annotation. arXiv preprint arXiv:1411.5190.
Mateusz Malinowski and Mario Fritz. 2014b. A pooling approach to modelling spatial relations for image retrieval and annotation. CoRR, abs/1411.5190.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In ICCV.
Talya Meltzer, Chen Yanover, and Yair Weiss. 2005. Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation. In ICCV, pages 428–435.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.
Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Massimo Poesio and Ron Artstein. 2005. Annotating (anaphoric) ambiguity. In Corpus Linguistics Conference.
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.
Adwait Ratnaparkhi, Jeff Reynar, and Salim Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In Proceedings of the Workshop on Human Language Technology, pages 250–255. Association for Computational Linguistics.
Richard Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, and Carsten Rother. 2008. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. PAMI, 30(6):1068–1080.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. In CVPR.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In CVPR.
Mark Yatskar, Michel Galley, Lucy Vanderwende, and Luke Zettlemoyer. 2014. See no evil, say no evil: Description generation from densely labeled images. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014).
Licheng Yu, Eunbyung Park, Alexander C. Berg, and Tamara L. Berg. 2015. Visual Madlibs: Fill in the blank image generation and question answering. arXiv.
C. Lawrence Zitnick and Devi Parikh. 2013. Bringing Semantics Into Focus Using Visual Abstraction. In CVPR.