See No Evil, Say No Evil: Description Generation from Densely Labeled Images

Mark Yatskar1*        [email protected]
Michel Galley2        [email protected]
Lucy Vanderwende2     [email protected]
Luke Zettlemoyer1     [email protected]

1 Computer Science & Engineering, University of Washington, Seattle, WA 98195, USA
2 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

Abstract

This paper studies generation of descriptive sentences from densely annotated images. Previous work studied generation from automatically detected visual information but produced a limited class of sentences, hindered by currently unreliable recognition of activities and attributes. Instead, we collect human annotations of objects, parts, attributes and activities in images. These annotations allow us to build a significantly more comprehensive model of language generation and allow us to study what visual information is required to generate human-like descriptions. Experiments demonstrate high quality output and that activity annotations and relative spatial location of objects contribute most to producing high quality sentences.

1 Introduction

Image descriptions compactly summarize complex visual scenes. For example, consider the descriptions of the image in Figure 1, which vary in content but focus on the women and what they are doing. Automatically generating such descriptions is challenging: a full system must understand the image, select the relevant visual content to present, and construct complete sentences. Existing systems aim to address all of these challenges but use visual detectors for only a small vocabulary of words, typically nouns, associated with objects that can be reliably found.1 Such systems are blind to much of the visual content needed to generate complete, human-like sentences.

* This work was conducted at Microsoft Research.
Footnote 1: While object recognition is improving (ImageNet accuracy is over 90% for 1000 classes), progress in activity recognition has been slower; the state of the art is below 50% mean average precision for 40 activity classes (Yao et al., 2011).

Figure 1: An annotated image with human-generated sentence descriptions. Each bounding polygon encompasses one or more objects and is associated with a count and text labels. This image has 9 high-level objects annotated with over 250 textual labels. For example, the women object (Count: 3) carries Isa labels (girls, models), Doing labels (smiling), Has labels (shorts, bags), and Attrib labels (young, tan), and the five human descriptions include "Three girls with large handbags walking down the sidewalk" and "Three women walk down a city street, as seen from above."

In this paper, we instead study generation with more complete visual support, as provided by human annotations, allowing us to develop more comprehensive models than previously considered. Such models have the dual benefit of (1) providing new insights into how to construct more human-like sentences and (2) allowing us to perform experiments that systematically study the contribution of different visual cues in generation, suggesting which automatic detectors would be most beneficial for generation.

In an effort to approximate relatively complete visual recognition, we collected manually labeled representations of objects, parts, attributes and activities for a benchmark caption generation dataset that includes images paired with human authored descriptions (Rashtchian et al., 2010).2 As seen in Figure 1, the labels include object boundaries and descriptive text, here including the facts that the children are "riding" and "walking" and that they are "young." Our goal is to be as exhaustive as possible, giving equal treatment to all objects. For example, the annotations in Figure 1 contain enough information to generate the first three sentences and most of the content in the remaining two. Labels gathered in this way are a type of feature norms (McRae et al., 2005), which have been used in the cognitive science literature to approximate human perception and were recently used as a visual proxy in distributional semantics (Silberer and Lapata, 2012). We present the first effort, that we are aware of, for using feature norms to study image description generation.

Such rich data allows us to develop significantly more comprehensive generation models. We divide generation into choices about which visual content to select and how to realize a sentence that describes that content. Our approach is grammar-based, feature-rich, and jointly models both decisions. The content selection model includes latent variables that align phrases to visual objects and features that, for example, measure how visual salience and spatial relationships influence which objects are mentioned. The realization approach considers a number of cues, including language model scores, word specificity, and relative spatial information (e.g., to produce the best spatial prepositions), when producing the final sentence. When used with a reranking model, including global cues such as sentence length, this approach provides a full generation system.

Our experiments demonstrate high quality visual content selection, within 90% of human performance on unigram BLEU, and improved complete sentence generation, nearly halving the difference from human performance to two baselines on 4-gram BLEU. In ablations, we measure the importance of different annotations and visual cues, showing that annotation of activities and relative bounding box information between objects are crucial to generating human-like descriptions.

2 Related Work

A number of approaches have been proposed for constructing sentences from images, including copying captions from other images (Farhadi et al., 2010; Ordonez et al., 2011), using text surrounding an image in a news article (Feng and Lapata, 2010), filling visual sentence templates (Kulkarni et al., 2011; Yang et al., 2011; Elliott and Keller, 2013), and stitching together existing sentence descriptions (Gupta and Mannem, 2012; Kuznetsova et al., 2012). However, due to the lack of reliable detectors, especially for activities, many previous systems have a small vocabulary and must generate many words, including verbs, with no direct visual support. These problems also extend to video caption systems (Yu and Siskind, 2013; Krishnamoorthy et al., 2013).

Footnote 2: Available at http://homes.cs.washington.edu/~my89/

The Midge algorithm (Mitchell et al., 2012) is most closely related to our approach, and will provide a baseline in our experiments. Midge is syntax-driven but again uses a small vocabulary without direct visual support for every word. It outputs a large set of sentences to describe all triplets of recognized objects in the scene, but does not include a content selection model to select the best sentence. We extend Midge with content and sentence selection rules to use it as a baseline.

The visual facts we annotate are motivated by research in machine vision. Attributes are a good intermediate representation for categorization (Farhadi et al., 2009). Activity recognition is an emerging area in images (Li and Fei-Fei, 2007; Yao et al., 2011; Sharma et al., 2013) and video (Weinland et al., 2011), although less studied than object recognition. Also, parts have been widely used in object recognition (Felzenszwalb et al., 2010). Yet, no work tests the contribution of these labels for sentence generation.

There is also a significant amount of work on other grounded language problems, where related models have been developed. Visual referring expression generation systems (Krahmer and Van Deemter, 2012; Mitchell et al., 2013; FitzGerald et al., 2013) aim to identify specific objects, a sub-problem we deal with when describing images more generally. Other research generates descriptions in simulated worlds and, like this work, uses feature-rich models (Angeli et al., 2010) or syntactic structures like PCFGs (Chen et al., 2010; Konstas and Lapata, 2012), but does not combine the two. Finally, Zitnick and Parikh (2013) study sentences describing clipart scenes. They present a number of factors influencing overall descriptive quality, several of which we use in sentence generation for the first time.


3 Dataset

We collected a dataset of richly annotated images to approximate gold standard visual recognition. In collecting the data, we sought a visual annotation with sufficient coverage to support the generation of as many of the words in the original image descriptions as possible. We also aimed to make it as visually exhaustive as possible, giving equal treatment to all visible objects. This ensures less bias from annotators' perception about which objects are important, since one of the problems we would like to solve is content selection. This dataset will be available for future experiments.

We built on the dataset of Rashtchian et al. (2010), which contained 8,000 Flickr images and associated descriptions gathered using Amazon Mechanical Turk (MTurk). Restricting ourselves to Creative Commons images, we sampled 500 images for annotation.

We collected annotations of images in three stages using MTurk, and assigned each annotation task to 3-5 workers to improve quality through redundancy (Callison-Burch, 2009). Below we describe the process for annotating a single image.

Stage 1: We prompted five turkers to list all objects in an image, ignoring objects that are parts of larger objects (e.g., the arms of a person), which we collected later in Stage 3. This list also included groups, such as crowds of people.

Stage 2: For each unique object label from Stage 1, we asked two turkers to draw a polygon around the object identified.3 In cases where the object is a group, we also asked for the number of objects present (1-6 or many). Finally, we created a list of all references to the object from the first stage, which we call the Object facet.

Stage 3: For each object or group, we prompted three turkers to provide descriptive phrases of:

• Doing – actions the object participates in, e.g., "jumping."
• Parts – physical parts, e.g., "legs," or other items in the possession of the object, e.g., "shirt."
• Attributes – adjectives describing the object, e.g., "red."
• Isa – alternative names for an object, e.g., "boy," "rider."

Figure 1 shows more examples for objects in a labeled image.4 We refer to all of these annotations, including the merged Object labels, as facets. These labels provide feature norms (McRae et al., 2005), which have recently been used as a visual proxy in distributional semantics (Silberer and Lapata, 2012; Silberer et al., 2013) but have not been previously studied for generation. This annotation of 500 images (2,500 sentences) yielded over 4,000 object instances and 100,000 textual labels.

Footnote 3: We modified LabelMe (Torralba et al., 2010).
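To make the annotation unit concrete, here is a minimal sketch of how one densely labeled object could be represented in code; the class and field names are our own illustrative assumptions, not the released data format.

```python
# A minimal sketch (assumed field names, not the released data format) of one
# annotated object from Figure 1: a bounding polygon, a count, and text facets.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AnnotatedObject:
    name: str                           # label from Stage 1, e.g. "women"
    polygon: List[Tuple[float, float]]  # Stage 2 bounding polygon vertices
    count: int                          # 1-6, or a sentinel standing in for "many"
    facets: Dict[str, List[str]] = field(default_factory=dict)  # Stage 3 labels

women = AnnotatedObject(
    name="women",
    polygon=[(120.0, 80.0), (240.0, 80.0), (240.0, 310.0), (120.0, 310.0)],
    count=3,
    facets={
        "Object": ["women", "girls", "models"],   # merged Stage 1 references
        "Doing": ["smiling", "walking"],
        "Parts": ["shorts", "bags"],
        "Attributes": ["young", "tan"],
    },
)
```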

4 Approach

Given such rich annotations, we can now develop significantly more comprehensive generation models. In this section, we present an approach that first uses a generative model and then a reranker. The generative model defines a distribution over content selection and content realization choices, using diverse cues from the image annotations. The reranker trades off our generative model score, language model score (to encourage fluency), and length to produce the final sentence.

Generative Model. We want to generate a sentence w = ⟨w_1 . . . w_n⟩ where each word w_i ∈ V comes from a fixed vocabulary V. The vocabulary V includes all 2,700 words used in descriptive sentences in the training set.5

The model conditions on an annotated image I that contains a set of objects O, where each object o ∈ O has a bounding polygon and a number of facets containing string labels. To model the naming of specific objects, words w_i can be associated with alignment variables a_i that range over O. One such variable is introduced for each head noun in the sentence. Figure 2 shows alignment variable settings with colors that match objects in the image. Finally, as a byproduct of the hierarchical generative process, we construct an undirected dependency tree d over the words in w.

The complete generative model defines the probability p(w, a, d | I) of a sentence w, word alignments a, and undirected dependency tree d, given the annotated input image I. The overall process unfolds recursively, as seen in Figure 3.

Footnote 4: In the experiments, Parts and Isa facets do not improve performance, so we do not use them in the final model. Isa is redundant with the Object facet, as seen in Figure 1. Also, parts like clothing were often annotated as separate objects.
Footnote 5: We do not generate from image facets directly, because only 20% of the sentences in our data can be produced like this. Instead, we develop features which consider the similarity between labels in the image and words in the vocabulary.


Figure 2: One path through the generative model and the Bayesian network it induces, for the example sentence "Three girls with large handbags walking down the sidewalk." The first row of colored circles (a) are alignment variables to objects in the image; the second row (w) is the words, generated conditioned on the alignments; the induced dependency structure is d.

The main clause is produced by first selecting the subject alignment a_s, followed by the subject word w_s. It then chooses the verb and optionally the object alignment a_o and word w_o. The process then continues recursively, modifying the subject, verb, and object of the sentence with noun and prepositional modifiers. The recursion begins at Step 2 in Figure 3. Given a parent word w and that word's relevant alignment variable a, the model creates attachments where w is the grammatical head of subsequently produced words. Choices about whether to create noun modifiers or prepositional modifiers are made in steps (a) and (b). The process chooses values for the alignment variables and then chooses content words, adding connective prepositions in the case of prepositional modifiers. It then chooses to end or submits new word-alignment pairs to be recursively modified.

Each line defines a decision that must be made according to a local probability distribution. For example, Step 1.a defines the probability of aligning a subject word to various objects in the image. The distributions are maximum entropy models, similar to previous work (Angeli et al., 2010), using features described in the next section.7 The induced undirected dependency tree d has an edge between each word and the previously generated word (or the input word w in Steps 2.a.i and 2.a.ii, when no previous word is available). Figure 2 shows a possible output from the process, along with the Bayesian network that encodes what each decision was conditioned on during generation.
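To make the generative process concrete, the sketch below pairs a locally normalized maximum entropy (log-linear) distribution with a stripped-down version of the Figure 3 recursion for the main clause only (Steps 1.a-1.e). The feature functions, weights, and dict-based object representation are illustrative assumptions, not the paper's actual templates, and the real model draws words from the full 2,700-word vocabulary rather than from facet labels.

```python
import math
import random

def maxent(choices, feats, weights):
    """Locally normalized log-linear distribution p(choice) over a finite set."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in feats(c).items()) for c in choices]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

def sample(rng, choices, probs):
    return rng.choices(choices, weights=probs, k=1)[0]

def main_clause(objects, weights, rng=random.Random(0)):
    """Steps 1.a-1.e of Figure 3 with toy alignment and word features."""
    # Step 1.a: subject alignment, scored here by a toy visual salience feature.
    align_feats = lambda o: {"area>0.1": float(o["area"] > 0.1), "bias": 1.0}
    a_s = sample(rng, objects, maxent(objects, align_feats, weights))
    # Steps 1.b-1.c: subject and verb words; facet membership as a stand-in feature.
    def word_feats(obj, facet):
        return lambda w: {"in_facet": float(w in obj["facets"].get(facet, [])), "bias": 1.0}
    vocab = ["women", "girls", "walking", "smiling", "sidewalk"]
    w_s = sample(rng, vocab, maxent(vocab, word_feats(a_s, "Object"), weights))
    w_v = sample(rng, vocab, maxent(vocab, word_feats(a_s, "Doing"), weights))
    # Steps 1.d-1.e: object alignment and word (always included here for simplicity).
    a_o = sample(rng, objects, maxent(objects, align_feats, weights))
    w_o = sample(rng, vocab, maxent(vocab, word_feats(a_o, "Object"), weights))
    return [w_s, w_v, w_o], [a_s["name"], a_s["name"], a_o["name"]]

objects = [{"name": "women", "area": 0.3,
            "facets": {"Object": ["women", "girls"], "Doing": ["walking", "smiling"]}},
           {"name": "sidewalk", "area": 0.5,
            "facets": {"Object": ["sidewalk"], "Doing": ["laying"]}}]
weights = {"area>0.1": 0.5, "in_facet": 2.0, "bias": 0.0}
print(main_clause(objects, weights))
```

In the full model, each of these decisions also conditions on the generation context d_c through conjoined context-identifier features (Section 5), and the noun and prepositional modifiers of Step 2 recurse in the same way.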

Learning. We learn the model from data {(w_i, d_i, I_i) | i = 1 . . . m} containing sentences w_i, dependency trees d_i computed with the Stanford parser (de Marneffe et al., 2006), and images I_i.

1. For a main clause (d, e are optional), select:
   (a) subject alignment a_s from p_a(a).
   (b) subject word w_s from p_n(w | a_s, d_c).
   (c) verb word w_v from p_v(w | a_s, d_c).
   (d) object alignment a_o from p_a(a′ | a_s, w_v, d_c).
   (e) object word w_o from p_n(w | a_o, d_c).
   (f) end with p_stop or go to (2) with (w_s, a_s).
   (g) end with p_stop or go to (2) with (w_v, a_s).
   (h) end with p_stop or go to (2) with (w_o, a_o).

2. For a (word, alignment) pair (w′, a) (a, b are optional):
   (a) if w′ is not a verb: modify w′ with a noun, select:
       i. modifier word w_n from p_n(w | a, d_c).
       ii. end with p_stop or go to (2) with (a_m, w_n).
   (b) modify w′ with a preposition, select:
       i. preposition word w_p:
          if w′ is not a verb: from p_p(w | a, d_c);
          else: from p_p(w | a, w_v, d_c).
       ii. object alignment a_p from p_a(a′ | a, w_p, d_c).
       iii. object word w_n from p_n(w | a_p, d_c).
       iv. end with p_stop or go to (2) with (a_p, w_n).

Figure 3: Generative process for producing words w, alignments a and dependencies d. Each distribution is conditioned on the partially complete path through the generative process, d_c, to establish sentence context. The notation p_stop is shorthand for p_stop(STOP | w, d_c), the stopping distribution.

The dependency trees define the path that was taken through the generative process in Figure 3 and are used to create a Bayesian network for every sentence, like in Figure 2. However, object alignments a_i are latent during learning and we must marginalize over them.

The model is trained to maximize the conditional marginal log-likelihood of the data with regularization:

$$L(\theta) = \sum_i \log \sum_{a} p(a, w_i, d_i \mid I_i; \theta) - r\|\theta\|^2$$

where θ is the set of parameters and r is the regularization coefficient. In essence, we maximize the likelihood of every sentence's observed Bayesian network, while marginalizing over content selection variables we did not observe.

Because the model only includes pairwise dependencies between the hidden alignment variables a, the inference problem is quadratic in the number of objects and non-convex because a is unobserved. We optimize this objective directly with L-BFGS, using the junction-tree algorithm to compute the sum and the gradient.6

Footnote 6: To compute the gradient, we differentiate the recurrence in the junction-tree algorithm by applying the product rule.
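To make the shape of this optimization concrete, the following is a minimal sketch under a toy data layout: it plugs the regularized marginal log-likelihood into scipy's L-BFGS-B, with explicit enumeration of a few candidate alignments standing in for the junction-tree computation and numerical gradients standing in for the differentiated recurrence. The example layout and toy log_joint are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def make_objective(log_joint, examples, r=8.0):
    """Builds -L(theta) = -(sum_i log sum_a p(a, w_i, d_i | I_i; theta)) + r*||theta||^2.
    log_joint(theta, example, a) supplies log p(a, w_i, d_i | I_i; theta); the explicit
    sum over candidate alignments stands in for the paper's junction-tree computation."""
    def neg_L(theta):
        ll = 0.0
        for ex in examples:
            logps = np.array([log_joint(theta, ex, a) for a in ex["alignments"]])
            ll += np.logaddexp.reduce(logps)
        return -(ll - r * float(np.dot(theta, theta)))
    return neg_L

# Toy stand-in: alignments are row indices into a feature matrix, scored by a softmax.
def toy_log_joint(theta, ex, a):
    scores = ex["features"] @ theta
    return scores[a] - np.logaddexp.reduce(scores)

examples = [{"alignments": [0, 1],  # alignments consistent with the observed sentence
             "features": np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])}]
theta_hat = minimize(make_objective(toy_log_joint, examples), np.zeros(2),
                     method="L-BFGS-B").x
print(theta_hat)
```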

Page 5: See No Evil, Say No Evil: Description Generation from ...markyatskar.com/publications/StarSem2014-SeeNoEvil.pdf · See No Evil, Say No Evil: ... tive sentences from densely annotated

Inference. To describe an image, we need to maximize over the word, alignment, and dependency parse variables:

$$\arg\max_{w, a, d} \; p(w, a, d \mid I)$$

This computation is intractable because we need to consider all possible sentences, so we use beam search for strings up to a fixed length.

Reranking. Generating directly from the process in Figure 3 results in sentences that may be short and repetitive because the model score is a product of locally normalized distributions. The reranker takes as input a candidate list c for an image I, as decoded from the generative model. The candidate list includes the top-k scoring hypotheses for each sentence length up to a fixed maximum. A linear scoring function is used for reranking, optimized with MERT (Och, 2003) to maximize BLEU-2.
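As a rough illustration of the decoding step, the sketch below is a generic beam search that keeps the best partial hypotheses at every length, which matches the per-length candidate lists handed to the reranker; the toy extend and score functions (unigram log-probabilities) are assumptions, not the actual model.

```python
import heapq

def beam_search(initial, extend, score, beam_size=10, max_len=15):
    """Generic beam search: keep the beam_size best partial hypotheses at each
    length and return one candidate list per length, the shape of input the
    reranker consumes."""
    beams = [[initial]]
    for _ in range(max_len):
        candidates = [new for hyp in beams[-1] for new in extend(hyp)]
        if not candidates:
            break
        beams.append(heapq.nlargest(beam_size, candidates, key=score))
    return beams

# Toy usage: hypotheses are word tuples scored by assumed unigram log-probs.
logp = {"three": -1.0, "girls": -1.2, "walking": -0.8, "sidewalk": -1.5}
candidate_lists = beam_search((), lambda h: [h + (w,) for w in logp],
                              lambda h: sum(logp[w] for w in h),
                              beam_size=3, max_len=4)
print(candidate_lists[-1])
```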

5 Features

We construct indicator features to capture variation in usage in different parts of the sentence, types of objects that are mentioned, visual salience, and semantic and visual coordination between objects. The features are included in the maximum entropy models used to parameterize the distributions described in Figure 3. Whenever possible, we use WordNet synsets (Miller, 1995) instead of lexical features to limit over-fitting.

Features in the generative model use tests for local properties, such as the identity of a synset of a word in WordNet, conjoined with an identifier that indicates context in the generative process.7 Generative model features indicate (1) visual and semantic information about objects in distributions over alignments (content selection) and (2) preferences for referring to objects in distributions over words (content realization). Features in the reranking model indicate global properties of candidate sentences. Exact formulas for computing the features are in the appendix.

Visual features, such as an object's position in the image, are used for content selection. Pairwise visual information between two objects, for example the bounding box overlap between objects or the relative position of the two objects, is included in distributions where selection of an alignment variable conditions on previously generated alignments. For verbs (Step 1.d in Figure 3) and prepositions (Step 2.b.ii), these features are conjoined with the stem of the connective.

Footnote 7: For example, in Figure 2 the context for the word "sidewalk" would be "word, syntactic-object, verb, preposition," indicating it is a word, in the syntactic object of a preposition, which was attached to a verb-modifying prepositional phrase.

Semantic types of objects are also used in content selection. We define semantic types by finding synsets of labels in objects that correspond to high-level types, a list motivated by the animacy hierarchy (Zaenen et al., 2004).8 Type features indicate the type of the object referred to by an alignment variable, as well as the cross product of types when an alignment variable is on the conditioning side of a distribution (e.g., Step 1.d). As above, in the presence of a connective word, these features are conjoined with the stem of the connective.

Content realization features help select words when conditioning on chosen alignments (e.g., Step 1.b). These features include the identity of the WordNet synset corresponding to a word, the word's depth in the synset hierarchy, the language model score for adding that word,9 and whether the word matches labels in facets corresponding to the object referenced by an alignment variable.
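The sketch below illustrates the flavor of these word-level cues with NLTK's WordNet interface and Porter2 stemmer (it requires the WordNet data to be downloaded); the feature names and exact tests are illustrative, not the paper's feature templates.

```python
from nltk.corpus import wordnet as wn            # needs nltk.download('wordnet')
from nltk.stem.snowball import SnowballStemmer   # "english" is the Porter2 stemmer

stem = SnowballStemmer("english").stem

def realization_features(word, obj_facets):
    """Illustrative word-level cues: first-synset identity, depth in the
    hypernym hierarchy, and which facets of the aligned object the word matches."""
    feats = {}
    synsets = wn.synsets(word)
    if synsets:
        feats["synset=" + synsets[0].name()] = 1.0
        feats["synset_depth"] = float(synsets[0].min_depth())
    for facet, labels in obj_facets.items():
        if any(stem(word) == stem(tok) for label in labels for tok in label.split()):
            feats["match_facet=" + facet] = 1.0
    return feats

print(realization_features("girls", {"Object": ["women", "girls", "models"],
                                     "Doing": ["smiling", "walking"]}))
```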

Reranking features are primarily used to overcome issues of repetition and length in the generative distributions, which are better suited to alignment than to creating sentences. We use only four features: length, the number of repetitions, generative model score, and language model score.
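A minimal sketch of such a linear reranker is shown below; the weight values are placeholders for what MERT would produce, and the candidate tuples are invented for illustration.

```python
from collections import Counter

def rerank_score(words, gen_logprob, lm_logprob, weights):
    """Linear score over the four global features named above; real weights
    would come from MERT, the ones below are placeholders."""
    feats = {
        "length": float(len(words)),
        "repetitions": float(sum(c - 1 for c in Counter(words).values())),
        "gen_score": gen_logprob,
        "lm_score": lm_logprob,
    }
    return sum(weights[k] * v for k, v in feats.items())

# Hypothetical candidates: (tokens, generative model log-prob, LM log-prob).
candidates = [
    ("three girls walking".split(), -4.1, -6.0),
    ("three girls girls walking down the sidewalk".split(), -5.0, -7.5),
]
weights = {"length": 0.3, "repetitions": -1.0, "gen_score": 1.0, "lm_score": 0.5}
best = max(candidates, key=lambda c: rerank_score(c[0], c[1], c[2], weights))
print(" ".join(best[0]))
```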

6 Experimental Setup

Data. We used 70% of the data for training (1,750 sentences, 350 images), 15% for development, and 15% for testing (375 sentences, 75 images).

Parameters. The regularization parameter was set on the held-out data to r = 8. The reranker candidate list included the top 500 sentences for each sentence length up to 15, and weights were optimized with Z-MERT (Zaidan, 2009).

Metrics. Our evaluation is based on BLEU-n (Papineni et al., 2001), which considers all n-grams up to length n. To assess human performance using BLEU, we score each of the five references against the four other ones and finally average the five BLEU scores.

Footnote 8: For example, human, animal, artifact (a human-created object), natural body (trees, water, etc.), or natural artifact (stick, leaf, rock).

Footnote 9: We use tri-grams with Kneser-Ney smoothing over the 1 million caption data set (Ordonez et al., 2011).


In order to make these results comparable to BLEU scores for our model and baselines, we perform the same five-fold averaging when computing BLEU for each system.
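A sketch of this five-fold BLEU protocol using NLTK's corpus_bleu is shown below; the smoothing choice and the toy references are our assumptions.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def five_fold_bleu(hypotheses, references_per_image, n=2):
    """Average of five BLEU-n scores, each computed against a leave-one-out set
    of four references per image.  Smoothing is our assumption; the paper does
    not specify how zero n-gram counts are handled."""
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    scores = []
    for held_out in range(5):
        refs = [[r for j, r in enumerate(refs5) if j != held_out]
                for refs5 in references_per_image]
        scores.append(corpus_bleu(refs, hypotheses, weights=weights,
                                  smoothing_function=smooth))
    return sum(scores) / len(scores)

# Toy usage: one image, five tokenized references, one system output.
refs = [[["three", "girls", "walking"], ["three", "women", "walk"],
         ["young", "women", "walking"], ["girls", "on", "a", "street"],
         ["three", "young", "women", "walking"]]]
hyps = [["three", "girls", "walking", "down", "the", "sidewalk"]]
print(five_fold_bleu(hyps, refs, n=2))
```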

We also compute accuracy for different syntactic positions in the sentence. We look at a number of categories: the main clause's components (S, V, O), prepositional phrase components, namely the preposition (Pp) and their objects (Po), and noun modifying words (N), including determiners. Phrases match if they have an exact string match and share context identifiers as defined in the features section.

Human Evaluation. Annotators rated sentences output by our full model against either human-authored or baseline system generated descriptions. Three criteria were evaluated: grammaticality, which sentence is more complete and well formed; truthfulness, which sentence more accurately captures something true in the image; and salience, which sentence captures important things in the image while still being concise. Two annotators annotated all test pairs for all criteria for a given pair of systems. Six annotators were used (none of them authors) and agreement was high (Cohen's kappa = 0.963, 0.823 and 0.703 for grammar, truth and salience).

Machine Translation Baseline. The first baseline is designed to see if it is possible to generate good sentences from the facet string labels alone, with no visual information. We use an extension of phrase-based machine translation techniques (Och et al., 1999). We created a virtual bitext by pairing each image description (the target sentence) with a sequence10 of visual identifiers (the source "sentence") listing strings from the facet labels. Since phrases produced by turkers lack many of the function words needed to create fluent sentences, we added one of 47 function words either at the start or the end of each output phrase.

The translation model included standard features such as language model score (using our caption language model described previously), word count, phrase count, linear distortion, and the count of deleted source words. We also define three features that count the number of Object, Isa, and Doing phrases, to learn a preference for types of phrases. The feature weights are tuned with MERT (Och, 2003) to maximize BLEU-4.

Midge Baseline. As described in related work, the Midge system creates a set of sentences to describe everything in an input image.

Footnote 10: We defined a consistent ordering of visual identifiers and set the distortion limit of the phrase-based decoder to infinity.

                BL-1   BL-2   BL-3   BL-4
Human           61.0   42.0   27.8   18.3
Full Model      57.1   35.7   18.3    9.5
MT Baseline     39.8   23.6   13.2    6.1
Midge Baseline  43.5   20.2    9.4    0.0

Table 1: Results for the test set for the BLEU-1 through BLEU-4 metrics.

Grammar          Full   Other  Equal
Full vs Human    7.65   19.40  72.94
Full vs MT       6.47    5.29  88.23
Full vs Midge   40.59   15.88  43.53

Truth            Full   Other  Equal
Full vs Human    0.59   67.65  31.76
Full vs MT      30.00   10.59  59.41
Full vs Midge   51.76   27.71  23.53

Salience         Full   Other  Equal
Full vs Human    8.82   88.24   2.94
Full vs MT      51.76   16.47  31.77
Full vs Midge   71.18   14.71  14.12

Table 2: Human evaluation of our Full Model in heads-up tests against human-authored sentences and baseline systems, the machine translation baseline (MT) and the Midge-inspired baseline. Bold indicates the better system. Other refers to the system that is not the Full model. Equal indicates neither sentence is better.

These sentences must all be true, but do not have to select the same content that a person would. It can be adapted to our task by adding object selection and sentence ranking rules. For object selection, we choose the three most frequently named objects in the scene according to a background corpus of image descriptions. For sentence selection, we take all sentences within one word of the average length of a sentence in our corpus (11 words) and select the one with the best Midge generation score.
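The two selection rules can be sketched as follows; midge_score, corpus_name_counts and the candidate sentences are assumed inputs produced by Midge and a background caption corpus.

```python
def adapt_midge(object_names, candidate_sentences, corpus_name_counts,
                midge_score, avg_len=11):
    """Selection rules added on top of Midge output.  corpus_name_counts maps an
    object name to its frequency in a background corpus of image descriptions;
    midge_score scores a sentence; both are assumed inputs."""
    # Object selection: the three most frequently named objects in the scene.
    chosen = sorted(object_names,
                    key=lambda name: -corpus_name_counts.get(name, 0))[:3]
    # Sentence selection: among sentences within one word of the corpus average
    # length, keep the one with the best Midge generation score.
    near_avg = [s for s in candidate_sentences
                if abs(len(s.split()) - avg_len) <= 1]
    return chosen, (max(near_avg, key=midge_score) if near_avg else None)
```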

7 Results

We report experiments for our generation pipeline and ablations that remove data and features.

Overall Performance. Table 1 shows the results on the test set. The full model consistently achieves the highest BLEU scores. Overall, these numbers suggest strong content selection by getting high recall for individual words (BLEU-1), but fall further behind human performance as the length of the n-gram grows (BLEU-2 through BLEU-4). These numbers match our perception that the model is learning to produce high quality sentences, but does not always describe all of the important aspects of the scene or use exactly the expected wording. Table 4 presents example output, which we will discuss in more detail shortly.


Model          BL-1  BL-2  BL-3  BL-4     S     V     O    Pp    Po     N
Human          64.7  46.0  31.5  20.1     -     -     -     -     -     -
Full-Model     59.0  36.9  19.3  10.5  64.9  40.4  36.8  50.0  20.7  69.1
– doing        51.1  32.6  16.9   9.2  63.2  15.8  10.5  45.5  21.6  69.7
– count        55.4  33.5  16.0   8.5  59.6  35.1  15.4  53.7  19.5  66.7
– properties   57.8  37.2  18.8  10.0  61.4  36.8  36.8  47.1  20.7  73.5
– visual       56.7  35.1  18.9   9.4  64.9  36.8  50.0  41.8  15.3  71.6
– pairwise     56.9  35.5  16.5   8.2  64.9  40.4  45.5  42.4  21.2  70.9

Table 3: Ablation results on development data using BLEU-1 through BLEU-4 and reporting match accuracy for sentence structures.

S: A girl playing a guitar in the grass
R: A woman with a nylon stringed guitar is playing in a field

S: A man playing with two dogs in the water
R: A man is throwing a log into a waterway while two dogs watch

S: Two men playing with a bench in the grass
R: Nine men are playing a game in the park, shirts versus skins

S: Three kids sitting on a road
R: A boy runs in a race while onlookers watch

Table 4: Two good examples of output (top), and two examples of poor performance (bottom). Each image has two captions, the system output S and a human reference R.

Human Evaluation. Table 2 presents the results of a human evaluation. The full model outperforms all baselines on every measure, but is not always competitive with human descriptions. It performs the best on grammaticality, where it is judged to be as grammatical as humans. However, surprisingly, it is also often judged equal to the baselines. Examination of baseline output reveals that the MT baseline often generates short sentences, having little chance of being judged ungrammatical. Furthermore, the Midge baseline, like our system, is a syntax-based system and therefore often produces grammatical sentences. Although our system performs well with respect to the baselines on truthfulness, the system often constructs sentences with incorrect prepositions, an issue that could be improved with better estimates of 3-d position in the image. On truthfulness, the MT baseline is comparable to our system, often being judged equal, because its output is short. Our system's strength is salience, a factor the baselines do not model.

Data Ablation. Table 3 shows annotation ablation experiments on the development set, where we remove different classes of data labels to measure the performance that can be achieved with less visual information. In all cases, the overall behavior of the system varies, as it tries to learn to compensate for the missing information.

Ablating actions is by far the most detrimental. Overall BLEU score suffers, and prediction accuracy of the verb (V) degrades significantly, causing cascading errors that affect the object of the verb (O). Removing count information affects noun attachment (N) performance. Images where determiner use is important, or where groups of objects are best identified by their number (for example, three dogs), are difficult to describe naturally. Finally, we see a tradeoff when removing properties. There is an increase in noun modifier accuracy (N) but a decrease in content selection quality (BL-1), showing recall has gone down. In essence, the approach learns to stop trying to generate adjectives and other modifiers that would rely on the missing properties. The difference in BLEU score with the Full-Model is small, even without these modifiers, because there often still exists a short output with high accuracy.

Feature Ablation. The bottom two rows in Table 3 show ablations of the visual and pairwise features, measuring the contribution of the visual information provided by the bounding box annotations. The ablated visual information includes bounding-box positions and relative pairwise visual information. The pairwise ablation removes the ability to model any interactions between objects, for example, relative bounding box or pairwise object type information.

Overall, prepositional phrase accuracy is most affected. Ablating visual features significantly impacts accuracy of prepositional phrases (Pp and Po), affecting the use of preposition words the most, and lowering fluency (BL-4).


Precision in the object of the verb (O) rises; the model makes ~50% fewer predictions in that position than the Full-Model because it lacks features to coordinate the subject and object of the verb. Ablating pairwise features has similar results. While the model corrects errors in the object of the preposition (Po) with the addition of visual features, fluency is still worse than the Full-Model, as reflected by BL-4.

Qualitative Results. Table 4 has examples of good and bad system output. The first two images are good examples, including both system output (S) and a human reference (R). The second two contain lower quality outputs. Overall, the model captures common ways to refer to people and scenes. However, it does better for images with fewer sentient objects because content selection is less ambiguous.

Our system does well at finding important objects. For example, in the first good image, we mention the guitar instead of the house, both of which are prominent and have high overlap with the woman. In the second case, we identify that both dogs and humans tend to be important actors in scenes but poorly identify their relationship.

The bad examples show difficult scenes. In the first description the broad context is not identified; the system instead focuses on the bench (highlighted in red). The second example identifies a weakness in our annotation: it encodes contradictory groupings of the people. The groupings cover all of the children, including the boy running, and many subsets of the people near the grass. This causes ambiguity and our methods cannot differentiate them, incorrectly mentioning just the children and picking an inappropriate verb (one participant in the group is not sitting). Improved annotation of groups would enable the study of generation for more complex scenes, such as these.

8 Conclusion

In this work we used dense annotations of images to study description generation. The annotations allowed us not only to develop new models, better capable of generating human-like sentences, but also to explore what visual information is crucial for description generation. Experiments showed that activity and bounding-box information is important and demonstrated areas of future work. In images that are more complex, for example with multiple sentient objects, object grouping and reference will be important to generating good descriptions.

Issues of this type can be explored with annotations of increasing complexity.

Appendix A

This appendix describes the feature templates for the generative model in greater detail.

Features in the generative model conjoin indicators for local tests, such as STEM(w), which indicates the stem of a word w, with a global contextual identifier CONTEXT(v, d) that indicates properties of the generation history, as described in detail below. Table 5 provides a reference for which feature templates are used in the generative model distributions, as defined in Figure 3.

8.1 Feature Templates

CONTEXT(n, d) is an indicator for a contextual identifier for a variable n in the model, depending on the dependency structure d. There is an indicator for all combinations of the type of n (alignment or word), the position of n (subject, syntactic object, verb, noun-modifier, or preposition), the position of the earliest variable along the path to generate n, and the type of attachment to that variable (noun or prepositional modifier). For example, in Figure 2 the context for the word "sidewalk" would be "word, syntactic-object, verb, preposition," indicating it is a word, the object of a preposition, whose path was along a verb-modifying prepositional phrase.11

TYPE(a) indicates the high-level type of an object referred to by alignment variable a. We use synsets to define high-level types including human, animal, artifact, natural artifact and various synsets that capture scene information,12 a list motivated by the animacy hierarchy (Zaenen et al., 2004). Each object is assigned a type by finding the synset for its name (Object facet) and tracing the hypernym structure in WordNet to find the appropriate class, if one exists. Additionally, the type indicates whether the object is a group or not. For example, in Figure 2, the blue polygon has type "person, group," and the red bike polygon has type "artifact, single."
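A sketch of this kind of type assignment with NLTK's WordNet interface is shown below (it requires the WordNet data); the particular synset list is an illustrative subset, not the exact one used in the paper.

```python
from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

# An illustrative subset of the high-level type synsets described above.
TYPE_SYNSETS = [("human", wn.synset("person.n.01")),
                ("animal", wn.synset("animal.n.01")),
                ("natural_object", wn.synset("natural_object.n.01")),
                ("artifact", wn.synset("artifact.n.01"))]

def object_type(object_name, count):
    """A TYPE(a)-style label: trace the hypernyms of the object's name to a
    high-level class and record whether the object is a group."""
    group = "group" if count > 1 else "single"
    for synset in wn.synsets(object_name, pos=wn.NOUN):
        ancestors = set(synset.closure(lambda s: s.hypernyms())) | {synset}
        for label, type_synset in TYPE_SYNSETS:
            if type_synset in ancestors:
                return label + "," + group
    return "other," + group

# Expected with standard WordNet data: "human,group" and "artifact,single".
print(object_type("women", 3), object_type("bike", 1))
```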

Footnote 11: Similarly, "large" is "word, noun, subject, preposition," while "girls" is special-cased to "word, subject, root" because it has no initial attachment. The alignment variable above the word "handbags" is "alignment, syntactic-object, subject, preposition" because it is an alignment variable, is in the syntactic object position of a preposition, and can be located by following a subject-attached PP.

Footnote 12: WordNet divides these into synsets expressing water, weather, nature and a few more.


Feature Family | Included In | Steps

CONTEXT(a′, d_c) ⊗ {TYPE(a′), MENTION(a′, do), MENTION(a′, obj), VISUAL(a′)} | p_a(a′ | d_c), p_a(a′ | a, w, d_c) | 1.a, 1.d, 2.b.ii

CONTEXT(a′, d_c) ⊗ {TYPE(a) ⊗ TYPE(a′), VISUAL2(a, a′)} | p_a(a′ | a, w, d_c) | 1.d, 2.b.ii

CONTEXT(a′, d_c) ⊗ {TYPE(a) ⊗ TYPE(a′) ⊗ STEM(w), VISUAL2(a, a′) ⊗ STEM(w)} | p_a(a′ | a, w, d_c) | 1.d, 2.b.ii

CONTEXT(a, d_c) ⊗ {WORDNET(w), MATCH(w, a), SPECIFICITY(w, a), ADJECTIVE(w, a), DETERMINER(w, a)} | p_n(w | a, d_c) | 1.b, 1.e, 2.a.i, 2.b.iii

CONTEXT(a, d_c) ⊗ {MATCH(w, a), TYPE(a) ⊗ STEM(w)} | p_v(w | a, d_c) | 1.c

CONTEXT(a′, d_c) ⊗ TYPE(a) ⊗ STEM(w_p) | p_p(w | a, d_c), p_p(w | a, w_v, d_c) | 2.b.i

CONTEXT(a′, d_c) ⊗ STEM(w_v) ⊗ STEM(w) | p_p(w | a, w_v, d_c) | 2.b.i

Table 5: Feature families and the distributions that include them. ⊗ indicates the cross-product of the indicator features. Distributions are listed more than once to indicate that they use multiple feature families. Steps refer to Figure 3.

VISUAL(a) returns indicators for visual facts about the object that a aligns to. There is an indicator for two quantities: (1) overlap of the object's polygon with every horizontal third of the image, as a fraction of the object's area, and (2) the object's distance to the center of the image as a fraction of the diagonal of the image. Each quantity v is put into three overlapping buckets: if v > .1, if v > .5, and if v > .9.

VISUAL2(a, a′) indicates pairwise visual facts about two objects. There is an indicator for the following quantities, bucketed: the amount of overlap between the polygons for a and a′ as a fraction of the size of a's polygon, the distance between the centers of the polygons for a and a′ as a fraction of the image's diagonal, and the slope between the centers of a and a′. Each quantity v is put into three overlapping buckets: if v > .1, if v > .5, and if v > .9. There is an indicator for the relative position of the extremities of a and a′: whether the rightmost point of a is further right than a′'s rightmost or leftmost point, and the same for top, left, and bottom.
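The overlapping-bucket construction can be sketched as follows; the geometry (polygon areas per image third, center distance) is assumed to be precomputed by the caller.

```python
def buckets(name, v):
    """Overlapping indicator buckets used for each visual quantity."""
    return {f"{name}>{t}": 1.0 for t in (0.1, 0.5, 0.9) if v > t}

def visual_features(area_in_thirds, obj_area, center_dist, img_diag):
    """VISUAL(a)-style indicators from precomputed geometry: overlap of the
    object's polygon with each horizontal third of the image (as a fraction of
    the object's area) and its distance to the image center (as a fraction of
    the diagonal)."""
    feats = {}
    for i, overlap in enumerate(area_in_thirds):
        feats.update(buckets(f"third{i}_overlap", overlap / obj_area))
    feats.update(buckets("center_dist", center_dist / img_diag))
    return feats

# Toy call with assumed geometry values.
print(visual_features([10.0, 80.0, 10.0], 100.0, 120.0, 400.0))
```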

WORDNET(w) returns indicators for all hypernyms of a word w. The two most specific synsets are not used when there are at least 8 options.

MENTION(a, facet) returns the union of the WORDNET(w) features for all words w in the facet facet of the object referred to by alignment a.

ADJECTIVE(w, a) indicates four types of features specific to adjective usage. If MENTION(w, Attributes) is not empty, indicate: (1) the satellite adjective synset of w in WordNet, (2) the head adjective synset of w in WordNet, (3) the head adjective synset conjoined with TYPE(a), and (4) the number of times there exists a label in the Attributes facet of a that has the same head adjective synset as w.

DETERMINER(w, a) indicates four determiner-specific features. If w is a determiner, then indicate: (1) the identity of w conjoined with the count (the label for numerosity) of a, (2) the identity of w conjoined with an indicator for whether the count of a is greater than one, (3) the identity of w conjoined with TYPE(a), and (4) the frequency with which w appears before its head word in the Flickr corpus (Ordonez et al., 2011).

MATCH(w, a) indicates all facets of object a that contain words with the same stem as w.

SPECIFICITY(w, a) is an indicator of the specificity of the word w when referring to the object aligned to a. It indicates the relative depth of w in WordNet, as compared to all words w′ where MATCH(w′, a) is not empty. The depth is bucketed into quintiles.

STEM(w) returns the Porter2 stem of w.13

The distribution for stopping, p_stop(STOP | d_c, w), contains two types of features: (1) structural features indicating the number of times a contextual identifier has appeared so far in the derivation, and (2) mention features indicating the types of objects mentioned.14 To compute mention features, we consider all possible types of objects t; then there is an indicator for: (1) if ∃o, ∃w ∈ w: MATCH(w, o) ≠ ∅ ∧ TYPE(o) = t; (2) whether ∃o, ∄w ∈ w: MATCH(w, o) ≠ ∅ ∧ TYPE(o) = t; and (3) if (1) does not hold but (2) does.

Acknowledgments: This work is partially funded by DARPA CSSG (D11AP00277) and ARO (W911NF-12-1-0197). We thank L. Zitnick, B. Dolan, M. Mitchell, C. Quirk, A. Farhadi, and B. Russell for helpful conversations. Also, L. Zilles, Y. Artzi, N. FitzGerald, T. Kwiatkowski and the reviewers for comments.

Footnote 13: http://snowball.tartarus.org/algorithms/english/stemmer.html
Footnote 14: Object mention features cannot contain a because that creates large dependencies in inference for learning.


References

Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Empirical Methods in Natural Language Processing (EMNLP).

Chris Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In EMNLP, pages 286–295, August.

David L. Chen, Joohyun Kim, and Raymond J. Mooney. 2010. Training a multilingual sportscaster: Using perceptual context to learn language. JAIR, 37:397–435.

Marie-Catherine de Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In LREC, volume 6, pages 449–454.

Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In EMNLP.

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In Computer Vision and Pattern Recognition (CVPR 2009), pages 1778–1785. IEEE.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the 11th European Conference on Computer Vision (ECCV'10), pages 15–29.

Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645.

Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth? Automatic caption generation for news images. In ACL, pages 1239–1249.

Nicholas FitzGerald, Yoav Artzi, and Luke Zettlemoyer. 2013. Learning distributions over logical forms for referring expression generation. In EMNLP.

Ankush Gupta and Prashanth Mannem. 2012. From image annotation to image description. In NIPS, volume 7667, pages 196–204.

Ioannis Konstas and Mirella Lapata. 2012. Concept-to-text generation via discriminative reranking. In ACL, pages 369–378.

Emiel Krahmer and Kees Van Deemter. 2012. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218.

Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. Proceedings of AAAI, 2013(2):3.

G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Computer Vision and Pattern Recognition (CVPR), pages 1601–1608.

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. 2012. Collective generation of natural image descriptions. In ACL, pages 359–368.

Li-Jia Li and Li Fei-Fei. 2007. What, where and who? Classifying events by scene and object recognition. In ICCV, pages 1–8. IEEE.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In EACL, pages 747–756.

Margaret Mitchell, Kees van Deemter, and Ehud Reiter. 2013. Generating expressions that refer to visible objects. In Proceedings of NAACL-HLT, pages 1174–1184.

F. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160–167.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In NIPS, pages 1143–1151.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.

C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147.

Gaurav Sharma, Frederic Jurie, Cordelia Schmid, et al. 2013. Expanded parts model for human attribute and action recognition in still images. In CVPR.

Carina Silberer and Mirella Lapata. 2012. Grounded models of semantic representation. In EMNLP, July.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In ACL, pages 572–582.

Antonio Torralba, Bryan C. Russell, and Jenny Yuen. 2010. LabelMe: Online image annotation and applications. Proceedings of the IEEE, 98(8):1467–1484.

Daniel Weinland, Remi Ronfard, and Edmond Boyer. 2011. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241.

Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Empirical Methods in Natural Language Processing.

Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas J. Guibas, and Li Fei-Fei. 2011. Action recognition by learning bases of action attributes and parts. In ICCV, Barcelona, Spain, November.

Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, volume 1, pages 53–63.

Annie Zaenen, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-Garboden, Tatiana Nikitina, M. Catherine O'Connor, and Tom Wasow. 2004. Animacy encoding in English: Why and how. In ACL Workshop on Discourse Annotation, pages 118–125.

Omar F. Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.

C. Lawrence Zitnick and Devi Parikh. 2013. Bringing semantics into focus using visual abstraction. In CVPR.