

Proceedings of NAACL-HLT 2018, pages 770–775, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

Punny Captions: Witty Wordplay in Image Descriptions

Arjun Chandrasekaran¹  Devi Parikh¹,²  Mohit Bansal³
¹Georgia Institute of Technology  ²Facebook AI Research  ³UNC Chapel Hill

{carjun, parikh}@gatech.edu [email protected]

Abstract

Wit is a form of rich interaction that is often grounded in a specific situation (e.g., a comment in response to an event). In this work, we attempt to build computational models that can produce witty descriptions for a given image. Inspired by a cognitive account of humor appreciation, we employ linguistic wordplay, specifically puns, in image descriptions. We develop two approaches which involve retrieving witty descriptions for a given image from a large corpus of sentences, or generating them via an encoder-decoder neural network architecture. We compare our approach against meaningful baseline approaches via human studies and show substantial improvements. We find that when a human is subject to similar constraints as the model regarding word usage and style, people vote the image descriptions generated by our model to be slightly wittier than human-written witty descriptions. Unsurprisingly, humans are almost always wittier than the model when they are free to choose the vocabulary, style, etc.

1 Introduction
“Wit is the sudden marriage of ideas which before their union were not perceived to have any relation.” – Mark Twain. Witty remarks are often contextual, i.e., grounded in a specific situation. Developing computational models that can emulate rich forms of interaction like contextual humor is a crucial step towards making human-AI interaction more natural and more engaging (Yu et al., 2016). E.g., witty chatbots could help relieve stress and increase user engagement by being more personable and human-like. Bots could automatically post witty comments (or suggest witty responses) on social media, chat, or messaging.

The absence of large-scale corpora of witty captions and the prohibitive cost of collecting such a dataset (being witty is harder than just describing

(a) Generated: a poll (pole) on a city street at night.
Retrieved: the light knight (night) chuckled.
Human: the knight (night) in shining armor drove away.

(b) Generated: a bare (bear) black bear walking through a forest.
Retrieved: another reporter is standing in a bare (bear) brown field.
Human: the bear killed the lion with its bare (bear) hands.

Figure 1: Sample images and witty descriptions from 2 models, and a human. The words inside ‘()’ (e.g., pole and bear) are the puns associated with the image, i.e., the source of the unexpected puns used in the caption (e.g., poll and bare).

an image) makes the problem of producing contextually witty image descriptions challenging.

In this work, we attempt to tackle the challenging task of producing witty (pun-based) remarks for a given (possibly boring) image. Our approach is inspired by a two-stage cognitive account of humor appreciation (Suls, 1972) which states that a perceiver experiences humor when a stimulus such as a joke, captioned cartoon, etc., causes an incongruity, which is shortly followed by resolution.

We introduce an incongruity in the perceiver’s mind while describing an image by using an unexpected word that is phonetically similar (a pun) to a concept related to the image. E.g., in Fig. 1b, the expectations of a perceiver regarding the image (bear, stones, etc.) are momentarily disconfirmed by the (phonetically similar) word ‘bare’. This incongruity is resolved when the perceiver parses the entire image description. The incongruity followed by resolution can be perceived to be witty.¹

¹Indeed, a perceiver may fail to appreciate wit if the process of ‘solving’ (resolution) is trivial (the joke is obvious) or too complex (they do not ‘get’ the joke).



We build two computational models based on this approach to produce witty descriptions for an image. First, a model that retrieves sentences containing a pun that are relevant to the image from a large corpus of stories (Zhu et al., 2015). Second, a model that generates witty descriptions for an image using a modified inference procedure during image captioning which includes the specified pun word in the description.

Our paper makes the following contributions: To the best of our knowledge, this is the first work that tackles the challenging problem of producing a witty natural language remark in an everyday (boring) context. We present two novel models to produce witty (pun-based) captions for a novel (likely boring) image. Our models rely on linguistic wordplay: they use an unexpected pun in an image description during inference/retrieval, and thus do not require training on witty captions. Humans vote the descriptions from the top-ranked generated captions ‘wittier’ than three baseline approaches. Moreover, in a Turing test-style evaluation, our model’s best image description is found to be wittier than a witty human-written caption² 55% of the time when the human is subject to the same constraints as the machine regarding word usage and style.

2 Related Work
Humor theory. The General Theory of Verbal Humor (Attardo and Raskin, 1991) characterizes linguistic stimuli that induce humor, but implementing computational models of it requires severely restricting its assumptions (Binsted, 1996).
Puns. Zwicky and Zwicky (1986) classify puns as perfect (pronounced exactly the same) or imperfect (pronounced differently). Similarly, Pepicello and Green (1984) categorize riddles based on the linguistic ambiguity that they exploit – phonological, morphological or syntactic. Jaech et al. (2016) learn phone-edit distances to predict the counterpart of a given pun by drawing on automatic speech recognition techniques. In contrast, we augment a web-scraped list of puns using an existing model of pronunciation similarity.
Generating textual humor. JAPE (Binsted and Ritchie, 1997) also uses phonological ambiguity to generate pun-based riddles. While our task involves producing free-form responses to a novel stimulus, JAPE produces stand-alone “canned” jokes. HAHAcronym (Stock and Strapparava, 2005) generates a funny expansion of a given acronym. Unlike our work, HAHAcronym operates on text, and is limited to producing sets of words. Petrovic and Matthews (2013) develop an unsupervised model that produces jokes of the form, “I like my X like I like my Y, Z”.
Generating multi-modal humor. Wang and Wen (2015) predict a meme’s text based on a given funny image. Similarly, Shahaf et al. (2015) and Radev et al. (2015) learn to rank cartoon captions based on their funniness. Unlike the typical, boring images in our task, memes and cartoons are images that are already funny or atypical. E.g., “LOL-cats” (funny cat photos), “Bieber-memes” (modified pictures of Justin Bieber), cartoons with talking animals, etc. Chandrasekaran et al. (2016) alter an abstract scene to make it more funny. In comparison, our task is to generate witty natural language remarks for a novel image.
Poetry generation. Although our tasks are different, our generation approach is conceptually similar to that of Ghazvininejad et al. (2016), who produce poetry given a topic. While they also generate and score a set of candidates, their approach involves many more constraints and utilizes a finite state acceptor, unlike our approach, which enforces constraints during beam search of the RNN decoder.

²This data is available on the author’s webpage.

3 Approach
Extracting tags. The first step in producing a contextually witty remark is to identify concepts that are relevant to the context (the image). At times, these concepts are directly available, e.g., as tags posted on social media. We consider the general case where such tags are unavailable, and automatically extract tags associated with an image.

We extract the top-5 object categories predicted by a state-of-the-art Inception-ResNet-v2 model (Szegedy et al., 2017) trained for image classification on ImageNet (Deng et al., 2009). We also consider the words from a (boring) image description (generated by the model of Vinyals et al. (2016)). We combine the classifier object labels and words from the caption (ignoring stopwords) to produce a set of tags associated with an image, as shown in Fig. 2. We then identify concepts from this collection that can potentially induce wit.
Identifying puns. We attempt to induce an incongruity by using a pun in the image description. We identify candidate words for linguistic wordplay
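The tag-extraction step can be sketched as follows. This is an illustrative sketch, not the authors' code: `extract_tags`, the stopword set, and the example inputs are all hypothetical stand-ins for the Inception-ResNet-v2 labels and the Show-and-Tell caption used in the paper.

```python
# Illustrative sketch: combine the classifier's object labels with the
# non-stopword tokens of an automatically generated caption.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "is", "are"}

def extract_tags(classifier_labels, caption):
    """Return the union of object labels and content words from a caption."""
    tokens = [w.strip(".,").lower() for w in caption.split()]
    content = {w for w in tokens if w and w not in STOPWORDS}
    return set(classifier_labels) | content

tags = extract_tags(
    ["man", "cell phone"],
    "a group of people standing on the side of a street",
)
# tags contains content words like 'people' and 'street', but no stopwords
```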


by comparing image tags against a list of puns.

Figure 2: Our models for generating and retrieving image descriptions containing a pun (see Sec. 3). [Figure: image tags from the classifier and caption are filtered against the pun list; forward (fRNN) and reverse (rRNN) decoders generate captions around a pun word (e.g., ‘sell’), alongside sentences from the retrieval corpus.]

We construct the list of puns by mining the web for differently spelled words that sound exactly the same (heterographic homophones). We increase coverage by also considering pairs of words with 0 edit-distance, according to a metric based on fine-grained articulatory representations (AR) of word pronunciations (Jyothi and Livescu, 2014). Our list of puns has a total of 1067 unique words (931 from the web and 136 from the AR-based model).
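The resulting pun list can be thought of as a symmetric lookup between homophone spellings. A minimal sketch, using example pairs taken from the paper's figures (the authors' full list has 1067 unique words):

```python
# Sketch of the pun list as a symmetric homophone lookup.
from collections import defaultdict

HOMOPHONE_PAIRS = [("sell", "cell"), ("bare", "bear"),
                   ("night", "knight"), ("pole", "poll")]

def build_pun_lookup(pairs):
    """Map each spelling to the set of its phonetically identical counterparts."""
    lookup = defaultdict(set)
    for a, b in pairs:
        lookup[a].add(b)   # symmetric: each spelling maps to the other
        lookup[b].add(a)
    return dict(lookup)

PUN_LOOKUP = build_pun_lookup(HOMOPHONE_PAIRS)
```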

The pun list yields a set of puns that are associated with a given image and their phonologically identical counterparts, which together form the pun vocabulary for the image. We evaluate our approach on the subset of images that have non-empty pun vocabularies (about 2 in 5 images).
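Computing the pun vocabulary from the image tags is then a simple intersection with the pun list. A sketch with illustrative data (the function name and toy lookup are assumptions, not the authors' code):

```python
# Sketch: an image's pun vocabulary is every tag found in the pun list,
# together with the counterparts of those tags.
PUN_LOOKUP = {"sell": {"cell"}, "cell": {"sell"},
              "bare": {"bear"}, "bear": {"bare"}}

def pun_vocabulary(tags, lookup):
    vocab = set()
    for tag in tags:
        if tag in lookup:
            vocab.add(tag)            # the pun associated with the image
            vocab |= lookup[tag]      # its phonological counterpart(s)
    return vocab

vocab = pun_vocabulary({"man", "cell", "street"}, PUN_LOOKUP)
# images whose pun vocabulary is empty are skipped
```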

Generating punny image captions. We introduce an incongruity by forcing a vanilla image captioning model (Vinyals et al., 2016) to decode a phonological counterpart of a pun word associated with the image at a specific time-step during inference (e.g., ‘sell’ or ‘sighed’, shown in orange in Fig. 2). We achieve this by limiting the vocabulary of the decoder at that time-step to only contain counterparts of image-puns. In the following time-steps, the decoder generates new words conditioned on all previously decoded words. Thus, the decoder attempts to generate sentences that flow well based on previously uttered words.

We train two models that decode an image description in the forward (start to end) and reverse (end to start) directions, depicted as ‘fRNN’ and ‘rRNN’ in Fig. 2 respectively. The fRNN can decode words after accounting for the incongruity that occurs early in the sentence, and the rRNN is able to decode the early words in the sentence after accounting for the incongruity that can occur later. The forward RNN and reverse RNN generate sentences in which the pun appears in each of the first T and last T positions, respectively.³
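The size of the resulting candidate pool (per footnote 3: T ∈ {1, ..., 5}, beam size 6, two decoding directions) can be sketched as simple bookkeeping; the function below is illustrative only and just enumerates candidate slots.

```python
# Bookkeeping sketch of the candidate pool: each decoder direction
# contributes beam_size captions for every pun position T in 1..max_T,
# giving 5 * 6 * 2 = 60 candidates.
from itertools import product

def candidate_pool(max_T=5, beam_size=6, directions=("fRNN", "rRNN")):
    """Enumerate (direction, pun position, beam index) candidate slots."""
    return [(d, T, b) for d, T, b in
            product(directions, range(1, max_T + 1), range(beam_size))]

pool = candidate_pool()
```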

Retrieving punny image captions. As an alternative to our approach of generating witty remarks for the given image, we also attempt to leverage natural, human-written sentences which are relevant (yet unexpected) in the given context. Concretely, we retrieve natural language sentences⁴ from a combination of the Book Corpus (Zhu et al., 2015) and corpora from the NLTK toolkit (Loper and Bird, 2002). Each retrieved sentence (a) contains an incongruity (pun) whose counterpart is associated with the image, and (b) has support in the image (contains an image tag). This yields a pool of candidate captions that are perfectly grammatical, a little unexpected, and somewhat relevant to the image (see Sec. 4).
Ranking. We rank captions in the candidate pools from both the generation and retrieval models according to their log-probability score under the image captioning model. We observe that the higher-ranked descriptions are more relevant to the image and grammatically correct. We then perform non-maximal suppression, i.e., eliminate captions that are similar⁵ to a higher-ranked caption, to reduce the pool to a smaller, more diverse set. We report results on the 3 top-ranked captions. We describe the effect of design choices in the supplementary.
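The retrieval filter can be sketched as below. The toy corpus and names are assumptions standing in for the ~13.5M Book Corpus / NLTK sentences; a kept sentence must be short, contain a pun whose counterpart is an image tag (the incongruity), and itself contain an image tag (the support).

```python
# Sketch of the retrieval filter over a toy corpus.
def retrieve(corpus, tags, pun_lookup, max_words=15):
    kept = []
    for sent in corpus:
        tokens = [w.strip('.,?!"') for w in sent.lower().split()]
        if len(tokens) >= max_words:      # keep sentences short
            continue
        words = {t for t in tokens if t}
        # (a) incongruity: a pun whose counterpart is an image tag
        has_pun = any(w in pun_lookup and pun_lookup[w] & tags for w in words)
        # (b) support: the sentence mentions an image tag directly
        has_support = bool(words & tags)
        if has_pun and has_support:
            kept.append(sent)
    return kept

candidates = retrieve(
    ["you sell phones ?",
     "the street signs read thirty-eighth and eighth avenue ."],
    tags={"cell", "phone", "phones", "people"},
    pun_lookup={"sell": {"cell"}},
)
```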

³For an image, we choose T = {1, 2, ..., 5} and beam size = 6 for each decoder. This generates a pool of 5 (T) × 6 (beam size) × 2 (forward + reverse decoder) = 60 candidates.

⁴To prevent the context of the sentence from distracting the perceiver, we consider sentences with < 15 words. Overall, we are left with a corpus of about 13.5 million sentences.

⁵Two sentences are similar if the cosine similarity between the average of the Word2Vec (Mikolov et al., 2013) representations of words in each sentence is ≥ 0.8.

4 Results
Data. We evaluate witty captions from our approach via human studies. 100 random images (having associated puns) are sampled from the validation set of COCO (Lin et al., 2014).
Baselines. We compare the wittiness of descriptions generated by our model against 3 qualitatively different baselines, and a human-written witty description of an image. Each of these evaluates a different component of our approach. Regular inference generates a fluent caption that is relevant to the image but does not attempt to be witty. Witty mismatch is a human-written witty caption, but for a different image from the one being evaluated. This baseline results in a caption that is intended to be witty, but does not attempt to be relevant to the image. Ambiguous is a ‘punny’ caption where a pun word in the boring (regular) caption is replaced by its counterpart. This caption is likely to contain content that is relevant to the image, and it contains a pun. However, the pun is not used in a fluent manner.

We evaluate the image-relevance of the top witty caption by comparing against a boring machine caption and a random caption (see supplementary).
Evaluation annotations. Our task is to generate captions that a layperson might find witty. To evaluate performance on this task, we ask people on Amazon Mechanical Turk (AMT) to vote for the wittier among a given pair of captions for an image. We collect annotations from 9 unique workers for each relative choice and take the majority vote as ground truth. For each image, we compare each of the 3 top-ranked and 1 low-ranked generated captions against 3 baseline captions and 1 human-written witty caption.⁶

Constrained human-written witty captions. We evaluate the ability of humans and automatic methods to use the given context and pun words to produce a caption that is perceived as witty. We ask subjects on AMT to describe a given image in a witty manner. To prevent observable structural differences between machine and human-written captions, we ensure a consistent pun vocabulary (utilization of pre-specified puns for a given image). We also ask people to avoid first-person accounts or quoting characters in the image.
Metric. In Fig. 3, we report performance of the generation approach using the Recall@K metric. For K = 1, 2, 3, we plot the percentage of images for which at least one of the K ‘best’ descriptions from our model outperformed another approach.

⁶This results in a total of 4 (captions) × 2 (generation + retrieval) × 4 (baselines + human caption) = 32 comparisons of our approach against baselines. We also compare the wittiness of the 4 generated captions against the 4 retrieved captions (see supplementary) for an image (16 comparisons). In total, we perform 48 comparisons per image, for 100 images.

Figure 3: Wittiness of top-3 generated captions vs. other approaches. The y-axis measures the % of images for which at least one of K captions from our approach is rated wittier than other approaches. Recall steadily increases with the number of generated captions (K).

Generated captions vs. baselines. As we see in Fig. 3, the top generated image description (top-1G) is perceived as wittier compared to all baseline approaches more often than not (the vote is > 50% at K = 1). We observe that as K increases, the recall steadily increases, i.e., when we consider the top K generated captions, increasingly often, humans find at least one of them to be wittier than captions produced by baseline approaches. People find the top-1G for a given image to be wittier than mismatched human-written image captions about 95% of the time. The top-1G is also wittier than a naive approach that introduces ambiguity about 54.2% of the time. When compared to a typical, boring caption, the generated captions are wittier 68% of the time. Further, in a head-to-head comparison, the generated captions are wittier than the retrieved captions 67.7% of the time. We also validate our choice of ranking captions based on the image captioning model score: we observe that a ‘bad’ caption, i.e., one ranked lower by our model, is significantly less witty than the top 3 output captions.
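The Recall@K computation itself can be sketched as follows. The vote data below is toy data, not the paper's annotations: `votes[i][k]` marks whether the k-th ranked caption for image i won its pairwise wittiness comparison.

```python
# Sketch of Recall@K: the fraction of images for which at least one of
# the top-K captions won its pairwise wittiness vote.
def recall_at_k(votes, k):
    wins = sum(1 for per_image in votes if any(per_image[:k]))
    return wins / len(votes)

votes = [[True, False, False],   # toy per-image outcomes for K = 3
         [False, False, True],
         [False, False, False],
         [True, True, False]]
# recall is non-decreasing in K, matching the trend in Fig. 3
```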

Surprisingly, when the human is constrained to use the same words and style as the model, the generated descriptions from the model are found to be wittier for 55% of the images. Note that in a Turing test, a machine would equal human performance at 50%.⁷ This led us to speculate whether the constraints placed on language and style might be restricting people’s ability to be witty. We confirmed this by evaluating free-form human captions.

⁷Recall that this compares how a witty description is constructed, given the image and specific pun words. A Turing test-style evaluation that compares the overall wittiness of a machine and a human would refrain from constraining the human in any way.

(a) Generated: a bored (board) bench sits in front of a window.
Retrieved: Wedge sits on the bench opposite Berry, bored (board).
Human: could you please make your pleas (please)!

(b) Generated: a loop (loupe) of flowers in a glass vase.
Retrieved: the flour (flower) inside teemed with worms.
Human: piece required for peace (piece).

(c) Generated: a woman sell (cell) her cell phone in a city.
Retrieved: Wright (right) slammed down the phone.
Human: a woman sighed (side) as she regretted the sell.

(d) Generated: a bear that is bare (bear) in the water.
Retrieved: water glistened off her bare (bear) breast.
Human: you won’t hear a creak (creek) when the bear is feasting.

(e) Generated: a loop (loupe) of scissors and a pair of scissors.
Retrieved: i continued slicing my pear (pair) on the cutting board.
Human: the scissors were near, but not clothes (close).

(f) Generated: a female tennis player caught (court) in mid swing.
Retrieved: i caught (court) thieves on the roof top.
Human: the man made a loud bawl (ball) when she threw the ball.

(g) Generated: a bored (board) living room with a large window.
Retrieved: anya sat on the couch, feeling bored (board).
Human: the sealing (ceiling) on the envelope resembled that in the ceiling.

(h) Generated: a parking meter with rode (road) in the background.
Retrieved: smoke speaker sighed (side).
Human: a nitting of color didn’t make the poll (pole) less black.

Figure 4: The top row contains selected examples of human-written witty captions, and witty captions generated and retrieved from our models. The examples in the bottom row are randomly picked.

Free-form Human-written Witty Captions. We ask people on AMT to describe an image (using any vocabulary) in a manner that would be perceived as funny. As expected, when compared against automatic captions from our approach, human evaluators find free-form human captions to be wittier about 90% of the time, compared to 45% in the case of constrained human witty captions. Clearly, human-level creative language with unconstrained sentence length, style, choice of puns, etc., makes a significant difference in the wittiness of a description. In contrast, our automatic approach is constrained by caption-like language, length, and a word-based pun list. Training models to intelligently navigate this creative freedom is an exciting open challenge.
Qualitative analysis. The generated witty captions exhibit interesting features like alliteration (‘a bare black bear ...’) in Fig. 1b and 4c. At times, both the original pun (pole) and its counterpart (poll) make sense for the image (Fig. 1a). Occasionally, a pun is naively replaced by its counterpart (Fig. 4a), or rare puns are used (Fig. 4b). On the other hand, some descriptions (Fig. 4e and 4h) that are forced to utilize puns do not make sense. See the supplementary for analysis of the retrieval model.

5 Conclusion
We presented novel computational models inspired by cognitive accounts to address the challenging task of producing contextually witty descriptions for a given image. We evaluate the models via human studies, in which they significantly outperform meaningful baseline approaches.

Acknowledgements
We thank Shubham Toshniwal for his advice regarding the automatic speech recognition model. This work was supported in part by: an NSF CAREER award, an ONR YIP award, ONR Grant N00014-14-12713, a PGA Family Foundation award, Google FRA, Amazon ARA, a DARPA XAI grant to DP and NVIDIA GPU donations; and Google FRA, an IBM Faculty Award, and a Bloomberg Data Science Research Grant to MB.



References

Salvatore Attardo and Victor Raskin. 1991. Script theory revis(it)ed: Joke similarity and joke representation model. Humor: International Journal of Humor Research 4(3-4):293–348.

Kim Binsted. 1996. Machine humour: An implemented model of puns. PhD thesis, University of Edinburgh.

Kim Binsted and Graeme Ritchie. 1997. Computational rules for generating punning riddles. Humor: International Journal of Humor Research.

Arjun Chandrasekaran, Ashwin Kalyan, Stanislaw Antol, Mohit Bansal, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2016. We are humor beings: Understanding and predicting visual humor. In CVPR.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, pages 248–255.

Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. 2016. Generating topical poetry. In EMNLP.

Aaron Jaech, Rik Koncel-Kedziorski, and Mari Ostendorf. 2016. Phonological pun-derstanding. In HLT-NAACL.

Preethi Jyothi and Karen Livescu. 2014. Revisiting word neighborhoods for speech recognition. ACL 2014, page 1.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer, pages 740–755.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, ETMTNLP ’02.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

William J. Pepicello and Thomas A. Green. 1984. The Language of Riddles: New Perspectives. The Ohio State University Press.

Sasa Petrovic and David Matthews. 2013. Unsupervised joke generation from big data. In ACL.

Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha, et al. 2015. Humor in collective discourse: Unsupervised funniness detection in the New Yorker cartoon caption contest. arXiv preprint arXiv:1506.08126.

Dafna Shahaf, Eric Horvitz, and Robert Mankoff. 2015. Inside jokes: Identifying humorous cartoon captions. In KDD. ACM, pages 1065–1074.

Oliviero Stock and Carlo Strapparava. 2005. HAHAcronym: A computational humor system. In ACL.

Jerry M. Suls. 1972. A two-stage model for the appreciation of jokes and cartoons: An information-processing analysis. The Psychology of Humor: Theoretical Perspectives and Empirical Issues.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278–4284.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence.

William Yang Wang and Miaomiao Wen. 2015. I can has cheezburger? A nonparanormal approach to combining textual and visual information for predicting and generating popular meme descriptions. In NAACL.

Zhou Yu, Leah Nicolich-Henkin, Alan W. Black, and Alex I. Rudnicky. 2016. A Wizard-of-Oz study on a non-task-oriented dialog system that reacts to user engagement. In SIGDIAL, page 55.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, pages 19–27.

Arnold Zwicky and Elizabeth Zwicky. 1986. Imperfect puns, markedness, and phonological similarity: With fronds like these, who needs anemones. Folia Linguistica 20(3-4):493–503.
