

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 899–909, Melbourne, Australia, July 15–20, 2018. ©2018 Association for Computational Linguistics.


No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling

Xin Wang∗, Wenhu Chen∗, Yuan-Fang Wang, William Yang Wang
University of California, Santa Barbara

{xwang,wenhuchen,yfwang,william}@cs.ucsb.edu

Abstract

Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. Thus it poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics on evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties in gaining an overall performance boost. Therefore, we propose an Adversarial REward Learning (AREL) framework to learn an implicit reward function from human demonstrations, and then optimize policy search with the learned reward function. Though automatic evaluation indicates a slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that our approach achieves significant improvement in generating more human-like stories than SOTA systems. Code will be made available here.¹

1 Introduction

Recently, increasing attention has been focused on visual captioning (Chen et al., 2015; Xu et al., 2016; Wang et al., 2018c), which aims at describing the content of an image or a video. Though it has achieved impressive results, its capability of performing human-like understanding is still restrictive.

∗ Equal contribution
¹ https://github.com/littlekobe/AREL

Story #1: The brother and sister were ready for the first day of school. They were excited to go to their first day and meet new friends. They told their mom how happy they were. They said they were going to make a lot of new friends. Then they got up and got ready to get in the car.
Story #2: The brother did not want to talk to his sister. The siblings made up. They started to talk and smile. Their parents showed up. They were happy to see them.

Captions: (a) A small boy and a girl are sitting together. (b) Two kids sitting on a porch with their backpacks on. (c) Two young kids with backpacks sitting on the porch. (d) Two young children that are very close to one another. (e) A boy and a girl smiling at the camera together.

Figure 1: An example of visual storytelling and visual captioning. Both captions and stories are shown here: each image is captioned with one sentence, and we also demonstrate two diversified stories that match the same image sequence.

To further investigate machines' capabilities in understanding more complicated visual scenarios and composing more structured expressions, visual storytelling (Huang et al., 2016) has been proposed. Visual captioning is aimed at depicting the concrete content of the images, and its expression style is rather simple. In contrast, visual storytelling goes one step further: it summarizes the idea of a photo stream and tells a story about it. Figure 1 shows an example of visual captioning and visual storytelling. We have observed that stories contain rich emotions (excited, happy, not want) and imagination (siblings, parents, school, car). It therefore requires the capability to associate with concepts that do not explicitly appear in the images. Moreover, stories are more subjective, so there barely exist standard templates for storytelling.


As shown in Figure 1, the same photo stream can be paired with diverse stories that differ from each other. This heavily increases the evaluation difficulty.

So far, prior work on visual storytelling (Huang et al., 2016; Yu et al., 2017b) has mainly been inspired by the success of visual captioning. Nevertheless, because these methods are trained by maximizing the likelihood of the observed data pairs, they are restricted to generating simple and plain descriptions with limited expressive patterns. In order to cope with the challenges and produce more human-like descriptions, Rennie et al. (2016) have proposed a reinforcement learning framework. However, in the scenario of visual storytelling, the common reinforced captioning methods face great challenges since the hand-crafted rewards based on string matches are either too biased or too sparse to drive the policy search. For instance, we used the METEOR (Banerjee and Lavie, 2005) score as the reward to reinforce our policy and found that though the METEOR score is significantly improved, the other scores are severely harmed. Here we showcase an adversarial example with an average METEOR score as high as 40.2:

We had a great time to have a lot of the. They were to be a of the. They were to be in the. The and it were to be the. The, and it were to be the.

Apparently, the machine is gaming the metrics. Conversely, when using some other metrics (e.g. BLEU, CIDEr) to evaluate the stories, we observe an opposite behavior: many relevant and coherent stories are receiving a very low score (nearly zero).

In order to resolve the strong bias brought by the hand-coded evaluation metrics in RL training and produce more human-like stories, we propose an Adversarial REward Learning (AREL) framework for visual storytelling. We draw our inspiration from recent progress in inverse reinforcement learning (Ho and Ermon, 2016; Finn et al., 2016; Fu et al., 2017) and propose the AREL algorithm to learn a more intelligent reward function. Specifically, we first incorporate a Boltzmann distribution to associate reward learning with distribution approximation, then design the adversarial process with two models – a policy model and a reward model. The policy model performs the primitive actions and produces the story sequence, while the reward model is responsible for learning the implicit reward function from human demonstrations. The learned reward function is then employed to optimize the policy in return.

For evaluation, we conduct both automatic metric and human evaluation but observe a poor correlation between them. Particularly, our method gains a slight performance boost over the baseline systems on automatic metrics; human evaluation, however, indicates a significant performance boost. Thus we further discuss the limitations of the metrics and validate the superiority of our AREL method in performing more intelligent understanding of the visual scenes and generating more human-like stories.

Our main contributions are four-fold:

• We propose an adversarial reward learning framework and apply it to boost visual story generation.

• We evaluate our approach on the Visual Storytelling (VIST) dataset and achieve the state-of-the-art results on automatic metrics.

• We empirically demonstrate that automatic metrics are not perfect for either training or evaluation.

• We design and perform a comprehensive human evaluation via Amazon Mechanical Turk, which demonstrates the superiority of the generated stories of our method on relevance, expressiveness, and concreteness.

2 Related Work

Visual Storytelling Visual storytelling is the task of generating a narrative story from a photo stream, which requires a deeper understanding of the event flow in the stream. Park and Kim (2015) did some pioneering research on storytelling. Chen et al. (2017) proposed a multi-modal approach for storyline generation to produce a stream of entities instead of human-like descriptions. Recently, a more sophisticated dataset for visual storytelling (VIST) has been released to explore a more human-like understanding of grounded stories (Huang et al., 2016). Yu et al. (2017b) propose a multi-task learning algorithm for both album summarization and paragraph generation, achieving the best results on the VIST dataset. But these methods are still based on behavioral cloning and lack the ability to generate more structured stories.


Reinforcement Learning in Sequence Generation Recently, reinforcement learning (RL) has gained popularity in many sequence generation tasks such as machine translation (Bahdanau et al., 2016), visual captioning (Ren et al., 2017; Wang et al., 2018b), summarization (Paulus et al., 2017; Chen et al., 2018), etc. The common wisdom of using RL is to view generating a word as an action and aim at maximizing the expected return by optimizing the policy. As pointed out in Ranzato et al. (2015), the traditional maximum likelihood algorithm is prone to exposure bias and label bias, while the RL agent exposes the generative model to its own distribution and thus can perform better. But these works usually utilize hand-crafted metric scores as the reward to optimize the model, which fails to capture more implicit semantics due to the limitations of automatic metrics.

Rethinking Automatic Metrics Automatic metrics, including BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), METEOR (Banerjee and Lavie, 2005), and ROUGE (Lin, 2004), have been widely applied to sequence generation tasks. Using automatic metrics enables rapid prototyping and testing of new models with less expensive human evaluation. However, they have been criticized for being biased and correlating poorly with human judgments, especially in many generative tasks like response generation (Lowe et al., 2017; Liu et al., 2016), dialogue systems (Bruni and Fernandez, 2017) and machine translation (Callison-Burch et al., 2006). The naive overlap-counting methods are not able to reflect many semantic properties of natural language, such as coherence, expressiveness, etc.

Generative Adversarial Network The generative adversarial network (GAN) (Goodfellow et al., 2014) is a very popular approach for estimating intractable probabilities, which sidesteps the difficulty by alternately training two models to play a min-max two-player game:

minG maxD  Ex∼pdata[log D(x)] + Ez∼pz[log(1 − D(G(z)))],

where G is the generator, D is the discriminator, and z is the latent variable. Recently, GANs have quickly been adopted to tackle discrete problems (Yu et al., 2017a; Dai et al., 2017; Wang et al., 2018a). The basic idea is to use Monte Carlo policy gradient estimation (Williams, 1992) to update the parameters of the generator.

Figure 2: AREL framework for visual storytelling. (The diagram shows the policy model producing a sampled story from the input images, and the reward model, trained on images and reference stories through the adversarial objective (inverse RL), returning a reward that drives the RL update of the policy.)

Inverse Reinforcement Learning Reinforcement learning is known to be hindered by the need for extensive feature and reward engineering, especially under unknown dynamics. Therefore, inverse reinforcement learning (IRL) has been proposed to infer the expert's reward function. Previous IRL approaches include maximum margin approaches (Abbeel and Ng, 2004; Ratliff et al., 2006) and probabilistic approaches (Ziebart, 2010; Ziebart et al., 2008). Recently, adversarial inverse reinforcement learning methods have provided an efficient and scalable route to automatic reward acquisition (Ho and Ermon, 2016; Finn et al., 2016; Fu et al., 2017; Henderson et al., 2017). These approaches utilize the connection between IRL and energy-based models and associate every data point with a scalar energy value by using a Boltzmann distribution pθ(x) ∝ exp(−Eθ(x)). Inspired by these methods, we propose a practical AREL approach for visual storytelling to uncover a robust reward function from human demonstrations and thus help produce human-like stories.

3 Our Approach

3.1 Problem Statement

Here we consider the task of visual storytelling, whose objective is to output a word sequence W = (w1, w2, · · · , wT), wt ∈ V, given an input image stream of 5 ordered images I = (I1, I2, · · · , I5), where V is the vocabulary of all output tokens. We formulate the generation as a Markov decision process and design a reinforcement learning framework to tackle it. As described in Figure 2, our AREL framework is mainly composed of two modules: a policy model πβ(W) and a reward model Rθ(W). The policy model takes an image sequence I as the input and performs sequential actions (choosing words w from the vocabulary V) to form a narrative story W.


Figure 3: Overview of the policy model. The visual encoder is a bidirectional GRU, which encodes the high-level visual features extracted from the input images. Its outputs are then fed into the RNN decoders to generate sentences in parallel. Finally, we concatenate all the generated sentences as a full story. Note that the five decoders share the same weights. (Example story in the figure: "My brother recently graduated college. It was a formal cap and gown event. My mom and dad attended. Later, my aunt and grandma showed up. When the event was over he even got congratulated by the mascot.")

The reward model is optimized by the adversarial objective (see Section 3.3) and aims at deriving a human-like reward from both human-annotated stories and sampled predictions.

3.2 Model

Policy Model As shown in Figure 3, the policy model is a CNN-RNN architecture. We first feed the photo stream I = (I1, · · · , I5) into a pretrained CNN and extract their high-level image features. We then employ a visual encoder to further encode the image features as context vectors hi = [←hi ; →hi]. The visual encoder is a bidirectional gated recurrent unit (GRU).

In the decoding stage, we feed each context vector hi into a GRU-RNN decoder to generate a sub-story Wi. Formally, the generation process can be written as:

s^i_t = GRU(s^i_{t−1}, [w^i_{t−1}, hi]),   (1)
πβ(w^i_t | w^i_{1:t−1}) = softmax(Ws s^i_t + bs),   (2)

where s^i_t denotes the t-th hidden state of the i-th decoder. We concatenate the previous token w^i_{t−1} and the context vector hi as the input. Ws and bs are the projection matrix and bias, which output a probability distribution over the whole vocabulary V. Eventually, the final story W is the concatenation of the sub-stories Wi. β denotes all the parameters of the encoder, the decoder, and the output layer.
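To make the decoding procedure above concrete, the following is a minimal PyTorch sketch of such a CNN-RNN policy model. It is an illustration under our own assumptions (class names, hidden sizes, a <BOS> index of 0, precomputed ResNet image features), not the released implementation.

```python
# Sketch of the policy model: bidirectional GRU visual encoder + a shared GRU decoder
# applied per image, as in Eq. (1)-(2). Hyper-parameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyModel(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=9837):
        super().__init__()
        self.visual_encoder = nn.GRU(feat_dim, hidden, bidirectional=True,
                                     batch_first=True)           # h_i = [<-h_i ; ->h_i]
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)    # input: [w_{t-1}, h_i]
        self.out = nn.Linear(hidden, vocab_size)                  # W_s, b_s

    def forward(self, img_feats, max_len=20):
        """img_feats: (batch, 5, feat_dim) -> list of 5 sampled sub-story token tensors."""
        ctx, _ = self.visual_encoder(img_feats)                   # (batch, 5, 2*hidden)
        stories = []
        for i in range(ctx.size(1)):                              # the five decoders share weights
            h_i = ctx[:, i]                                       # context vector for image i
            s_t = torch.zeros(img_feats.size(0), self.out.in_features,
                              device=img_feats.device)
            w_t = torch.zeros(img_feats.size(0), dtype=torch.long,
                              device=img_feats.device)            # <BOS> assumed to be index 0
            tokens = []
            for _ in range(max_len):
                s_t = self.decoder(torch.cat([self.embed(w_t), h_i], dim=-1), s_t)
                probs = F.softmax(self.out(s_t), dim=-1)          # Eq. (2)
                w_t = torch.distributions.Categorical(probs).sample()
                tokens.append(w_t)
            stories.append(torch.stack(tokens, dim=1))
        return stories                                            # concatenated to form W
```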

Figure 4: Overview of the reward model. Our reward model is a CNN-based architecture, which utilizes convolution kernels of size 2, 3 and 4 to extract bigram, trigram and 4-gram representations from the input sequence embeddings. Once the sentence representation is learned, it will be concatenated with the visual representation of the input image, and then fed into the final FC layer to obtain the reward.

Reward Model The reward model Rθ(W) is a CNN-based architecture (see Figure 4). Instead of giving an overall score for the whole story, we apply the reward model to the different story parts (sub-stories) Wi and compute partial rewards, where i = 1, · · · , 5. We observe that the partial rewards are more fine-grained and can provide better guidance for the policy model.

We first query the word embeddings of the sub-story (one sentence in most cases). Next, multiple convolutional layers with different kernel sizes are used to extract the n-gram features, which are then projected into the sentence-level representation space by pooling layers (the design here is inspired by Kim (2014)). In addition to the textual features, evaluating the quality of a story should also consider the image features for relevance. Therefore, we then combine the sentence representation with the visual feature of the input image through concatenation and feed them into the final fully connected decision layer. In the end, the reward model outputs an estimated reward value Rθ(W). The process can be written as:

Rθ(W) = Wr(fconv(W) + Wi ICNN) + br,   (3)

where Wr and br denote the weights in the output layer, and fconv denotes the operations in the CNN. ICNN is the high-level visual feature extracted from the image, and Wi projects it into the sentence representation space. θ includes all the parameters above.


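A minimal sketch of such a CNN-based reward model, written against Eq. (3), is given below. It is an assumption-laden illustration, not the authors' code: the class name, dimensions, and the choice of max-pooling are ours, and the image feature is added via a learned projection (Wi) as in Eq. (3) rather than literally concatenated.

```python
# Sketch of the reward model: n-gram convolutions (kernel sizes 2/3/4), pooling,
# a projected image feature, and a final FC layer producing a scalar reward.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=9837, emb=512, feat_dim=2048, channels=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb, channels, k) for k in (2, 3, 4)])     # bigram/trigram/4-gram features
        self.proj_img = nn.Linear(feat_dim, 3 * channels)         # W_i I_CNN
        self.out = nn.Linear(3 * channels, 1)                     # W_r, b_r

    def forward(self, tokens, img_feat):
        """tokens: (batch, T) word ids of one sub-story (T >= 4); img_feat: (batch, feat_dim)."""
        x = self.embed(tokens).transpose(1, 2)                    # (batch, emb, T)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        sent = torch.cat(pooled, dim=-1)                          # f_conv(W)
        return self.out(sent + self.proj_img(img_feat)).squeeze(-1)   # estimated R_theta(W_i)
```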

3.3 Learning

Reward Boltzmann Distribution In order to associate the story distribution with the reward function, we apply an energy-based model (EBM) to define a Reward Boltzmann distribution:

pθ(W) = exp(Rθ(W)) / Zθ,   (4)

where W is the word sequence of the story, pθ(W) is the approximate data distribution, and Zθ = ΣW exp(Rθ(W)) denotes the partition function. According to the energy-based model (LeCun et al., 2006), the optimal reward function R*(W) is achieved when the Reward Boltzmann distribution equals the "real" data distribution, pθ(W) = p*(W).
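As a toy illustration of Eq. (4): over a small, finite candidate set the distribution is just a softmax of the rewards; in practice Zθ sums over all possible word sequences and is intractable, which is what motivates the adversarial objective below. The reward values here are made up.

```python
import torch

# Hypothetical rewards R_theta(W) for three candidate stories.
rewards = torch.tensor([2.0, 0.5, -1.0])

# Reward Boltzmann distribution of Eq. (4): p_theta(W) = exp(R_theta(W)) / Z_theta,
# with Z_theta the sum of exp(R_theta) over the candidate set.
p_theta = torch.softmax(rewards, dim=0)
print(p_theta)   # higher reward -> higher probability mass
```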

Adversarial Reward Learning We first introduce an empirical distribution pe(W) = 1(W ∈ D) / |D| to represent the empirical distribution of the training data, where D denotes the dataset with |D| stories and 1 denotes an indicator function. We use this empirical distribution as the "good" examples, which provides the evidence for the reward function to learn from.

In order to approximate the Reward Boltzmann distribution towards the "real" data distribution p*(W), we design a min-max two-player game, where the Reward Boltzmann distribution pθ aims at maximizing its similarity with the empirical distribution pe while minimizing that with the "faked" data generated from the policy model πβ. On the contrary, the policy distribution πβ tries to maximize its similarity with the Boltzmann distribution pθ. Formally, the adversarial objective function is defined as

maxβ minθ  KL(pe(W) || pθ(W)) − KL(πβ(W) || pθ(W)).   (5)

We further decompose it into two parts. First, because the objective Jβ of the story generation policy is to maximize its similarity with the Boltzmann distribution pθ, the optimal policy that minimizes the KL-divergence is π(W) ∼ exp(Rθ(W)), meaning that if Rθ is optimal, the optimal πβ = π*. In formula,

Jβ = −KL(πβ(W) || pθ(W)) = EW∼πβ(W)[Rθ(W)] + H(πβ(W)),   (6)

Algorithm 1 The AREL Algorithm
1: for episode ← 1 to N do
2:    collect story W by executing policy πβ
3:    if Train-Reward then
4:       θ ← θ − η × ∂Jθ/∂θ (see Equation 9)
5:    else if Train-Policy then
6:       collect story W from the empirical distribution pe
7:       β ← β − η × ∂Jβ/∂β (see Equation 9)
8:    end if
9: end for

where H denotes the entropy of the policy model. On the other hand, the objective Jθ of the reward function is to distinguish between human-annotated stories and machine-generated stories. Hence it tries to minimize the KL-divergence with the empirical distribution pe and maximize the KL-divergence with the approximated policy distribution πβ:

Jθ = KL(pe(W) || pθ(W)) − KL(πβ(W) || pθ(W))
   = ΣW [pe(W)Rθ(W) − πβ(W)Rθ(W)] − H(pe) + H(πβ).   (7)

Since H(πβ) and H(pe) are irrelevant to θ, we denote them as a constant C. Therefore, the objective Jθ can be further derived as

Jθ = EW∼pe(W)[Rθ(W)] − EW∼πβ(W)[Rθ(W)] + C.   (8)

Here we propose to use stochastic gradient descent to optimize these two models alternately. Formally, the gradients can be written as

∂Jθ/∂θ = EW∼pe(W)[∂Rθ(W)/∂θ] − EW∼πβ(W)[∂Rθ(W)/∂θ],
∂Jβ/∂β = EW∼πβ(W)[(Rθ(W) + log πβ(W) − b) ∂log πβ(W)/∂β],   (9)

where b is the estimated baseline to reduce the variance.

Training & Testing As described in Algo-rithm 1, we introduce an alternating algorithm totrain these two models using stochastic gradientdescent. During testing, the policy model is usedwith beam search to produce the story.

4 Experiments and Analysis

4.1 Experimental Setup

VIST Dataset The VIST dataset (Huang et al., 2016) is the first dataset for sequential vision-to-language tasks including visual storytelling, which consists of 10,117 Flickr albums with 210,819 unique photos.


In this paper, we mainly evaluate our AREL method on this dataset. After filtering the broken images², there are 40,098 training, 4,988 validation, and 5,050 testing samples. Each sample contains one story that describes 5 selected images from a photo album (mostly one sentence per image), and the same album is paired with 5 different stories as references. In our experiments, we used the same split settings as in (Huang et al., 2016; Yu et al., 2017b) for a fair comparison.

Evaluation Metrics In order to comprehensively evaluate our method on the storytelling dataset, we adopted both automatic metrics and human evaluation as our criteria. Four diverse automatic metrics were used in our experiments: BLEU, METEOR, ROUGE-L, and CIDEr. We utilized the open-source evaluation code³ used in (Yu et al., 2017b). For human evaluation, we employed Amazon Mechanical Turk to perform two kinds of user studies (see Section 4.3 for more details).

Training Details We employ a pretrained ResNet-152 model (He et al., 2016) to extract image features from the photo stream. We built a vocabulary of size 9,837 that includes words appearing more than three times in the training set. More training details can be found in Appendix B.

4.2 Automatic Evaluation

In this section, we compare our AREL method with the state-of-the-art methods as well as standard reinforcement learning algorithms on automatic evaluation metrics. Then we further discuss the limitations of the hand-crafted metrics for evaluating human-like stories.

Comparison with SOTA on Automatic Metrics In Table 1, we compare our method with Huang et al. (2016) and Yu et al. (2017b), which report the best-known results on the VIST dataset. We first implement a strong baseline model (XE-ss), which shares the same architecture with our policy model but is trained with cross-entropy loss and scheduled sampling. Besides, we adopt traditional generative adversarial training for comparison (GAN). As shown in Table 1, our XE-ss model already outperforms the best-known results on the VIST dataset, and the GAN model can bring a further performance boost.

² There are only 3 (out of 21,075) broken images in the test set, which basically has no influence on the final results. Moreover, Yu et al. (2017b) also removed the 3 pictures, so it is a fair comparison.

³ https://github.com/lichengunc/vist_eval

Method        B-1   B-2   B-3   B-4   M     R     C

Huang et al.   -     -     -     -   31.4    -     -
Yu et al.      -     -   21.0    -   34.1  29.5   7.5
XE-ss        62.3  38.2  22.5  13.7  34.8  29.7   8.7
GAN          62.8  38.8  23.0  14.0  35.0  29.5   9.0
AREL-s-50    63.8  38.9  22.9  13.8  34.9  29.4   9.5
AREL-t-50    63.4  39.0  23.1  14.1  35.2  29.6   9.5
AREL-s-100   63.9  39.1  23.0  13.9  35.0  29.7   9.6
AREL-t-100   63.8  39.1  23.2  14.1  35.0  29.5   9.4

Table 1: Automatic evaluation on the VIST dataset. We report BLEU (B), METEOR (M), ROUGE-L (R), and CIDEr (C) scores of the SOTA systems and the models we implemented, including XE-ss, GAN and AREL. AREL-s-N denotes an AREL model with sigmoid as the output activation and alternating frequency N, while AREL-t-N denotes an AREL model with tanh as the output activation (N = 50 or 100).

We then use the XE-ss model to initialize our policy model and further train it with AREL. Evidently, our AREL model performs the best and achieves the new state-of-the-art results across all metrics.

However, compared with the XE-ss model, the performance gain is minor, especially on the METEOR and ROUGE-L scores. In Section 4.3, the extensive human evaluation indicates that our AREL framework brings a significant improvement in generating human-like stories over the XE-ss model. The inconsistency between automatic evaluation and human evaluation leads us to suspect that these hand-crafted metrics lack the ability to fully evaluate stories' quality due to the complicated characteristics of stories. Therefore, we conduct experiments to analyze and discuss the defects of the automatic metrics in Section 4.2.

Limitations of Automatic Metrics As we claimed in the introduction, string-match-based automatic metrics are not perfect and fail to evaluate some semantic characteristics of the stories, such as expressiveness and coherence. In order to confirm our conjecture, we utilize automatic metrics as rewards to reinforce the visual storytelling model, adopting policy gradient with a baseline to train the policy model. The quantitative results are demonstrated in Table 2.
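Concretely, these metric-as-reward baselines can be trained with a REINFORCE-style update of the following shape. This is a sketch under our assumptions; `policy.sample` and `metric_score` are hypothetical stand-ins for the generator and for whichever metric implementation (BLEU, METEOR, ROUGE-L, or CIDEr) is used as the reward.

```python
import torch

def metric_rl_step(policy, optimizer, img_feats, references, metric_score):
    # Sample a story per photo stream and keep the summed log-probability for the gradient.
    sampled_stories, log_prob = policy.sample(img_feats)           # W ~ pi_beta
    with torch.no_grad():
        # Hand-crafted reward: metric score of each sample against its references.
        rewards = torch.tensor([metric_score(story, refs)
                                for story, refs in zip(sampled_stories, references)])
        baseline = rewards.mean()                                   # variance-reduction baseline b
    loss = -((rewards - baseline) * log_prob).mean()                # REINFORCE estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```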

Apparently, METEOR-RL and ROUGE-RL are severely ill-posed: they obtain the highest scores on their own metrics but damage the other metrics severely.


Method        B-1   B-2   B-3   B-4   M     R     C

XE-ss        62.3  38.2  22.5  13.7  34.8  29.7   8.7
BLEU-RL      62.1  38.0  22.6  13.9  34.6  29.0   8.9
METEOR-RL    68.1  35.0  15.4   6.8  40.2  30.0   1.2
ROUGE-RL     58.1  18.5   1.6   0    27.0  33.8   0
CIDEr-RL     61.9  37.8  22.5  13.8  34.9  29.7   8.1
AREL (avg)   63.7  39.0  23.1  14.0  35.0  29.6   9.5

Table 2: Comparison of RL models trained with different metric scores as the rewards. We report the average scores of the AREL models as AREL (avg). Although the METEOR-RL and ROUGE-RL models achieve very high scores on their own metrics, their scores on the other metrics are severely damaged: they are actually gaming their own metrics with nonsense sentences.

We observe that these models are actually overfitting to a given metric while losing overall coherence and semantic correctness. As with the METEOR score, there is also an adversarial example for ROUGE-L⁴, which is nonsense but achieves an average ROUGE-L score of 33.8.

Besides, as can be seen in Table 2, after reinforced training, BLEU-RL and CIDEr-RL do not bring a consistent improvement over the XE-ss model. We plot the histogram distributions of both BLEU-3 and CIDEr scores on the test set in Figure 5. An interesting fact is that there are a large number of samples with nearly zero score on both metrics. However, we observed that those "zero-score" samples are not pointless results; instead, many of them make sense and deserve a better score than zero. Here is a "zero-score" example on BLEU-3:

I had a great time at the restaurant today. The food was delicious. I had a lot of food. The food was delicious. I had a great time.

The corresponding reference is

The table of food was a pleasure to see! Our food is both nutritious and beautiful! Our chicken was especially tasty! We love greens as they taste great and are healthy! The fruit was a colorful display that tantalized our palette.

Although the prediction is not as good as the reference, it is actually coherent and relevant to the theme "food and eating", which showcases the defects of using BLEU and CIDEr scores as rewards for RL training.

⁴ An adversarial example for ROUGE-L: "we the was a . and to the . we the was a . and to the . we the was a . and to the . we the was a . and to the . we the was a . and to the ."

Method     Win     Lose    Unsure

XE-ss      22.4%   71.7%   5.9%
BLEU-RL    23.4%   67.9%   8.7%
CIDEr-RL   13.8%   80.3%   5.9%
GAN        34.3%   60.5%   5.2%
AREL       38.4%   54.2%   7.4%

Table 3: Turing test results.


Moreover, we compare the human evaluation scores with these two metric scores in Figure 5. Noticeably, both BLEU-3 and CIDEr correlate poorly with the human evaluation scores. Their distributions are more biased and thus cannot fully reflect the quality of the generated stories. In terms of BLEU, it is extremely hard for machines to produce exact 3-gram or 4-gram matches, so the scores are too low to provide useful guidance. CIDEr measures the similarity of a sentence to the majority of the references. However, the references to the same photo stream are quite different from each other, so the score is very low and not suitable for this task. In contrast, our AREL framework can learn a more robust reward function from human-annotated stories, which is able to provide better guidance to the policy and thus improves its performance across different metrics.
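For reference, the near-zero sentence-level BLEU-3 behaviour discussed above is easy to reproduce with an off-the-shelf implementation. The sketch below assumes NLTK is installed; the hypothesis and reference strings are the example quoted earlier (the reference shortened for brevity), so the exact number is only illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hyp = ("I had a great time at the restaurant today. The food was delicious. "
       "I had a lot of food. The food was delicious. I had a great time.").lower().split()
ref = ("The table of food was a pleasure to see! Our food is both nutritious and "
       "beautiful! Our chicken was especially tasty!").lower().split()

# Sentence-level BLEU-3 (uniform weights over 1/2/3-grams, with smoothing to avoid log(0)).
score = sentence_bleu([ref], hyp, weights=(1/3, 1/3, 1/3),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 4))   # very low: hardly any exact 3-gram overlap with the reference
```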

Comparison with GAN We here compare our method with the traditional GAN (Goodfellow et al., 2014), whose update rules for the generator can be generally classified into two categories. We demonstrate their corresponding objectives and ours as follows:

GAN1: Jβ = EW∼pβ[− log Rθ(W)],
GAN2: Jβ = EW∼pβ[log(1 − Rθ(W))],
ours:  Jβ = EW∼pβ[−Rθ(W)].

As discussed in Arjovsky et al. (2017), GAN1 is prone to the unstable gradient issue and GAN2 is prone to the vanishing gradient issue. Analytically, our method does not suffer from these two common issues and is thus able to converge to optimum solutions more easily. From Table 1, we can observe only slight gains of AREL over GAN on the automatic metrics; therefore, we further deploy human evaluation for a better comparison.


Figure 5: Metric score distributions. We plot the histogram distributions of BLEU-3 and CIDEr scores on the test set, as well as the human evaluation score distribution on the test samples. For a fair comparison, we use the Turing test results to calculate the human evaluation scores (see Section 4.3). Basically, a score of 0.2 is given if the generated story wins the Turing test, 0.1 for a tie, and 0 if it loses. Each sample has 5 scores from 5 judges, and we use the sum as the human evaluation score, so it is in the range [0, 1].

Choice (%)        AREL vs XE-ss         AREL vs BLEU-RL         AREL vs CIDEr-RL        AREL vs GAN
                  AREL  XE-ss  Tie      AREL  BLEU-RL  Tie      AREL  CIDEr-RL  Tie     AREL  GAN   Tie
Relevance         61.7  25.1   13.2     55.8  27.9     16.3     56.1  28.2      15.7    52.9  35.8  11.3
Expressiveness    66.1  18.8   15.1     59.1  26.4     14.5     59.1  26.6      14.3    48.5  32.2  19.3
Concreteness      63.9  20.3   15.8     60.1  26.3     13.6     59.5  24.6      15.9    49.8  35.8  14.4

Table 4: Pairwise human comparisons. The results indicate the consistent superiority of our AREL model in generating more human-like stories than the SOTA methods.


4.3 Human Evaluation

Automatic metrics cannot fully evaluate the capability of our AREL method. Therefore, we perform two different kinds of human evaluation studies on Amazon Mechanical Turk: a Turing test and pairwise human evaluation. For both tasks, we use 150 stories (750 images) sampled from the test set, each assigned to 5 workers to eliminate human variance. We batch six items as one assignment and insert an additional assignment as a sanity check. Besides, the order of the options within each item is shuffled to make a fair comparison.

Turing Test We first conduct five independent Turing tests for the XE-ss, BLEU-RL, CIDEr-RL, GAN, and AREL models, during which the worker is given one human-annotated sample and one machine-generated sample and needs to decide which is human-annotated. As shown in Table 3, our AREL model significantly outperforms all the other baseline models in the Turing test: it has a much higher chance of fooling the AMT workers (the ratio is AREL : XE-ss : BLEU-RL : CIDEr-RL : GAN = 45.8% : 28.3% : 32.1% : 19.7% : 39.5%), which confirms the superiority of our AREL framework in generating human-like stories. Unlike the automatic metric evaluation, the Turing test indicates a much larger margin between AREL and the other competing algorithms. Thus, we empirically confirm that metrics are not perfect at evaluating many implicit semantic properties of natural language. Besides, the Turing test of our AREL model reveals that nearly half of the workers are fooled by our machine generation, indicating a preliminary success toward generating human-like stories.

Pairwise Comparison In order to have a clear comparison with the competing algorithms with respect to different semantic features of the stories, we further perform four pairwise comparison tests: AREL vs XE-ss/BLEU-RL/CIDEr-RL/GAN. For each photo stream, the worker is presented with two generated stories and asked to make decisions regarding three aspects: relevance⁵, expressiveness⁶ and concreteness⁷. This head-to-head comparison is designed to help us understand in which aspects our model outperforms the competing algorithms; the results are displayed in Table 4.

Consistently across all the comparisons, a large majority of the AREL stories trump the competing systems with respect to their relevance, expressiveness, and concreteness.

⁵ Relevance: the story accurately describes what is happening in the image sequence and covers the main objects.

⁶ Expressiveness: coherence, grammatically and semantically correct, no repetition, expressive language style.

⁷ Concreteness: the story should narrate concretely what is in the image rather than giving very general descriptions.


XE-ss: We took a trip to the mountains. There were many different kinds of different kinds. We had a great time. He was a great time. It was a beautiful day.

AREL: The family decided to take a trip to the countryside. There were so many different kinds of things to see. The family decided to go on a hike. I had a great time. At the end of the day, we were able to take a picture of the beautiful scenery.

Human-created Story: We went on a hike yesterday. There were a lot of strange plants there. I had a great time. We drank a lot of water while we were hiking. The view was spectacular.

Figure 6: Qualitative comparison example with XE-ss. The direct comparison votes (AREL:XE-ss:Tie) were 5:0:0 on Relevance, 4:0:1 on Expressiveness, and 5:0:0 on Concreteness.

Therefore, it empirically confirms that our generated stories are more relevant to the image sequences, and more coherent and concrete than those of the other algorithms, which however is not explicitly reflected by the automatic metric evaluation.

4.4 Qualitative Analysis

Figure 6 gives a qualitative comparison example between the AREL and XE-ss models. Looking at the individual sentences, it is obvious that our results are more grammatically and semantically correct. Connecting the sentences together, we observe that the AREL story is more coherent and describes the photo stream more accurately. Thus, our AREL model significantly surpasses the XE-ss model on all three aspects of this qualitative example. Besides, it won the Turing test (3 out of 5 AMT workers thought the AREL story was created by a human). In the appendix, we also show a negative case that fails the Turing test.

5 Conclusion

In this paper, we not only introduce a novel adversarial reward learning algorithm to generate more human-like stories given image sequences, but also empirically analyze the limitations of automatic metrics for story evaluation. We believe there is still much room for improvement in narrative paragraph generation tasks, such as how to better simulate human imagination to create more vivid and diversified stories.

Acknowledgment

We thank Adobe Research for supporting our language and vision research. We would also like to thank Licheng Yu for clarifying the details of his paper and the anonymous reviewers for their thoughtful comments. This research was sponsored in part by the Army Research Laboratory under cooperative agreements W911NF09-2-0053. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM.

Martin Arjovsky, Soumith Chintala, and Leon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Elia Bruni and Raquel Fernandez. 2017. Adversarial evaluation for open-domain dialogue generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 284–288.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Wenhu Chen, Guanlin Li, Shuo Ren, Shujie Liu, Zhirui Zhang, Mu Li, and Ming Zhou. 2018. Generative bridging network in neural sequence prediction. In NAACL.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Zhiqian Chen, Xuchao Zhang, Arnold P. Boedihardjo, Jing Dai, and Chang-Tien Lu. 2017. Multi-modal storytelling via generative adversarial imitation learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3967–3973.

Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional GAN. In The IEEE International Conference on Computer Vision (ICCV).

Chelsea Finn, Paul Christiano, Pieter Abbeel, and Sergey Levine. 2016. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852.

Justin Fu, Katie Luo, and Sergey Levine. 2017. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, and Doina Precup. 2017. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. arXiv preprint arXiv:1709.06683.

Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573.

Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016).

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. 2006. A tutorial on energy-based learning. Predicting Structured Data, 1(0).

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Cesc C. Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. In Advances in Neural Information Processing Systems, pages 73–81.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2006. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pages 729–736. ACM.

Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. 2017. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Jing Wang, Jianlong Fu, Jinhui Tang, Zechao Li, and Tao Mei. 2018a. Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training. AAAI.

Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. 2018b. Video captioning via hierarchical reinforcement learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xin Wang, Yuan-Fang Wang, and William Yang Wang. 2018c. Watch, listen, and describe: Globally and locally aligned cross-modal attentions for video captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017a. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858.

Licheng Yu, Mohit Bansal, and Tamara Berg. 2017b. Hierarchically-attentive RNN for album summarization and storytelling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 966–971, Copenhagen, Denmark. Association for Computational Linguistics.

Brian D. Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. thesis, Carnegie Mellon University.

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.