1 2 f [email protected], [email protected] arxiv:1910.06737v1 [cs.cv… · 2019-10-16 · 2017....

3
Integrating Temporal and Spatial Attentions for V AT E X Video Captioning Challenge 2019 Shizhe Chen 1 , Yida Zhao 1 , Yuqing Song 1 , Qin Jin 1 , Qi Wu 2 1 Renmin University of China, 2 University of Adelaide {cszhe1, zyiday, syuqing, qjin}@ruc.edu.cn, [email protected] Abstract This notebook paper presents our model in the VATEX video captioning challenge. In order to capture multi-level aspects in the video, we propose to integrate both tempo- ral and spatial attentions for video captioning. The tempo- ral attentive module focuses on global action movements while spatial attentive module enables to describe more fine-grained objects. Considering these two types of at- tentive modules are complementary, we thus fuse them via a late fusion strategy. The proposed model significantly outperforms baselines and achieves 73.4 CIDEr score on the testing set which ranks the second place at the VATEX video captioning challenge leaderboard 2019. 1. Introduction Video captioning is a complicated task which requires to recognize multiple semantic aspects in the video such as scenes, objects, actions etc. and generate a sentence to de- scribe such semantic contents. Therefore, it is important to capture multi-level details from global events to local objects in the video in order to generate an accurate and comprehensive video description. However, state-of-the-art video captioning models [1, 2] mainly utilize overall video representations or temporal attention [3] over short video segments to generate video description, which lack details at the spatial object level and thus are prone to miss or pre- dict inaccurate details in the video. In this work, we propose to integrate spatial object level information and temporal level information for video cap- tioning. The temporal attention is employed to aggregate action movements in the video; while the spatial attention enables the model to ground on fine-grained objects in cap- tion generation. Since the temporal and spatial informa- tion are complementary, we utilize a late fusion strategy to combine captions generated from the two types of attention models. Our proposed model achieves consistent improve- ments over baselines with CIDEr of 88.4 on the VATEX validation set and 73.4 on the testing set, which wins the second place in the VATEX challenge leaderboard 2019. 2. Video Captioning System Our video captioning system consists of three modules, including video encoder to extract global, temporal and spa- tial features from the video, language decoder that employ spatial and temporal attentions respectively to generate sen- tences, and temporal-spatial fusion module to integrate gen- erated sentences from different attentive captioning models. 2.1. Video Encoding Global Video Representation. In order to comprehen- sively encode videos as global representation, we extract multi-modal features from three modalities including im- age, motion and audio. For the image modality, we uti- lize the resnext101 [4] pretrained on the Imagenet to extract global image features every 32 frames and apply average pooling on the temporal dimension; for the motion modal- ity, we utilize the ir-csn model [5] pretrained on Kinetics 400 to extract video segment features every 32 frames fol- lowed by average pooling; for the audio modality, we uti- lize VGGish network [6] pretrained on Youtube8M to ex- tract acoustic features. We concatenate all the three features and employ linear transformation to obtain the global multi- modal video representation ¯ v. 
Temporal Branch. In order to capture action move- ments, we represent the video as a sequence of consecutive segment-level features in the temporal branch. Since the image and motion modality features can be well aligned in the temporal dimension, we concatenate the image feature and the motion feature every 32 frames as the segment-level feature, and then apply linear transformation to obtain the fused embedding for each segment as V t = {v t i , ··· ,v t n }. Spatial Branch. In order to capture fine-grained object details, we employ a Mask R-CNN [7] pretrained on MSCOCO to detect objects in the video every 32 frames. At 1 arXiv:1910.06737v1 [cs.CV] 15 Oct 2019

Upload: others

Post on 31-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 2 f g@ruc.edu.cn, qi.wu01@adelaide.edu.au arXiv:1910.06737v1 [cs.CV… · 2019-10-16 · 2017. [5]Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feis-zli. Video classification

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

Shizhe Chen^1, Yida Zhao^1, Yuqing Song^1, Qin Jin^1, Qi Wu^2

^1 Renmin University of China, ^2 University of Adelaide
{cszhe1, zyiday, syuqing, qjin}@ruc.edu.cn, [email protected]

Abstract

This notebook paper presents our model for the VATEX video captioning challenge. In order to capture multi-level aspects in the video, we propose to integrate both temporal and spatial attentions for video captioning. The temporal attentive module focuses on global action movements, while the spatial attentive module enables the description of more fine-grained objects. Considering that these two types of attentive modules are complementary, we fuse them via a late fusion strategy. The proposed model significantly outperforms baselines and achieves a 73.4 CIDEr score on the testing set, which ranks second place on the VATEX video captioning challenge leaderboard 2019.

1. Introduction

Video captioning is a complicated task which requires recognizing multiple semantic aspects in the video, such as scenes, objects, and actions, and generating a sentence to describe such semantic contents. Therefore, it is important to capture multi-level details, from global events to local objects in the video, in order to generate an accurate and comprehensive video description. However, state-of-the-art video captioning models [1, 2] mainly utilize overall video representations or temporal attention [3] over short video segments to generate video descriptions, which lack details at the spatial object level and thus are prone to miss or predict inaccurate details in the video.

In this work, we propose to integrate spatial object-level information and temporal-level information for video captioning. The temporal attention is employed to aggregate action movements in the video, while the spatial attention enables the model to ground on fine-grained objects during caption generation. Since the temporal and spatial information are complementary, we utilize a late fusion strategy to combine captions generated from the two types of attention models. Our proposed model achieves consistent improvements over baselines with a CIDEr of 88.4 on the VATEX validation set and 73.4 on the testing set, which wins second place on the VATEX challenge leaderboard 2019.

2. Video Captioning System

Our video captioning system consists of three modules: a video encoder to extract global, temporal and spatial features from the video; a language decoder that employs spatial and temporal attentions respectively to generate sentences; and a temporal-spatial fusion module to integrate the generated sentences from the different attentive captioning models.

2.1. Video Encoding

Global Video Representation. In order to comprehensively encode videos into a global representation, we extract multi-modal features from three modalities: image, motion and audio. For the image modality, we utilize ResNeXt-101 [4] pretrained on ImageNet to extract global image features every 32 frames and apply average pooling over the temporal dimension; for the motion modality, we utilize the ir-CSN model [5] pretrained on Kinetics-400 to extract video segment features every 32 frames, followed by average pooling; for the audio modality, we utilize the VGGish network [6] pretrained on YouTube-8M to extract acoustic features. We concatenate all three features and employ a linear transformation to obtain the global multi-modal video representation $\bar{v}$.
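To make this fusion concrete, the following is a minimal PyTorch sketch of the global encoder, assuming precomputed ResNeXt-101, ir-CSN and VGGish features; the feature dimensions and module names are illustrative placeholders rather than the exact configuration used in our system.

```python
import torch
import torch.nn as nn

class GlobalVideoEncoder(nn.Module):
    """Fuse mean-pooled image, motion and audio features into one global vector.

    Feature dimensions are illustrative, not the exact ones used in the paper.
    """
    def __init__(self, dim_img=2048, dim_motion=2048, dim_audio=128, dim_out=512):
        super().__init__()
        self.proj = nn.Linear(dim_img + dim_motion + dim_audio, dim_out)

    def forward(self, img_feats, motion_feats, audio_feats):
        # img_feats:    (batch, n_segments, dim_img)    ResNeXt-101 features
        # motion_feats: (batch, n_segments, dim_motion) ir-CSN features
        # audio_feats:  (batch, n_audio, dim_audio)     VGGish features
        pooled = torch.cat([img_feats.mean(dim=1),
                            motion_feats.mean(dim=1),
                            audio_feats.mean(dim=1)], dim=-1)
        return self.proj(pooled)  # global representation \bar{v}: (batch, dim_out)
```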

Temporal Branch. In order to capture action movements, we represent the video as a sequence of consecutive segment-level features in the temporal branch. Since the image and motion modality features can be well aligned in the temporal dimension, we concatenate the image feature and the motion feature every 32 frames as the segment-level feature, and then apply a linear transformation to obtain the fused embedding for each segment, denoted as $V^t = \{v^t_1, \cdots, v^t_n\}$.
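A corresponding sketch for the temporal branch, again with placeholder dimensions, simply concatenates the aligned per-segment image and motion features and projects them:

```python
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """Build segment-level embeddings V^t from aligned image and motion features."""
    def __init__(self, dim_img=2048, dim_motion=2048, dim_out=512):
        super().__init__()
        self.proj = nn.Linear(dim_img + dim_motion, dim_out)

    def forward(self, img_feats, motion_feats):
        # Both inputs: (batch, n_segments, dim), one feature every 32 frames.
        fused = torch.cat([img_feats, motion_feats], dim=-1)
        return self.proj(fused)  # V^t: (batch, n_segments, dim_out)
```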

Spatial Branch. In order to capture fine-grained object details, we employ a Mask R-CNN [7] pretrained on MSCOCO to detect objects in the video every 32 frames. At most 10 objects are kept for each frame after NMS. We utilize RoI Align [7] to generate object-level features from the feature maps of the above-mentioned image and motion networks, and encode the features with a linear transformation to obtain the spatial embeddings $V^o = \{v^o_1, \cdots, v^o_m\}$.
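A minimal sketch of this object-feature pooling is shown below, using torchvision's roi_align on a single frame's feature map and assuming the detected boxes (already filtered by NMS, at most 10 per frame) are given in image coordinates; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class SpatialBranch(nn.Module):
    """Build object-level embeddings V^o from a frame feature map and detected boxes."""
    def __init__(self, dim_feat=2048, dim_out=512, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.proj = nn.Linear(dim_feat, dim_out)

    def forward(self, feature_map, boxes, spatial_scale):
        # feature_map: (1, dim_feat, H, W) from the image/motion backbone
        # boxes: (m, 4) detected object boxes in image coordinates, m <= 10
        # spatial_scale: ratio between feature-map and image resolution
        pooled = roi_align(feature_map, [boxes], output_size=self.pool_size,
                           spatial_scale=spatial_scale, aligned=True)
        obj_feats = pooled.mean(dim=[2, 3])   # (m, dim_feat)
        return self.proj(obj_feats)           # V^o: (m, dim_out)
```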


Table 1. Captioning performance on the VATEX validation and testing sets.

Model                RL    Validation                        Testing
                           BLEU@4  METEOR  ROUGE  CIDEr      BLEU@4  METEOR  ROUGE  CIDEr
vanilla                    36.7    25.5    52.1   72.5       -       -       -      -
temporal attention         38.4    26.5    53.2   78.5       35.7    25.1    51.5   65.3
spatial attention          38.4    26.3    53.2   77.8       36.7    25.3    52.1   66.5
temporal attention   X     41.4    26.8    54.6   85.4       38.4    25.3    52.8   70.9
spatial attention    X     41.0    26.9    54.6   85.4       -       -       -      -
temporal + spatial   X     42.2    27.4    55.2   88.8       39.1    25.8    53.3   73.4

2.2. Language Decoding

We utilize a two-layer LSTM [8] as the language decoder. The first layer is an attention LSTM, which aims to generate a query vector $h^1_t$ for attention by collecting necessary contexts from the global video representation $\bar{v}$, the previous word embedding $w_{t-1}$ and the previous output of the second-layer LSTM $h^2_{t-1}$ as follows:

$h^1_t = \mathrm{LSTM}([\bar{v}; w_{t-1}; h^2_{t-1}],\, h^1_{t-1}) \qquad (1)$

Then, given the attended memories $V^x$, $x \in \{t, o\}$, the query vector $h^1_t$ dynamically fuses relevant features in $V^x$ into $c_t$ via the attention mechanism:

$\alpha_{t,\cdot} = \mathrm{softmax}\big(w_c^{T} \tanh(W_{vc}\, v^x_{\cdot} + W_{hc}\, h^1_t)\big) \qquad (2)$

$c_t = \sum_i \alpha_{t,i}\, v^x_i \qquad (3)$
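As a reference, a minimal PyTorch sketch of this additive attention (Eqs. 2-3) could look as follows; tensor names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention over memories V^x queried by h^1_t (Eqs. 2-3)."""
    def __init__(self, dim_v, dim_h, dim_att=512):
        super().__init__()
        self.W_vc = nn.Linear(dim_v, dim_att, bias=False)
        self.W_hc = nn.Linear(dim_h, dim_att, bias=False)
        self.w_c = nn.Linear(dim_att, 1, bias=False)

    def forward(self, V, h):
        # V: (batch, n, dim_v) attended memories; h: (batch, dim_h) query h^1_t
        scores = self.w_c(torch.tanh(self.W_vc(V) + self.W_hc(h).unsqueeze(1)))  # (batch, n, 1)
        alpha = torch.softmax(scores, dim=1)                                      # Eq. (2)
        c = (alpha * V).sum(dim=1)                                                # Eq. (3): (batch, dim_v)
        return c, alpha.squeeze(-1)
```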

The contextual feature $c_t$ is then fed into the second-layer LSTM to predict words, where $y_t$ is the target word at the $t$-th step:

$h^2_t = \mathrm{LSTM}([c_t; h^1_t],\, h^2_{t-1}) \qquad (4)$

$p(y_t \mid y_{<t}) = \mathrm{softmax}(W_y h^2_t + b_y) \qquad (5)$
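Putting Eqs. (1)-(5) together, one decoding step of the two-layer LSTM could be sketched as below, reusing the AdditiveAttention module above; the vocabulary size, hidden sizes and embedding layer are illustrative assumptions, not the exact configuration of our system.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Two-layer LSTM decoder: attention LSTM (Eq. 1) + language LSTM (Eqs. 4-5)."""
    def __init__(self, vocab_size, dim_v=512, dim_word=300, dim_h=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim_word)
        self.att_lstm = nn.LSTMCell(dim_v + dim_word + dim_h, dim_h)   # Eq. (1)
        self.attention = AdditiveAttention(dim_v, dim_h)               # Eqs. (2)-(3)
        self.lang_lstm = nn.LSTMCell(dim_v + dim_h, dim_h)             # Eq. (4)
        self.classifier = nn.Linear(dim_h, vocab_size)                 # Eq. (5)

    def step(self, v_bar, V_x, prev_word, state1, state2):
        # v_bar: (batch, dim_v) global representation; V_x: (batch, n, dim_v) memories
        w_prev = self.embed(prev_word)
        h1, c1 = self.att_lstm(torch.cat([v_bar, w_prev, state2[0]], dim=-1), state1)
        c_t, _ = self.attention(V_x, h1)
        h2, c2 = self.lang_lstm(torch.cat([c_t, h1], dim=-1), state2)
        logits = self.classifier(h2)   # softmax over logits gives p(y_t | y_<t)
        return logits, (h1, c1), (h2, c2)
```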

We employ a cross-entropy loss to train the video encoder and language decoder. In order to further boost the captioning performance in terms of the evaluation metrics, we also fine-tune the model with reinforcement learning (RL) [9]. Specifically, we utilize CIDEr and BLEU scores as the reward in RL and combine it with the cross-entropy loss for training.
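A hedged sketch of the self-critical REINFORCE update [9] is shown below; it assumes a sample_captions routine that returns sampled and greedily decoded captions with the sampled log-probabilities, and a reward_fn that computes the CIDEr/BLEU-based reward, both of which are illustrative placeholders.

```python
import torch

def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """REINFORCE with the greedy decode as baseline (self-critical training).

    log_probs:      (batch, max_len) log p(y_t) of the sampled caption tokens
    sampled_reward: (batch,) e.g. CIDEr/BLEU of sampled captions vs. references
    greedy_reward:  (batch,) reward of greedily decoded captions (the baseline)
    """
    advantage = (sampled_reward - greedy_reward).unsqueeze(1)   # (batch, 1)
    # Maximizing expected reward <=> minimizing -advantage * log-prob of samples
    return -(advantage * log_probs).sum(dim=1).mean()

# Illustrative training step mixing RL and cross-entropy losses (names assumed):
# log_probs, xe_loss, sampled_caps, greedy_caps = sample_captions(model, batch)
# rl_loss = self_critical_loss(log_probs,
#                              reward_fn(sampled_caps, batch['refs']),
#                              reward_fn(greedy_caps, batch['refs']))
# loss = rl_loss + lambda_xe * xe_loss   # lambda_xe is a tunable weight (assumption)
```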

2.3. Temporal-Spatial Fusion

Figure 1. Examples on the validation set with generated captions from the temporal attentive model and the spatial attentive model. The spatial attentive model is better at describing fine-grained objects, while the temporal attentive model can describe events more accurately.

The temporal attentive model and the spatial attentive model are complementary to each other since they focus on different aspects of the video. Therefore, we utilize a late fusion strategy to fuse the results from the two types of attentive models. We train a video-semantic embedding model [10] on the VATEX dataset, and use it to select the best video description among the captions generated by the two types of attentive models according to their relevancy to the video.
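This late fusion step can be sketched as follows, assuming a VSE++-style model [10] that exposes encode_video and encode_caption methods mapping both modalities into a joint embedding space; these method names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_caption(vse_model, video_feats, candidate_captions):
    """Pick the candidate caption most relevant to the video in the joint space.

    candidate_captions: e.g. [temporal_model_caption, spatial_model_caption]
    """
    with torch.no_grad():
        v = F.normalize(vse_model.encode_video(video_feats), dim=-1)            # (1, d)
        t = F.normalize(vse_model.encode_caption(candidate_captions), dim=-1)   # (k, d)
        scores = (t @ v.t()).squeeze(-1)   # cosine similarity of each caption to the video
    return candidate_captions[scores.argmax().item()]
```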

3. Experiments

Following the challenge policy, we only employ the VATEX dataset [11] for training. Since some videos are unavailable for download, our training, validation and testing sets contain 25,442, 2,933 and 5,781 videos respectively. To submit results to the server, which requires predictions on the full testing set, we additionally train models using the provided I3D features to generate captions for the unavailable videos.

Table 1 presents the captioning performance of different captioning models on the VATEX dataset. The vanilla model only utilizes the global video representation without any attention mechanism, and is inferior to the temporal and spatial attentive models. We can see that the temporal and spatial attentions are comparable on the validation set, but the spatial attention achieves slightly better performance on the testing set. Since the testing videos might contain zero-/few-shot actions, the spatial attention, which attends to objects, might be more generalizable on those videos than the temporal attention, which focuses on action movements. In Figure 1, we show some generated captions from the temporal and spatial models for videos in the validation set. We can see that the captions from the temporal and spatial models are diverse and focus on different aspects of the videos; for example, the spatial attentive model can generate descriptions about small objects in the video, while the temporal attentive model tends to emphasize the global event. Therefore, it is beneficial to combine the two types of models. After fusing the temporal and spatial attentive models via our late fusion strategy, we achieve the best performance, as shown in Table 1.

4. Conclusion

In the VATEX Challenge 2019, we mainly focus on integrating temporal and spatial attentions for video captioning, which are complementary for comprehensive video description generation. In the future, we will explore more effective methods for spatial-temporal reasoning and fusion. Besides, we will improve the generalization of our models and reduce the performance gap between frequent and few-/zero-shot action videos, for example by using stronger video features pretrained on larger datasets such as Kinetics-600 and by ensembling different video captioning models.

References

[1] Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, and Alexander Hauptmann. ActivityNet 2019 Task 3: Exploring contexts for dense captioning events in videos. arXiv preprint arXiv:1907.05092, 2019.

[2] Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, and Bohyung Han. Streamlined dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6588–6597, 2019.

[3] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision, pages 4507–4515, 2015.

[4] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.

[5] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. arXiv preprint arXiv:1904.02811, 2019.

[6] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.

[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[8] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.

[9] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, volume 1, page 3, 2017.

[10] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2017.

[11] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. arXiv preprint arXiv:1904.03493, 2019.
