
Learned Spatio-Temporal Adaptive Pooling for Video Captioning

Danny FRANCIS and Benoit HUET

AI4TV 2019, Nice

Video Captioning in a Nutshell

03/09/18 - - p 2

INPUT VIDEO → CAPTIONING MODEL → "Someone is making food"

Video Captioning for TV

Annotations for people with impairments

Old TV archives need to be annotated

Textual indexing

Summarization with shot detection

The Encoder-Decoder Scheme for NMT

Reference: Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).

Attention for Encoder-Decoder

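Reference: Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412-1421).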

Encoder-Decoder for Video Captioning

Frame sequences vs word sequences

Visual feature vectors vs word embeddings

The Encoder-Decoder scheme can be easily extended to video captioning:

– Source language = Video

– Target language = unchanged

Problems with the Naive Approach

Visual features are usually extracted from a CNN

Loss of spatial information

– How to relate one object to another within a frame?

– How to relate an object to one in another frame?

Input image → Convolutions and pooling → Grid of feature vectors → Global pooling → Global features
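The loss of spatial information caused by global pooling can be illustrated with a toy example. This is only a sketch: the 2x2 grid, the 3-dimensional feature vectors, and the helper name `global_average_pool` are hypothetical and not taken from the slides. It shows that two different object layouts collapse to the same global feature vector.

```python
# Hypothetical toy data: a 2x2 grid of 3-dimensional local feature vectors,
# standing in for the grid a CNN produces before its global pooling layer.

def global_average_pool(grid):
    """Average all local feature vectors of a grid into one global vector."""
    cells = [vec for row in grid for vec in row]
    dim = len(cells[0])
    return [sum(vec[d] for vec in cells) / len(cells) for d in range(dim)]

# Two "objects" with distinct features, plus empty background cells.
obj_a = [1.0, 0.0, 0.0]
obj_b = [0.0, 1.0, 0.0]
empty = [0.0, 0.0, 0.0]

grid_1 = [[obj_a, empty], [empty, obj_b]]  # A top-left, B bottom-right
grid_2 = [[obj_b, obj_a], [empty, empty]]  # B top-left, A top-right

pooled_1 = global_average_pool(grid_1)
pooled_2 = global_average_pool(grid_2)

# Both layouts give the same global vector, so object positions are lost.
print(pooled_1 == pooled_2)
```

After pooling, nothing distinguishes the two layouts, which is why relating objects within a frame or across frames becomes difficult.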

Our L-STAP Method (1)

Soft-Attention Pooling

Instead of the global pooling:

– use an LSTM to compute local embeddings

– change computation of LSTM hidden state: soft-attention pooling on local embeddings based on previous hidden state
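The pooling step alone can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not give the scoring function, so a plain dot product between each local embedding and the previous hidden state is assumed here, and the LSTM update itself is left out. The names `soft_attention_pool`, `local_embeddings`, and `prev_hidden` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention_pool(local_embeddings, prev_hidden):
    """Weighted sum of local embeddings, weighted by relevance to prev_hidden.

    Assumption: relevance is a dot product; the slides do not specify it.
    """
    scores = [sum(e * h for e, h in zip(emb, prev_hidden))
              for emb in local_embeddings]
    weights = softmax(scores)
    dim = len(local_embeddings[0])
    pooled = [sum(w * emb[d] for w, emb in zip(weights, local_embeddings))
              for d in range(dim)]
    return pooled, weights

# Three local embeddings (one per spatial location) and a previous hidden state.
local_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
prev_hidden = [2.0, 0.0]

pooled, weights = soft_attention_pool(local_embeddings, prev_hidden)

# With an all-zero hidden state, every location gets the same weight.
uniform_pooled, uniform_weights = soft_attention_pool(local_embeddings, [0.0, 0.0])
```

In this sketch, locations whose embeddings align with the previous hidden state receive larger weights, and an all-zero hidden state reduces the operation to plain average pooling.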

Our L-STAP Method (2)

Learned Spatio-Temporal Adaptive Pooling

It is Learned: pooling depends on training data

It is Spatio-Temporal: LSTM hidden states contain temporal information based on local features

It is Adaptive: the soft-attention pooling of local embeddings makes it adaptive to input data

Video Captioning with L-STAP

Optimization

Usual cross-entropy loss to generate sentences

Make sentence embeddings and visual embeddings match

$\text{Loss} = L_d + 0.4\,L_m$
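The weighted sum of the two terms can be sketched as follows. Only the weight 0.4 and the use of cross-entropy come from the slide. The form of the matching term is not given, so cosine distance between the video and sentence embeddings is assumed purely for illustration, and all function names are hypothetical.

```python
import math

def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth words (decoding term)."""
    nll = -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
    return nll / len(target_ids)

def cosine_distance(u, v):
    """1 - cosine similarity. Assumed stand-in for the matching term."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def total_loss(step_probs, target_ids, video_emb, sentence_emb, weight=0.4):
    """Decoding loss plus the weighted matching loss, with weight 0.4 as on the slide."""
    l_d = cross_entropy(step_probs, target_ids)
    l_m = cosine_distance(video_emb, sentence_emb)
    return l_d + weight * l_m

# Toy decoder output: word distributions for a two-word caption, vocabulary of size 2.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]

l_d = cross_entropy(step_probs, target_ids)
loss_matched = total_loss(step_probs, target_ids, [3.0, 4.0], [3.0, 4.0])
loss_orthogonal = total_loss(step_probs, target_ids, [1.0, 0.0], [0.0, 1.0])
```

When the two embeddings coincide, the matching term vanishes and only the decoding loss remains. When they are orthogonal, the total grows by the full weight of 0.4.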

Matching Component

Improves results by directly matching video embeddings with the ground-truth sentence

Results on MSVD

Model     BLEU-4   ROUGE   METEOR   CIDEr
TSL       51.7     -       34.0     74.9
RecNet    52.3     69.8    34.1     80.3
MGRU      53.8     -       34.5     81.2
AGHA      55.1     -       35.3     83.3
SAM       54.0     -       35.3     87.4
E2E       50.3     70.8    34.1     87.5
SibNet    54.2     71.7    34.8     88.2
Ours      55.1     72.7    35.4     86.7

Ablation Study

Model                           BLEU-4   ROUGE   METEOR   CIDEr
Baseline                        52.7     71.4    34.1     79.5
Baseline + matching             53.3     71.2    34.5     82.2
L-STAP (avg) + matching         55.1     72.3    35.4     84.3
L-STAP (attention) + matching   55.1     72.7    35.4     86.7
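The contribution of each component can be read off the ablation numbers by taking differences between consecutive rows. The short script below does this arithmetic. The scores are copied from the table above, while the dictionary layout and the helper `gain` are only illustrative.

```python
# Ablation scores copied from the table above (MSVD).
ablation = {
    "Baseline": {"BLEU-4": 52.7, "ROUGE": 71.4, "METEOR": 34.1, "CIDEr": 79.5},
    "Baseline + matching": {"BLEU-4": 53.3, "ROUGE": 71.2, "METEOR": 34.5, "CIDEr": 82.2},
    "L-STAP (avg) + matching": {"BLEU-4": 55.1, "ROUGE": 72.3, "METEOR": 35.4, "CIDEr": 84.3},
    "L-STAP (attention) + matching": {"BLEU-4": 55.1, "ROUGE": 72.7, "METEOR": 35.4, "CIDEr": 86.7},
}

def gain(src, dst, metric):
    """Score difference between two rows, rounded to one decimal place."""
    return round(ablation[dst][metric] - ablation[src][metric], 1)

rows = list(ablation)
# Print the step-by-step change in every metric between consecutive rows.
for src, dst in zip(rows, rows[1:]):
    deltas = {m: gain(src, dst, m) for m in ablation[src]}
    print(f"{src} -> {dst}: {deltas}")

matching_cider = gain("Baseline", "Baseline + matching", "CIDEr")
lstap_avg_cider = gain("Baseline + matching", "L-STAP (avg) + matching", "CIDEr")
attention_cider = gain("L-STAP (avg) + matching", "L-STAP (attention) + matching", "CIDEr")
total_cider = gain("Baseline", "L-STAP (attention) + matching", "CIDEr")
```

Read this way, each of the three steps adds between 2 and 3 CIDEr points, while the attention variant leaves BLEU-4 and METEOR unchanged relative to the averaging variant.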

Conclusion

We proposed a Learned Spatio-Temporal Adaptive Pooling method to replace global pooling in CNNs in the context of video captioning

This method clearly improves on the naive approach: on MSVD, with the matching component in both cases, BLEU-4 rises from 53.3 to 55.1 and CIDEr from 82.2 to 86.7

Video captioning is a promising direction for improving the TV user experience and TV archive management

The End

Thank you!
