![Page 1: Show, Attend, and Tell Neural Image Caption Generation ...vicente/vislang/slides/sivahaina.pdf · Introduction We can easily: Segment, localize, and categorize ... MS COCO dataset](https://reader033.vdocuments.mx/reader033/viewer/2022060421/5f17a89ddf2e2e55702edf01/html5/thumbnails/1.jpg)
Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio
University of Montreal and University of Toronto
Presented By: Hannah Li, Sivaraman K S
Introduction
We can easily:
- Segment, localize, and categorize
However,
- Interpreting the image is more difficult
Goal of this work: generate captions for images using an attention mechanism
Related Work - Generating Image Captions
- Recurrent neural networks (Cho et al., 2014; Bahdanau et al., 2014; Sutskever et al., 2014)
- LSTM for videos and images (Vinyals et al., 2014; Donahue et al., 2014)
- Joint CNN-RNN with object detection (Karpathy & Li, 2014; Fang et al., 2014)
- Attention (Larochelle & Hinton, 2010)
Model Overview
Takes a single raw image and generates a caption y as a sequence of 1-of-K encoded words (K = vocabulary size)
Encoder: Convolutional Features
Goal: take a raw image as input and produce a set of feature vectors (annotation vectors)
Produces L vectors, each a D-dimensional representation corresponding to a part of the image
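A minimal sketch of how a convolutional feature map can be turned into L annotation vectors of dimension D. The random array stands in for the output of a CNN layer, and the 14 x 14 x 512 shape and the helper name `to_annotation_vectors` are assumptions chosen for illustration, not taken from the slides.

```python
import numpy as np

def to_annotation_vectors(feature_map: np.ndarray) -> np.ndarray:
    """Flatten an (H, W, D) feature map into L = H * W annotation vectors of size D."""
    height, width, depth = feature_map.shape
    return feature_map.reshape(height * width, depth)

# Stand-in for a CNN feature map; a real encoder would compute this from the image.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((14, 14, 512))

annotations = to_annotation_vectors(feature_map)
print(annotations.shape)  # (L, D)
```

Each row of `annotations` corresponds to one spatial location of the feature map, which is what lets the decoder attend to a part of the image.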
Decoder: Long Short-Term Memory Network
Input, forget, memory, output and hidden state
- W, U, Z: weight matrices
- b: biases
- E: an embedding matrix
- zt: representation of the relevant part of the image at time t
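The gate computations can be sketched as a single LSTM step in which every gate sees the previous word embedding, the previous hidden state, and the context vector zt. This is an illustrative NumPy sketch: the stacked-gate layout, the dimensions, and the function name `lstm_step` are assumptions, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(Ey_prev, h_prev, c_prev, z_t, W, U, Z, b):
    """One decoder step conditioned on the embedded previous word and the context z_t."""
    n = h_prev.shape[0]
    # Pre-activations for the four gates, stacked into one vector of size 4n.
    pre = W @ Ey_prev + U @ h_prev + Z @ z_t + b
    i = sigmoid(pre[0:n])            # input gate
    f = sigmoid(pre[n:2 * n])        # forget gate
    o = sigmoid(pre[2 * n:3 * n])    # output gate
    g = np.tanh(pre[3 * n:4 * n])    # candidate memory
    c = f * c_prev + i * g           # new memory
    h = o * np.tanh(c)               # new hidden state
    return h, c

# Hypothetical sizes: embedding m, hidden n, context D.
m, n, D = 8, 16, 32
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4 * n, m))
U = 0.1 * rng.standard_normal((4 * n, n))
Z = 0.1 * rng.standard_normal((4 * n, D))
b = np.zeros(4 * n)

h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n),
                 rng.standard_normal(D), W, U, Z, b)
```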
Decoder: Long Short-Term Memory Network
Logistic sigmoid activation
Deep output layer to compute the output word probability
Stochastic attention: the weight αt,i is the probability that location i is the correct place to focus on for producing the next word
Deterministic attention: the weight αt,i is the relative importance given to location i when blending the ai's together
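Under either interpretation, the weights over locations form a distribution. A minimal sketch of one way to compute them: score each annotation vector against the previous hidden state, then normalize with a softmax. The single-hidden-layer scoring function, its parameter names (`W_a`, `W_h`, `w`), and all sizes are assumptions for illustration.

```python
import numpy as np

def attention_weights(annotations, h_prev, W_a, W_h, w):
    """Return one non-negative weight per location, summing to 1."""
    # Score each location i from its annotation a_i and the previous hidden state.
    hidden = np.tanh(annotations @ W_a + h_prev @ W_h)  # shape (L, k)
    scores = hidden @ w                                 # shape (L,)
    # Softmax over locations, shifted by the max for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Hypothetical sizes: L locations, D features, n hidden units, k scoring units.
L, D, n, k = 196, 512, 16, 32
rng = np.random.default_rng(0)
annotations = rng.standard_normal((L, D))
h_prev = rng.standard_normal(n)
W_a = 0.01 * rng.standard_normal((D, k))
W_h = 0.01 * rng.standard_normal((n, k))
w = rng.standard_normal(k)

alpha = attention_weights(annotations, h_prev, W_a, W_h, w)
```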
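Hard Attention
Hard attention - the context vector zt is formed by combining a one-hot encoded variable st,i with the extracted features ai, so exactly one location is selected at each step.
Trained using a sampling method (Monte Carlo estimates of the gradient of a variational lower bound on the log-likelihood)
st - where the model decides to focus attention when generating the t-th word
Stochastic - st is drawn from a Multinoulli distribution parameterized by the attention weights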
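A minimal sketch of the sampling step only: draw one location from a Multinoulli distribution over locations and use its annotation vector as the context. The toy annotations, the weights, and the function name `hard_attention` are assumptions; the gradient estimator used for training is not shown.

```python
import numpy as np

def hard_attention(annotations, alpha, rng):
    """Sample a one-hot location s_t ~ Multinoulli(alpha) and return the selected annotation."""
    num_locations = annotations.shape[0]
    index = rng.choice(num_locations, p=alpha)
    s_t = np.zeros(num_locations)
    s_t[index] = 1.0
    z_t = s_t @ annotations  # picks out exactly one row of annotations
    return z_t, s_t

# Toy example with 4 locations and 3-dimensional annotation vectors.
annotations = np.arange(12.0).reshape(4, 3)
alpha = np.array([0.1, 0.2, 0.3, 0.4])
rng = np.random.default_rng(0)

z_t, s_t = hard_attention(annotations, alpha, rng)
```

Because the location is sampled, repeated calls with the same weights can return different context vectors.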
Soft Attention
Computes the expectation of the context vector directly, as a weighted sum of the annotation vectors.
Trained end-to-end with standard backpropagation
Deterministic - the whole distribution is optimized, not single choices (st is not sampled from a distribution)
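A minimal sketch of the deterministic variant: the context vector is the weighted average of all annotation vectors, so no sampling is involved. The toy values and the function name `soft_attention` are assumptions for illustration.

```python
import numpy as np

def soft_attention(annotations, alpha):
    """Expected context vector: sum over locations of alpha_i * a_i."""
    return alpha @ annotations

# Toy example with 3 locations and 2-dimensional annotation vectors.
annotations = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
alpha = np.array([0.5, 0.25, 0.25])

z_t = soft_attention(annotations, alpha)
print(z_t)  # [0.75 0.5 ]
```

Since the result is a smooth function of the weights, gradients can flow through it, which is what makes end-to-end training possible.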
Training
The attention framework learns latent alignments from scratch instead of explicitly using object detectors.
This allows the model to go beyond "objectness" and learn to attend to abstract concepts.
Dataset
Flickr8k and Flickr30k datasets
- 5 reference captions per image
MS COCO dataset
- Discarded captions in excess of 5 per image
- Applied basic tokenization
Fixed vocabulary size of 10K
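The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions: "basic tokenization" is approximated by lowercasing and splitting on whitespace, the input format (a dict from image id to caption list) is hypothetical, and the function name `preprocess` is not from the slides.

```python
from collections import Counter

def preprocess(captions_per_image, max_captions=5, vocab_size=10_000):
    """Cap captions per image, tokenize them, and build a fixed-size vocabulary."""
    tokenized = {
        image_id: [caption.lower().split() for caption in captions[:max_captions]]
        for image_id, captions in captions_per_image.items()
    }
    # Count tokens over all kept captions and keep the most frequent ones.
    counts = Counter(
        token
        for captions in tokenized.values()
        for caption in captions
        for token in caption
    )
    vocab = [token for token, _ in counts.most_common(vocab_size)]
    return tokenized, vocab

# Toy data: the first image has 7 captions, so 2 are discarded.
data = {
    "img1": ["A dog runs"] * 7,
    "img2": ["A cat sits", "a cat sleeps"],
}
tokenized, vocab = preprocess(data, vocab_size=3)
```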
Results
1. Significantly improves the state-of-the-art METEOR score on MS COCO
2. More flexibility - can attend to salient non-object regions
Source: http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.42.58-PM.png
Analysis of learning to attend
Mistakes
Reference
● https://arxiv.org/pdf/1502.03044.pdf
● http://kelvinxu.github.io/projects/capgen.html
● http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
● https://blog.heuritech.com/2016/01/20/attention-mechanism/