Multimodal Learning for Image Captioning and Visual Question Answering
Xiaodong He
Deep Learning Technology Center
Microsoft Research
UC Berkeley, April 7th, 2016
Knowledge
Vision
Text
Barack Obama is an American politician serving as the 44th President of the United States. Born in Honolulu, Hawaii, … in 2008, he defeated the Republican nominee and was inaugurated as president on January 20, 2009. (Wikipedia.org)
http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html
Freebase
a man holding a tennis racquet on a tennis court
the man is on the tennis court playing a game
Image Captioning (one step from perception to cognition)
Describe the objects, attributes, and relationships in an image in natural language form
Two entries tied for 1st place at the COCO 2015 Caption Challenge:
Encoder-decoder framework adopted from machine translation. Popular: Google, Montreal, Stanford, Berkeley.
Vinyals, Toshev, Bengio, Erhan, “Show and Tell: A Neural Image Caption Generator,” CVPR, June 2015
Compositional framework: visual concept detection => caption candidate generation => deep semantic ranking. It can potentially exploit non-paired image-caption data more effectively.
[Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, Mitchell, Platt, Zitnick, Zweig, “From Captions to Visual Concepts and Back,” CVPR, June 2015]
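The three-stage flow (visual concept detection, caption candidate generation, semantic ranking) can be illustrated with a toy sketch. Everything here is a stand-in chosen for illustration: the detector scores, the fixed caption template, and the ranking score (summed detector confidence) are hypothetical and are not the detectors, language model, or deep semantic ranker used in Fang et al.

```python
from itertools import permutations

def detect_concepts(word_scores, threshold=0.5):
    """Stage 1 stand-in: keep words whose (hypothetical) detector score clears a threshold."""
    return sorted(w for w, s in word_scores.items() if s >= threshold)

def generate_candidates(concepts, max_candidates=500):
    """Stage 2 stand-in: fill a fixed template with ordered pairs of detected words."""
    candidates = [f"a {a} with a {b}" for a, b in permutations(concepts, 2)]
    return candidates[:max_candidates]

def rank_candidates(candidates, word_scores):
    """Stage 3 stand-in: rank by the summed detector score of the words a caption mentions."""
    def score(caption):
        return sum(word_scores.get(tok, 0.0) for tok in caption.split())
    return sorted(candidates, key=score, reverse=True)

# Hypothetical detector output for one image.
word_scores = {"plate": 0.9, "sandwich": 0.8, "cup": 0.6, "dog": 0.1}
concepts = detect_concepts(word_scores)
candidates = generate_candidates(concepts)
best = rank_candidates(candidates, word_scores)[0]
print(concepts, len(candidates), best)
```

The point of the sketch is only the separation of stages: each stage can be trained or replaced independently, which is what makes the framework compositional.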
Words in figure: cabinets, wooden, kitchen, sink, floor, room, stove
Repeat to generate 500 candidates
[Fang, et al., CVPR 2015]
Huang, He, Gao, Deng, Acero, Heck, “Learning Deep Structured Semantic Model for Web Search,” CIKM, 2013
[Figure: CDSSM architecture, bottom to top]
Word sequence: x_t (<s> w1 w2 … wT <s>), e.g. “a man … bench”
Word hashing layer: f_t, via word hashing matrix W_f
Convolutional layer: h_t, via convolution matrix W_c
Max pooling layer: v, via max pooling operation
Semantic layer: y, via semantic projection matrix W_s
Dimensions shown in the figure: 15K per word at the word hashing layer; 500 at the convolutional, max pooling, and semantic layers.
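The slide names a word hashing layer without spelling out the scheme. A minimal sketch follows, assuming the letter-trigram word hashing described in the DSSM paper cited above: each word is wrapped in boundary marks and broken into letter trigrams, and the word vector counts those trigrams. The function names and the tiny vocabulary are illustrative only.

```python
from collections import Counter

def letter_trigrams(word):
    """Break a word into letter trigrams after adding '#' boundary marks."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def hash_word(word, trigram_vocab):
    """Map a word to a count vector over a fixed trigram vocabulary (the f_t of the figure)."""
    counts = Counter(letter_trigrams(word))
    return [counts.get(tri, 0) for tri in trigram_vocab]

# Build a toy trigram vocabulary from a few words (assumption: a real one has ~15K entries).
words = ["man", "bench"]
trigram_vocab = sorted({tri for w in words for tri in letter_trigrams(w)})
f_man = hash_word("man", trigram_vocab)
print(letter_trigrams("good"), f_man)
```

Because the vocabulary is made of trigrams rather than whole words, unseen or misspelled words still map to a mostly familiar vector.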
– What does the model learn at the convolutional layer?
Capture the local context-dependent word sense
• Learn one embedding vector for each local context-dependent word

The similarity between different “body” within contexts:

| Phrase | Cosine similarity to “car body shop” |
|---|---|
| car body kits | 0.698 |
| auto body repair | 0.578 |
| auto body parts | 0.555 |
| wave body language | 0.301 |
| calculate body fat | 0.220 |
| forcefield body armour | 0.165 |

[Figure: the phrases above placed in the semantic space, from high similarity to low similarity.]

$h_t = W_c \times [f_{t-1}, f_t, f_{t+1}]$
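The convolution formula above can be sketched in NumPy as a sliding window of three word vectors multiplied by W_c. The zero padding at the sentence boundaries and the toy shapes are assumptions for illustration; the slide shows no nonlinearity, so none is applied here.

```python
import numpy as np

def conv_layer(F, W_c):
    """Compute h_t = W_c [f_{t-1}, f_t, f_{t+1}] for every position t.

    F   : (T, d_in) array, one word-hashing vector f_t per row.
    W_c : (d_out, 3 * d_in) convolution matrix.
    Returns an array of shape (T, d_out).
    """
    T, d_in = F.shape
    # Assumption: pad with zero vectors so the first and last words have neighbours.
    padded = np.vstack([np.zeros(d_in), F, np.zeros(d_in)])
    # Row t holds the concatenation [f_{t-1}, f_t, f_{t+1}].
    windows = np.hstack([padded[:-2], padded[1:-1], padded[2:]])
    return windows @ W_c.T

F = np.arange(6, dtype=float).reshape(3, 2)   # three words, d_in = 2
W_c = np.ones((4, 6))                         # d_out = 4
H = conv_layer(F, W_c)
print(H.shape, H[:, 0])
```

With an all-ones W_c each output is just the sum of the window, which makes the windowing easy to check by hand.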
Global intent

$v(i) = \max_{t=1,\dots,T} h_t(i), \quad i = 1, \dots, 300$

Example: auto body repair cost calculator software
Words that win the most active neurons at the max-pooling layer: usually, those are salient words containing clear intents/topics.
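As a small illustration of the max-pooling step, the sketch below takes the element-wise maximum over positions and counts, for each word, how many output dimensions it "wins". The activation values are made up, and `winning_words` is a hypothetical helper, not something named in the slides.

```python
import numpy as np

def max_pool(H):
    """v(i) = max over positions t of h_t(i). H has shape (T, d)."""
    return H.max(axis=0)

def winning_words(H, words):
    """Count how many dimensions of v each word position supplies."""
    winners = H.argmax(axis=0)            # winning position for each dimension i
    counts = {}
    for t in winners:
        counts[words[t]] = counts.get(words[t], 0) + 1
    return counts

words = ["auto", "body", "repair"]
H = np.array([[0.1, 0.9, 0.2],            # made-up activations, one row per word
              [0.8, 0.3, 0.7],
              [0.4, 0.5, 0.6]])
v = max_pool(H)
wins = winning_words(H, words)
print(v, wins)
```

Words with the highest win counts are the ones the slide calls salient.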
[Figure: Mean Reciprocal Rank % (ranking among 5000 candidates on the 5K validation set). Curves: CDSSM d=300, CDSSM d=1000, DSSM d=300. Vertical axis 0.25 to 0.33; horizontal axis 1 to 193.]
[Figure: Harmonic Mean Rank (ranking among 5000 candidates on the 5K val set). Curves: CDSSM d=300, CDSSM d=1000, DSSM d=300. Vertical axis 3 to 4; horizontal axis 1 to 197.]
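For reference, the two ranking metrics can be computed as below, assuming the standard definitions (the slides do not state them). Under these definitions the harmonic mean rank is the reciprocal of the mean reciprocal rank, which is consistent with the two axis ranges reported above. The example ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the correct item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def harmonic_mean_rank(ranks):
    """Harmonic mean of the ranks; equals 1 / mean_reciprocal_rank."""
    return len(ranks) / sum(1.0 / r for r in ranks)

ranks = [1, 2, 4]                 # made-up ranks of the true caption for three images
mrr = mean_reciprocal_rank(ranks)
hmr = harmonic_mean_rank(ranks)
print(mrr, hmr)
```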
Turing Test Results at the MS COCO Captioning Challenge 2015

| System | % of captions that pass the Turing Test | Official Rank |
|---|---|---|
| MSR | 32.2% | 1st |
| Google | 31.7% | 1st |
| MSR Captivator | 30.1% | 3rd |
| Montreal/Toronto | 27.2% | 3rd |
| Berkeley LRCN | 26.8% | 5th |
| Human | 67.5% | -- |

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.
Still a big gap!
| System | BLEU | % Better or Equal to Human |
|---|---|---|
| Model 1: MELM + DMSM | 25.7 | 34.0% |
| Model 2: MRNN | 25.7 | 29.0% |

Human judges were shown a generated caption and a human caption, and chose which was “better”, or equal.
Devlin, Cheng, Fang, Gupta, Deng, He, Zweig, and Mitchell, “Language Models for Image Captioning: The Quirks and What Works,” ACL 2015
Example: MELM+DMSM: “A plate with a sandwich and a cup of coffee”
MRNN: “A close up of a plate of food” (more generic)
[Figure: captioning pipeline. Components: ConvNets, features vector, visual concepts, celebrity, landmark, language model, DMSM, confidence model.]
High confidence: “A small boat in Ha Long Bay”
Low confidence: “This image contains: water, boat, lake, mountain, etc.”
[Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, Chris Sienkiewicz, submitted to CVPR Deep Vision 2016]
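The high/low confidence branch in the pipeline can be sketched as a simple fallback rule. This is an illustration only: the threshold value, the function name, and the way the fallback sentence is assembled are assumptions, not details given in the slides.

```python
def describe(caption, confidence, concepts, threshold=0.5):
    """Return the full caption when confidence is high, else a generic concept list.

    threshold is a hypothetical cut-off; the slides do not give one.
    """
    if confidence >= threshold:
        return caption
    return "This image contains: " + ", ".join(concepts) + ", etc."

concepts = ["water", "boat", "lake", "mountain"]
high = describe("A small boat in Ha Long Bay", 0.8, concepts)
low = describe("A small boat in Ha Long Bay", 0.2, concepts)
print(high)
print(low)
```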
[He, Zhang, Ren, Sun, 2015]
The deep multimodal semantic model's semantic space:
The overall semantics of a caption will also be represented by a vector in this space.
If these two vectors (the image vector and the caption vector) are close to each other, then the caption is a good match for the image.
Otherwise, it is not a matching caption.
[Figure: two parallel networks mapping into the semantic space. Image side: raw image pixels, convolution/pooling, fully connected, image feature, then input s through layers H1 to H3 with weights W1 to W4. Text side: “a man holding a tennis racquet on a tennis court”, input t1 through layers H1 to H3 with weights W1 to W4.]
[Fang, et al., CVPR 2015]
[Huang, He, Gao, Deng et al., 2013]
[He, Zhang, Ren, Sun, 2015]
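The matching rule can be sketched as a similarity comparison between an image vector and each candidate caption vector. Cosine similarity is an assumption here (the slide only says the vectors should be "close"), and the vectors are made-up stand-ins for the outputs of the two networks.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_caption(image_vec, caption_vecs):
    """Score every caption vector against the image vector and return the closest one."""
    scores = {cap: cosine(image_vec, vec) for cap, vec in caption_vecs.items()}
    return max(scores, key=scores.get), scores

# Made-up vectors in a 2-dimensional semantic space.
image_vec = np.array([1.0, 0.0])
caption_vecs = {
    "a man holding a tennis racquet on a tennis court": np.array([2.0, 0.0]),
    "a dog sitting in the grass": np.array([0.0, 1.0]),
}
best, scores = best_caption(image_vec, caption_vecs)
print(best, scores)
```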
[Guo, Zhang, Hu, He, Gao, 2016]
[Figure: image network (input s) and caption network (input t1, caption: “a man holding a tennis racquet on a tennis court”), each with layers H1 to H3 and weights W1 to W4.]
| System | Excellent | Good | Bad | Embarrassing |
|---|---|---|---|---|
| Fang et al., 2015 | 40.6% | 26.8% | 28.8% | 3.8% |
| New system | 51.8% | 23.4% | 22.5% | 2.4% |

Human evaluation on 1000 random samples of the COCO test set.
| System | Excellent | Good | Bad | Embarrassing |
|---|---|---|---|---|
| Fang et al., 2015 | 12.0% | 13.4% | 63.0% | 11.6% |
| New system | 25.4% | 24.1% | 45.3% | 5.2% |

Human evaluation on the Instagram test set, which contains 1380 random images that we scraped from Instagram.
| Conf. score | Excellent | Good | Bad | Embarrassing |
|---|---|---|---|---|
| Mean | 0.59 | 0.51 | 0.26 | 0.20 |
| Std dev | 0.21 | 0.23 | 0.21 | 0.19 |
| Fang2015 (above) | Ours (below) |
|---|---|
| a man wearing a suit and tie | Ian Somerhalder wearing a suit and tie |
| a man taking a picture in front of a mirror | an picture about person |
| a woman standing in front of a christmas tree | a woman standing next to a window |
| a black and white photo of a man wearing a hat | a man posing for a picture |
| a man on a skateboard | this picture is about photo |
| a man holding a stop sign | a man holding a stop sign |
| a colorful kite flying in the air | a table topped with a kite |
| a couple of people at night | a fire hydrant that is lit up at night |
| a black and white photo of a man wearing a hat | a man wearing a bow tie looking at the camera |
| a view of a sunset over water | a view of a sunset in the ocean |
| a dog sitting on top of a grass covered field | a dog sitting in the grass |
| a man holding a baseball bat at a ball | a man swinging a baseball bat in front of a crowd |
| Fang2015 (above) | Ours (below) |
|---|---|
| a woman sitting on a couch | this picture is about person |
| a woman holding a red umbrella | the image is about person |
| two women standing in front of a cake | a woman posing for a picture |
| a man holding a baseball bat on a field | a boy standing in front of a building |
| a person holding a cell phone | a hand holding a cell phone |
| a man holding a teddy bear | a picture about table |
| a pair of scissors sitting on top of a table | a bunch of different items |
| a woman sitting on a bench | a woman sitting on a bench |
| a black and white photo of a woman brushing her hair | a woman standing in front of a mirror |
| a man and a woman wearing a tie | a couple posing for a photo |
| a pair of scissors | the image is about clothing |
| a group of pictures on the wall | this picture seems contain text |
http://CaptionBot.ai
Cognitive Services
When Jen-Hsun Huang was giving a keynote showing off a GPU-powered VR visit to Mt. Everest, here is what our CaptionBot had to say.
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, "Stacked Attention Networks for Image
Question Answering," CVPR 2016 (oral)
Big improvement
umbrella
a herd of elephants standing next to a man
a herd of elephants standing next to Obama
Obama is chased by his Republican competitors
Image credit: http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html
Knowledge: Obama, the president, is from the Democratic Party, whose competitor is the Republican Party, whose mascot is the elephant.
Who is that person?
What is behind that man?
Why are these elephants chasing him?
Character-Level Question Answering with Attention
Reasoning in Vector Space: An Exploratory Study of Question Answering
Deep Reinforcement Learning with an Action Space Defined by Natural Language
Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base
Semantic Parsing for Single-Relation Question Answering
Embedding Entities and Relations for Learning and Inference in Knowledge Bases
Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval