![Page 1: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/1.jpg)
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning
The University of Tokyo
Yoshitaka Ushiku
losnuevetoros
![Page 2: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/2.jpg)
Documents = Vision + Language
Vision & Language:
an emerging topic
• Integration of CV, NLP and ML techs
• Several backgrounds
– Impact of Deep Learning
• Image recognition (CV)
• Machine translation (NLP)
– Growth of user-generated content
– Exploratory research on Vision and Language
![Page 3: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/3.jpg)
2012: Impact of Deep Learning
Academic AI startup A famous company
Many slides refer to the first use of CNN (AlexNet) on ImageNet
![Page 4: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/4.jpg)
2012: Impact of Deep Learning
Academic AI startup A famous company
Large gap of error rates
on ImageNet
1st team: 15.3%
2nd team: 26.2%
Many slides refer to the first use of CNN (AlexNet) on ImageNet
![Page 5: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/5.jpg)
2012: Impact of Deep Learning
According to the official site…
1st team w/ DL
Error rate: 15%
2nd team w/o DL
Error rate: 26%
[http://image-net.org/challenges/LSVRC/2012/results.html]
It’s me!!
![Page 6: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/6.jpg)
2014: Another impact of Deep Learning
• Deep learning appears in machine translation [Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing gradient problem in RNNs
→ Handles relations between distant words in a sentence
– A four-layer LSTM is trained end to end
→ Comparable to the state of the art (English to French)
• Emergence of common techs such as CNN/RNN
Reduction of barriers to get into CV+NLP
Input
Output
![Page 7: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/7.jpg)
Growth of user-generated content
Especially in content posting/sharing services
• Facebook: 300 million photos per day
• YouTube: 400 hours of video per minute
Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees look spectacular.
Pairs of a sentence + a video / photo → Collectable in large quantities
![Page 8: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/8.jpg)
Exploratory research on Vision and Language
Captioning an image associated with its article [Feng+Lapata, ACL 2010]
• Input: article + image Output: caption for image
• Dataset: Sets of article + image + caption
× 3361
King Toupu IV died at the
age of 88 last week.
![Page 9: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/9.jpg)
As a result of these backgrounds:
Various research topics such as …
![Page 10: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/10.jpg)
Image Captioning
Group of people sitting at a table with a dinner.
Tourists are standing on the middle of a flat desert.
[Ushiku+, ICCV 2015]
![Page 11: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/11.jpg)
Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
![Page 12: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/12.jpg)
Multilingual + Image Caption Translation
Ein Masten mit zwei Ampeln
für Autofahrer. (German)
A pole with two lights
for drivers. (English)
[Hitschler+, ACL 2016]
![Page 13: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/13.jpg)
Visual Question Answering[Fukui+, EMNLP 2016]
![Page 14: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/14.jpg)
Image Generation from Captions
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
![Page 15: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/15.jpg)
Goal of this keynote
Surveying research on vision & language
• Historical flow of each area
• Changes by Deep Learning
× Deep Learning enabled this research
✓ Deep Learning boosted this research
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
![Page 16: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/16.jpg)
Frontiers of Vision and Language 1
Image Captioning
![Page 17: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/17.jpg)
Every picture tells a story
Dataset: Images + <object, action, scene> + Captions
1. Predict <object, action, scene> for an input image using MRF
2. Search for the existing caption associated with similar <object, action, scene>
<Horse, Ride, Field>
[Farhadi+, ECCV 2010]
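The two steps above (predict a triplet, then look up a caption with a similar triplet) can be sketched as follows. This is only an illustration: the toy dataset, the captions in it, and the slot-counting similarity are assumptions made for the example, and the MRF prediction step is taken as already done.

```python
# Hypothetical toy dataset: each caption is stored with its
# <object, action, scene> triplet. The captions are made up for illustration.
DATASET = [
    (("horse", "ride", "field"), "A person rides a horse across a field."),
    (("pet", "sleep", "ground"), "A dog sleeps on the ground."),
    (("transportation", "move", "track"), "A train moves along the track."),
]


def triplet_similarity(a, b):
    """Count how many of the three slots agree (a simple stand-in similarity)."""
    return sum(x == y for x, y in zip(a, b))


def retrieve_caption(predicted_triplet, dataset=DATASET):
    """Return the stored caption whose triplet best matches the prediction."""
    _, best_caption = max(
        dataset, key=lambda item: triplet_similarity(predicted_triplet, item[0])
    )
    return best_caption


# Step 1 (triplet prediction) is assumed to have produced this triplet.
print(retrieve_caption(("pet", "sleep", "field")))
```

Because the output is always an existing caption, this kind of approach can only be as specific as the closest entry in the dataset.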
![Page 18: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/18.jpg)
Every picture tells a story
<pet, sleep, ground>
See something unexpected.
<transportation, move, track>
A man stands next to a train
on a cloudy day.
[Farhadi+, ECCV 2010]
![Page 19: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/19.jpg)
Retrieve? Generate?
• Retrieve
• Generate
– Template-based, e.g. generating a Subject+Verb sentence
– Template-free
A small gray dog
on a leash.
A black dog
standing in
grassy area.
A small white dog
wearing a flannel
warmer.
Input Dataset
![Page 20: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/20.jpg)
Retrieve? Generate?
• Retrieve
– A small gray dog on a leash.
• Generate
– Template-based, e.g. generating a Subject+Verb sentence
– Template-free
A small gray dog
on a leash.
A black dog
standing in
grassy area.
A small white dog
wearing a flannel
warmer.
Input Dataset
![Page 21: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/21.jpg)
Retrieve? Generate?
• Retrieve
– A small gray dog on a leash.
• Generate
– Template-based: dog+stand ⇒ A dog stands.
– Template-free
A small gray dog
on a leash.
A black dog
standing in
grassy area.
A small white dog
wearing a flannel
warmer.
Input Dataset
![Page 22: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/22.jpg)
Retrieve? Generate?
• Retrieve
– A small gray dog on a leash.
• Generate
– Template-based: dog+stand ⇒ A dog stands.
– Template-free
A small white dog standing on a leash.
A small gray dog
on a leash.
A black dog
standing in
grassy area.
A small white dog
wearing a flannel
warmer.
Input Dataset
![Page 23: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/23.jpg)
Captioning with multi-keyphrases[Ushiku+, ACM MM 2012]
![Page 24: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/24.jpg)
End of sentence
[Ushiku+, ACM MM 2012]
![Page 25: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/25.jpg)
Benefits of Deep Learning
• Refinement of image recognition [Krizhevsky+, NIPS 2012]
• Deep learning appears in machine translation [Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing gradient problem in RNNs
→ Handles relations between distant words in a sentence
– A four-layer LSTM is trained end to end
→ Comparable to the state of the art (English to French)
Emergence of common techs such as CNN/RNN
Reduction of barriers to get into CV+NLP
Input
Output
![Page 26: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/26.jpg)
Google NIC
Concatenation of Google’s methods
• GoogLeNet [Szegedy+, CVPR 2015]
• MT with LSTM [Sutskever+, NIPS 2014]
Caption (word seq.) $S_0, \dots, S_N$ for image $I$
$S_0$: beginning of the sentence
$S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$
$S_t = \mathrm{LSTM}(S_{t-1}), \quad t = 2, \dots, N-1$
$S_N$: end of the sentence
[Vinyals+, CVPR 2015]
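The recurrence on this slide (first word from the CNN feature, each later word from the previous one, stop at the end-of-sentence symbol) can be sketched as a greedy decoding loop. Everything named here is a stand-in: `toy_cnn`, `toy_lstm` and the score table `TOY_SCORES` are hypothetical, and a real LSTM would also carry a hidden state between steps, which this sketch omits.

```python
END = "</s>"

# Hypothetical next-word scores standing in for a trained LSTM's output layer.
# The key "<cnn:dog>" plays the role of the image feature CNN(I).
TOY_SCORES = {
    "<cnn:dog>": {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.8, "cat": 0.2},
    "dog": {"stands": 0.6, END: 0.4},
    "stands": {END: 1.0},
}


def toy_cnn(image_name):
    """Stand-in for CNN(I): maps an image name to a feature token."""
    return "<cnn:" + image_name + ">"


def toy_lstm(prev):
    """Stand-in for one LSTM step: pick the highest-scoring next word."""
    scores = TOY_SCORES[prev]
    return max(scores, key=scores.get)


def generate_caption(image_name, max_len=20):
    """Greedy decoding: emit words until the end-of-sentence symbol appears."""
    words = []
    word = toy_lstm(toy_cnn(image_name))      # S_1 = LSTM(CNN(I))
    while word != END and len(words) < max_len:
        words.append(word)
        word = toy_lstm(word)                 # S_t = LSTM(S_{t-1})
    return " ".join(words)


print(generate_caption("dog"))
```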
![Page 27: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/27.jpg)
Examples of generated captions
[https://github.com/tensorflow/models/tree/master/im2txt]
[Vinyals+, CVPR 2015]
![Page 28: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/28.jpg)
Comparison to [Ushiku+, ACM MM 2012]
Input image
Estimation of important words (conventional object recognition)
• [Ushiku+, ACM MM 2012]: Fisher Vector + Linear classifier
• Neural image captioning: Convolutional Neural Network
Connect the words with grammar model (conventional machine translation)
• [Ushiku+, ACM MM 2012]: Log Linear Model + beam search
• Neural image captioning: Recurrent Neural Network + beam search
• Trained using only images and captions
• Approaches are similar to each other
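Both pipelines rely on beam search to connect words into a sentence. A minimal sketch of beam search over a toy next-word model is given below; the probability table `MODEL` is a made-up stand-in for either the log-linear model or the RNN, not taken from either paper.

```python
import math

# Hypothetical next-word probabilities; "<s>" starts and "</s>" ends a sentence.
MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def beam_search(model, beam_width=2, max_len=10):
    """Keep the beam_width most probable partial sentences at every step."""
    beams = [(0.0, ["<s>"])]          # (log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            for word, p in model[words[-1]].items():
                candidates.append((logp + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, words in candidates[:beam_width]:
            if words[-1] == "</s>":
                finished.append((logp, words))
            else:
                beams.append((logp, words))
        if not beams:
            break
    if not finished:                  # nothing ended within max_len
        finished = beams
    best = max(finished, key=lambda c: c[0])
    return " ".join(w for w in best[1] if w not in ("<s>", "</s>"))


print(beam_search(MODEL, beam_width=1))   # greedy choice
print(beam_search(MODEL, beam_width=2))   # wider beam
```

With a beam width of 1 the search commits to the locally best first word; a wider beam can recover a sentence whose first word looked worse but whose overall probability is higher.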
![Page 29: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/29.jpg)
Current development: Accuracy
• Attention-based captioning [Xu+, ICML 2015]
– Focus on some areas for predicting each word!
– Both attention and caption models are trained
using pairs of an image & caption
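As a rough illustration of "focusing on some areas", the sketch below computes soft attention: relevance scores over image regions are turned into weights with a softmax, and the region features are averaged with those weights. The scores are passed in directly here; in an actual model they would come from a learned function of the region features and the decoder state, which is assumed away in this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, scores):
    """Return attention weights and the weighted average of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Two hypothetical regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(regions, [10.0, 0.0])
print(weights, context)
```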
![Page 30: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/30.jpg)
Current development: Problem setting
Dense captioning
[Lin+, BMVC 2015] [Johnson+, CVPR 2016]
![Page 31: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/31.jpg)
Current development: Problem setting
Generating captions for a photo sequence [Park+Kim, NIPS 2015] [Huang+, NAACL 2016]
The family got together for a cookout.
They had a lot of delicious food.
The dog was happy to be there.
They had a great time on the beach.
They even had a swim in the water.
![Page 32: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/32.jpg)
Current development: Problem setting
Captioning using sentiment terms
[Mathews+, AAAI 2016][Shin+, BMVC 2016]
Neutral caption
Positive caption
![Page 33: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/33.jpg)
Frontiers of Vision and Language 2
Video Captioning
![Page 34: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/34.jpg)
Before Deep Learning
• Grounding of languages and objects in videos[Yu+Siskind, ACL 2013]
– Learning from only videos and their captions
– Experiments in a small setting with few objects
– Controlled and small dataset
• Deep Learning should suit this problem
– Image Captioning: single image → word sequence
– Video Captioning: image sequence → word sequence
![Page 35: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/35.jpg)
End-to-end learning by Deep Learning
• LRCN[Donahue+, CVPR 2015]
– CNN+RNN for
• Action recognition
• Image / Video
Captioning
• Video to Text[Venugopalan+, ICCV 2015]
– CNNs to recognize
• Objects from RGB frames
• Actions from flow images
– RNN for captioning
![Page 36: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/36.jpg)
Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
![Page 37: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/37.jpg)
Video Captioning
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
![Page 38: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/38.jpg)
Video Retrieval from Caption
• Input: Captions
• Output: A video related to the caption
10 sec video clip from 40 min database!
• Video captioning is also addressed
A woman in blue is
playing ping pong in a
room.
A guy is skiing with no
shirt on and yellow
snow pants.
A man is water skiing
while attached to a
long rope.
[Yamaguchi+, ICCV 2017]
![Page 39: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/39.jpg)
Frontiers of Vision and Language 3
Multilingual +
Image Caption Translation
![Page 40: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/40.jpg)
Towards multiple languages
Datasets with multilingual captions
• IAPR TC12 [Grubinger+, 2006] English + German
• Multi30K [Elliott+, 2016] English + German
• STAIR Captions [Yoshikawa+, 2017] English + Japanese
Development of cross-lingual tasks
• Non-English-caption generation
• Image Caption Translation
Input: Pair of a caption in Language A + an image, or a caption in Language A alone
Output: Caption in Language B
![Page 41: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/41.jpg)
Non-English-caption generation
![Page 42: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/42.jpg)
Non-English-caption generation
Most research generates English captions
• Japanese [Miyazaki+Shimizu, ACL 2016]
• Chinese [Li+, ICMR 2016]
• Turkish [Unal+, SIU 2016]
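Çimlerde koşan bir köpek (Turkish: a dog running on the grass)
金色头发的小女孩 (Chinese: a little girl with blond hair)
柵の中にキリンが一頭立っています (Japanese: a giraffe is standing inside a fence)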
![Page 43: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/43.jpg)
Just collecting non-English captions?
Transfer learning among languages[Miyazaki+Shimizu, ACL 2016]
• Vision-Language grounding $W_{im}$ is transferred
• Efficient learning using a small amount of captions
an elephant is
an elephant
一匹の 象が 土の
一匹の 象が
![Page 44: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/44.jpg)
Image Caption Translation
![Page 45: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/45.jpg)
Machine translation via visual data
Images can boost MT [Calixto+,2012]
• Example below (English to Portuguese):
Does the word “seal” in English
– mean “seal” similar to “stamp”?
– mean “seal” which is a sea animal?
• [Calixto+,2012] argue that the mistranslation can be avoided using a related image (w/o experiments)
Mistranslation!
![Page 46: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/46.jpg)
Input: Caption in Language A + image
• Caption translation via an associated image[Elliott+, 2015] [Hitschler+, ACL 2016]
– Generate translation candidates
– Re-rank the candidates using similar images’
captions in Language B
Eine Person in
einem Anzug
und Krawatte
und einem Rock.
(In German)
Translation w/o the related image
A person in a suit and tie
and a rock.
Translation with the related image
A person in a suit and tie
and a skirt.
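The generate-then-re-rank idea can be sketched as follows. This is a simplified stand-in, not the method of the cited papers: the captions of similar images are given by hand (the `neighbours` list is hypothetical, and retrieving them by visual similarity is assumed to have happened already), and candidates are scored by plain word overlap.

```python
def tokens(text):
    """Lower-case and strip simple punctuation."""
    return [w.strip(".,!?") for w in text.lower().split()]


def overlap(candidate, reference):
    """Fraction of candidate words that also occur in the reference caption."""
    cand = tokens(candidate)
    ref = set(tokens(reference))
    return sum(w in ref for w in cand) / len(cand)


def rerank(candidates, similar_image_captions):
    """Order candidates by their best overlap with captions of similar images."""
    def score(candidate):
        return max(overlap(candidate, ref) for ref in similar_image_captions)
    return sorted(candidates, key=score, reverse=True)


candidates = [
    "A person in a suit and tie and a rock.",
    "A person in a suit and tie and a skirt.",
]
# Hypothetical Language-B captions of images that look like the input image.
neighbours = ["A woman wearing a skirt and a tie.", "A person in a suit."]
print(rerank(candidates, neighbours)[0])
```

The text-only translation model cannot tell "Rock" (skirt) from "rock"; the captions of visually similar images supply the missing evidence.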
![Page 47: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/47.jpg)
Input: Caption in Language A
• Cross-lingual document retrieval via images [Funaki+Nakayama, EMNLP 2015]
• Zero-shot machine translation[Nakayama+Nishida, 2017]
![Page 48: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/48.jpg)
Frontiers of Vision and Language 4
Visual Question Answering
![Page 49: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/49.jpg)
Visual Question Answering (VQA)
Proposed in Human-Computer Interfaces
• VizWiz [Bigham+, UIST 2010]
Manually solved on AMT
• Automation for the first time (w/o Deep Learning)[Malinowski+Fritz, NIPS 2014]
• Similar term: Visual Turing Test [Malinowski+Fritz, 2014]
![Page 50: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/50.jpg)
VQA: Visual Question Answering
• Established VQA as an AI problem
– Provided a benchmark dataset
– Experimental results with reasonable baselines
• A portal website is also maintained
– http://www.visualqa.org/
– Annual competition for VQA accuracy
[Antol+, ICCV 2015]
What color are her eyes?
What is the mustache made of?
![Page 51: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/51.jpg)
VQA Dataset
Collected questions and answers on AMT
• Over 100K real images and 30K abstract images
• About 700K questions+10 answers for each
![Page 52: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/52.jpg)
VQA=Multiclass Classification
Feature $z_{I+Q}$ is fed to a standard classifier
Question $Q$: What objects are found on the bed?
Answer $A$: bed sheets, pillow
Image $I$ → Image feature $x_I$
Question $Q$ → Question feature $x_Q$
$x_I$, $x_Q$ → Integrated feature $z_{I+Q}$
![Page 53: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/53.jpg)
Development of VQA
How to calculate the integrated feature $z_{I+Q}$?
• Concatenation: VQA [Antol+, ICCV 2015] just concatenates them
$z_{I+Q} = [x_I; x_Q]$
• Summation, e.g. summation of an image feature with attention and a question feature [Xu+Saenko, ECCV 2016]
$z_{I+Q} = x_I + x_Q$
• Multiplication, e.g. bilinear multiplication using DFT [Fukui+, EMNLP 2016]
$z_{I+Q} = x_I \otimes x_Q$
• Hybrid of summation and multiplication, e.g. concatenation of sum and multiplication [Saito+, ICME 2017]
$z_{I+Q} = [x_I + x_Q; x_I \odot x_Q]$
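The four ways of integrating the image and question features can be written as small functions. This is a minimal sketch on plain Python lists; the function names are made up for the example, and the multiplication uses an element-wise product as a simplification (the bilinear variant on the slide works on an outer product computed efficiently, which is not reproduced here).

```python
def concat(x_i, x_q):
    """Concatenation: stack the two feature vectors."""
    return list(x_i) + list(x_q)


def summation(x_i, x_q):
    """Element-wise sum (both features must have the same length)."""
    return [a + b for a, b in zip(x_i, x_q)]


def multiplication(x_i, x_q):
    """Element-wise product; a simplification of bilinear pooling."""
    return [a * b for a, b in zip(x_i, x_q)]


def hybrid(x_i, x_q):
    """Concatenation of the sum and the product."""
    return summation(x_i, x_q) + multiplication(x_i, x_q)


x_i = [1.0, 2.0]   # hypothetical image feature
x_q = [3.0, 4.0]   # hypothetical question feature
print(concat(x_i, x_q), summation(x_i, x_q),
      multiplication(x_i, x_q), hybrid(x_i, x_q))
```

Whichever variant is chosen, the result is handed to an ordinary classifier over candidate answers.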
![Page 54: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/54.jpg)
VQA Challenge
Examples from competition results
Q: What is the woman holding?
GT A: laptop
Machine A: laptop
Q: Is it going to rain soon?
GT A: yes
Machine A: yes
![Page 55: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/55.jpg)
VQA Challenge
Examples from competition results
Q: Why is there snow on one side of the stream and clear grass on the other?
GT A: shade
Machine A: yes
Q: Is the hydrant painted a new color?
GT A: yes
Machine A: no
![Page 56: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/56.jpg)
Frontiers of Vision and Language 5
Image Generation from Captions
![Page 57: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/57.jpg)
Image generation from input caption
Photo-realistic image generation itself is difficult
• [Mansimov+, ICLR 2016]: Incrementally draw using LSTM
• N.B. Photo synthesis is well studied [Hays+Efros, 2007]
![Page 58: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/58.jpg)
Generative Adversarial Networks (GAN)[Goodfellow+, NIPS 2014]
• Unconditional generative model
• Adversarial learning of Generator and Discriminator
• GAN using convolution … DCGAN [Radford+, ICLR 2016]
Before Conditional Generative Models
Generator
Random vector → Image
Discriminator
Discriminates real or fake
is a fake
image from Generator!
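The adversarial game between the two networks is usually written as the minimax objective below, as formulated in [Goodfellow+, NIPS 2014]. The symbols are not on the slide: $G$ is the Generator, $D$ the Discriminator, $x$ a real image, and $z$ the random vector.

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The Discriminator raises this value by scoring real images high and generated images low; the Generator lowers it by producing images the Discriminator accepts as real.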
![Page 59: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/59.jpg)
![Page 60: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/60.jpg)
![Page 61: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/61.jpg)
![Page 62: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/62.jpg)
Generative Adversarial Networks (GAN)[Goodfellow+, NIPS 2014]
• Unconditional generative model
• Adversarial learning of Generator and Discriminator
• GAN using convolution … DCGAN [Radford+, ICLR 2016]
Before Conditional Generative Models
Generator
Random vector → Image
Discriminator
Discriminates real or fake
is a … hmm
![Page 63: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/63.jpg)
Add a Caption to Generator and Discriminator
Conditional Generative Models
Generator: tries to generate an image that is
・photo-realistic
・related to the caption
Discriminator: tries to detect an image that is
・fake
・unrelated to the caption
[Reed+, ICML 2016]
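One way to make the Discriminator reject both fake and unrelated images is to train it on three kinds of (image, caption) pairs. The sketch below is an assumption-laden illustration: the function name, the argument names and the equal 0.5 weighting of the two negative cases are choices made for this example rather than details stated on the slide.

```python
import math


def discriminator_loss(s_real_match, s_real_mismatch, s_fake_match):
    """Loss over three kinds of (image, caption) pairs.

    Each argument is the Discriminator's probability (strictly between 0 and 1)
    that the pair is "real and matching":
      s_real_match    - real image with its own caption   (should be high)
      s_real_mismatch - real image with a wrong caption   (should be low)
      s_fake_match    - generated image with the caption  (should be low)
    """
    return -(
        math.log(s_real_match)
        + 0.5 * (math.log(1.0 - s_real_mismatch) + math.log(1.0 - s_fake_match))
    )


# A Discriminator that separates the cases well gets a lower loss
# than one that outputs 0.5 for everything.
print(discriminator_loss(0.9, 0.1, 0.1))
print(discriminator_loss(0.5, 0.5, 0.5))
```

The mismatched-caption term is what forces the Discriminator, and through it the Generator, to pay attention to the caption rather than to realism alone.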
![Page 64: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/64.jpg)
Examples of generated images
• Birds (CUB) / Flowers (Oxford-102)
– About 10K images & 5 captions for each image
– 200 kinds of birds / 102 kinds of flowers
A tiny bird, with a tiny beak,
tarsus and feet, a blue crown,
blue coverts, and black
cheek patch
Bright droopy yellow petals
with burgundy streaks, and a
yellow stigma
[Reed+, ICML 2016]
![Page 65: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/65.jpg)
Towards more realistic image generation
StackGAN [Zhang+, 2016]
Two-step GANs
• First GAN generates a small and fuzzy image
• Second GAN enlarges and refines it
![Page 66: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/66.jpg)
Examples of generated images
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
![Page 67: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/67.jpg)
N.B. Results using dataset specialized in birds / flowers
→ Further breakthroughs are needed to generate general images
![Page 68: Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning](https://reader030.vdocuments.mx/reader030/viewer/2022020314/5a64786e7f8b9a31568b45b3/html5/thumbnails/68.jpg)
Take-home Messages
• Surveyed research on vision and language
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
• Contributions of Deep Learning
– Most research themes existed before Deep Learning
– Commodity techs for processing images, videos and natural languages
– Evolution of recognition and generation
Towards a new stage of vision and language!