Deep Learning for Computer Vision: Language and Vision (UPC 2016)

Day 4 Lecture 3: Language and Vision
Xavier Giró-i-Nieto

Posted 17-Jan-2017


2

Acknowledgments

Santi Pascual

3

In lecture D2L6 RNNs

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Ñeco, R.P. and Forcada, M.L. (1997, June). "Asynchronous translations with recurrent neural nets." In International Conference on Neural Networks 1997 (Vol. 4, pp. 2535-2540). IEEE.

6

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

Representation or Embedding

For clarity, let's study a Neural Machine Translation (NMT) case.

7

Encoder: One-hot encoding

One-hot encoding: a binary representation of the words in a vocabulary in which the only allowed codes have a single hot (1) bit, with all other bits cold (0).

Word    Binary    One-hot encoding

zero    00        0001

one     01        0010

two     10        0100

three   11        1000

8

Encoder: One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word        One-hot encoding

economic    000010

growth      001000

has         100000

slowed      000001

Encoder: One-hot encoding

One-hot is a very simple representation: every word is equidistant from every other word.

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
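The two tables above can be reproduced in a few lines of plain Python; the four-word vocabulary below is the slide's own example.

```python
def one_hot(word, vocab):
    # K-dimensional one-hot vector for `word`, with K = len(vocab)
    return [1.0 if w == word else 0.0 for w in vocab]

vocab = ["economic", "growth", "has", "slowed"]
print(one_hot("growth", vocab))   # [0.0, 1.0, 0.0, 0.0]

# Every pair of distinct words is equidistant (Euclidean distance sqrt(2)):
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print(dist(one_hot("economic", vocab), one_hot("slowed", vocab)))  # 1.4142...
```

This equidistance is exactly why one-hot vectors carry no notion of word similarity, motivating the continuous projection on the next slides.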

10

Encoder: Projection to continuous space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

s_i = E w_i

The K-dimensional one-hot vector w_i is linearly projected to a space of lower dimension M (typically 100-500) by a matrix E of learned weights.

11

Encoder: Projection to continuous space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

s_i = E w_i

The projection matrix E corresponds to a fully connected layer, so its parameters are learned during training.
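Because w_i is one-hot, multiplying by E simply selects one column of E, i.e. an embedding lookup. A small sketch; the 2x4 weight matrix below is made up for illustration.

```python
# Projecting a one-hot vector with a learned matrix E is just a lookup.
# Illustrative sizes: K=4 vocabulary words, M=2 continuous dimensions.
E = [[0.1, 0.5, -0.3, 0.9],   # M x K projection matrix (made-up weights)
     [0.7, -0.2, 0.4, 0.0]]

def project(E, w):
    # s = E w  (matrix-vector product)
    return [sum(e_j * w_j for e_j, w_j in zip(row, w)) for row in E]

w = [0.0, 1.0, 0.0, 0.0]      # one-hot vector for word index 1
s = project(E, w)
print(s)                       # [0.5, -0.2]  == column 1 of E
```

In a real framework the lookup is done directly by index (e.g. an embedding layer) instead of a full matrix product, but the result is identical.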

12

Encoder: Projection to continuous space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

The encoder maps a sequence of words to a sequence of continuous-space word representations.

13

Encoder: Recurrence

Sequence

Figure: Christopher Olah, "Understanding LSTM Networks" (2015)

14

Encoder: Recurrence

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

15

Encoder: Recurrence

Figure: the recurrence unrolled over time, shown both in front view and in side view (rotated 90°).

16

Encoder: Recurrence

The hidden state after the last word is the representation (embedding) of the whole sentence.
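The recurrence can be sketched with a plain (non-gated) RNN update h_t = tanh(W h_{t-1} + U s_t); the weights and word vectors below are made up, and the final h_t plays the role of the sentence embedding.

```python
import math

def rnn_encode(seq, W, U):
    # h_t = tanh(W h_{t-1} + U s_t); the final h is the sentence embedding.
    h = [0.0] * len(W)
    for s in seq:
        h = [math.tanh(sum(W[i][j] * h[j] for j in range(len(h)))
                       + sum(U[i][j] * s[j] for j in range(len(s))))
             for i in range(len(W))]
    return h

# Made-up 2x2 weights and a 3-word sequence of 2-d word vectors:
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[1.0, 0.0], [0.0, 1.0]]
seq = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
embedding = rnn_encode(seq, W, U)
print(embedding)   # a 2-d sentence embedding
```

Real encoders use gated units (LSTM/GRU, as in the cited slides) for long sequences, but the state-update structure is the same.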

17

Sentence Embedding

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.

Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.

18

(Word Embeddings)

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product:

e(k) = w_k^T z_i

where z_i is the RNN internal state and w_k holds the neuron weights for word k.

21

Decoder

Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.

...and finally normalize the scores to word probabilities with a softmax:

p(w_i = k | w_{<i}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the probability that the i-th word is word k is conditioned on the previous words and the hidden state.
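The scoring and softmax steps can be sketched as follows; the 3-word vocabulary, state, and weights are made up for illustration.

```python
import math

def softmax(scores):
    # Normalize scores to probabilities: p(k) = exp(e_k) / sum_j exp(e_j)
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def word_scores(z_i, Wout):
    # Dot-product score for each vocabulary word k: e(k) = w_k . z_i
    return [sum(w * z for w, z in zip(w_k, z_i)) for w_k in Wout]

z_i = [0.2, -0.1, 0.4]                   # decoder internal state (made up)
Wout = [[1.0, 0.0, 0.0],                 # one weight row per vocabulary word
        [0.0, 1.0, 0.0],
        [0.5, 0.5, 0.5]]
probs = softmax(word_scores(z_i, Wout))
print(probs)                              # three probabilities summing to 1.0
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow for large scores.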

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words are generated for the decoded sentence until an <EOS> (End Of Sentence) "word" is predicted.

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder: Training

Training requires a dataset of pairs of sentences in the two languages to translate.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.

25

Encoder-Decoder: Seq2Seq

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.

26

Encoder-Decoder: Beyond text

27

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

28

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

The Multimodal Recurrent Neural Network only takes the image features into account in the first hidden state.

29

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

30

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

31

Captioning: LSTM for image & video

Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]

32

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

33

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

34

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

36

Captioning: HRNE

(Slides by Marc Bolaños) Pan, Pingbo, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

Figure: an LSTM unit in the 2nd layer reads, over time, the hidden state at t = T produced by the first-layer LSTM for each chunk of data (from t = 1 to t = T), starting with the first chunk.

37

Visual Question Answering

Both the image (encoded as [z1, z2, ..., zN]) and the question "Is economic growth decreasing?" (encoded as [y1, y2, ..., yM]) are encoded; the answer "Yes" is then decoded.

38

Pipeline: extract visual features from the image, embed the question, merge both representations, and predict the answer.

Question: "What object is flying?"  Answer: "Kite"

Visual Question Answering

Slide credit Issey Masuda
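A schematic of the merge step in the pipeline above; the feature vectors, the tiny answer vocabulary, and the concatenate-then-classify design are illustrative stand-ins, not the exact architecture from the slides.

```python
# Toy VQA merge: concatenate image features and question embedding,
# then score a small answer vocabulary with a single linear layer.
ANSWERS = ["kite", "dog", "car"]

def predict_answer(img_feat, q_emb, W):
    merged = img_feat + q_emb                 # concatenation of the two lists
    scores = [sum(w * x for w, x in zip(row, merged)) for row in W]
    return ANSWERS[scores.index(max(scores))]

img_feat = [0.9, 0.1]       # made-up visual features
q_emb = [0.2, 0.8]          # made-up question embedding
W = [[1.0, 0.0, 0.0, 1.0],  # one row of weights per candidate answer
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
print(predict_answer(img_feat, q_emb, W))   # kite
```

Treating VQA as classification over a fixed answer vocabulary, as here, is the common baseline setup; the papers that follow refine how the two modalities are merged.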

39

Visual Question Answering

Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

41

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and treat each region as equivalent to a sentence.

Local region feature extraction: a CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions each.

Visual feature embedding: a matrix W projects the image features into the "q" textual space.
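The 512x14x14 → 196x512 reorganization above is just a reshape plus transpose; a sketch with a tiny 2x3x3 stand-in tensor (the real VGG-19 output is 512x14x14).

```python
# Turn a DxHxW CNN feature map into H*W region vectors of dimension D.
def to_region_vectors(fmap):
    D, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[d][i][j] for d in range(D)]   # one D-dim vector per location
            for i in range(H) for j in range(W)]

# Tiny stand-in: D=2 channels over a 3x3 grid (VGG-19 gives 512 over 14x14).
fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        [[10, 20, 30], [40, 50, 60], [70, 80, 90]]]
regions = to_region_vectors(fmap)
print(len(regions), len(regions[0]))   # 9 2   (196 512 for VGG-19)
print(regions[0])                       # [1, 10]
```

Each resulting row is one local region vector, which the model then projects with W into the textual space, exactly as the slide describes.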

42

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets: Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets: Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge: Storytelling

Storytelling

47

Challenge: Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenge: Movie Question Answering

Movie Question Answering

49

Challenge: Visual Question Answering

Visual Question Answering

50

Results on the VQA challenge (accuracy, %):

Humans                              83.30

UC Berkeley & Sony                  66.47

Baseline: LSTM & CNN                54.06

I. Masuda-Mora                      53.62

Baseline: Nearest neighbor          42.85

Baseline: Prior per question type   37.47

Baseline: All "yes"                 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners!]

Challenge: Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers: a great topic for your thesis.

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: how to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi / ProfessorXavi

Page 2: Deep Learning for Computer Vision: Language and vision (UPC 2016)

2

Acknowledgments

Santi Pascual

3

In lecture D2L6 RNNs

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 3: Deep Learning for Computer Vision: Language and vision (UPC 2016)

3

In lecture D2L6 RNNs

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 4: Deep Learning for Computer Vision: Language and vision (UPC 2016)

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}
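The state update z_i = f(z_{i-1}, u_{i-1}, h_T) can be sketched as below. All sizes and weight matrices are assumed toy values, and a simple tanh combination stands in for the GRU used in the actual paper.

```python
import numpy as np

rng = np.random.default_rng(6)
h_dim, d = 5, 4                                  # state dim, word dim (assumed)
W = rng.normal(scale=0.1, size=(h_dim, h_dim))   # weights for z_{i-1}
U = rng.normal(scale=0.1, size=(h_dim, d))       # weights for u_{i-1}
C = rng.normal(scale=0.1, size=(h_dim, h_dim))   # weights for h_T

h_T = rng.normal(size=h_dim)     # sentence embedding from the encoder
u_prev = rng.normal(size=d)      # embedding of the previously emitted word
z_prev = np.zeros(h_dim)         # previous decoder state

# z_i combines all three sources of information:
z_i = np.tanh(W @ z_prev + U @ u_prev + C @ h_T)
assert z_i.shape == (h_dim,)
```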

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product:

e(k) = w_k · z_i

where z_i is the RNN internal state and w_k are the neuron weights for word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

...and finally normalize to word probabilities with a softmax:

p(u_i = k | u_1, ..., u_{i-1}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the left-hand side is the probability that the i-th word is word k, given the previous words and the hidden state
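The dot-product scoring and softmax normalization described above can be sketched as follows (toy sizes, random weights assumed): stacking all the w_k rows into one matrix scores every word at once, and the softmax turns the scores into a valid probability distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
K, h = 8, 5                        # vocabulary size, state dim (assumed)
W_out = rng.normal(size=(K, h))    # one weight row w_k per word k
z = rng.normal(size=h)             # current decoder internal state z_i

e = W_out @ z                      # e(k) = w_k . z_i, for all k at once
p = np.exp(e - e.max())            # softmax, shifted for numerical stability
p /= p.sum()

assert np.isclose(p.sum(), 1.0)    # probabilities over the vocabulary
```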

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted

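The generate-until-<EOS> loop can be sketched as below. The vocabulary, the random weights, and the simplified state update are all toy assumptions; a hard length cap guards against <EOS> never being predicted.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["<EOS>", "economic", "growth", "has", "slowed"]  # assumed vocabulary
K, h = len(vocab), 4
W_out = rng.normal(size=(K, h))          # output scoring weights
W_z = rng.normal(scale=0.1, size=(h, h)) # stand-in for the real state update

z = rng.normal(size=h)                   # initial decoder state
words, max_len = [], 10
for _ in range(max_len):                 # cap in case <EOS> never wins
    k = int((W_out @ z).argmax())        # greedy: pick highest-scoring word
    if vocab[k] == "<EOS>":
        break                            # stop decoding at <EOS>
    words.append(vocab[k])
    z = np.tanh(W_z @ z)                 # advance the decoder state
```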

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder Training

Dataset of pairs of sentences in the two languages to translate

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015 [code]

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. CVPR 2016

(Figure: over time, from t = 1 to t = T, a 2nd-layer LSTM unit reads the hidden state at t = T produced by the first layer for each chunk of image data.)

37

Visual Question Answering

[z_1, z_2, ..., z_N] [y_1, y_2, ..., y_M]

"Is economic growth decreasing?"

"Yes"

Encode / Encode

Decode

38

Extract visual features

Embedding

Merge / Predict answer

Question: "What object is flying?"

Answer: "Kite"

Visual Question Answering

Slide credit Issey Masuda
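The pipeline on the slide (extract visual features, embed the question, merge, predict an answer) can be sketched as below. All sizes, the random weights, and the merge-by-concatenation choice are assumptions for illustration, not the exact model from the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
img_feat = rng.normal(size=4096)   # e.g. a CNN fc-layer vector (assumed)
q_emb = rng.normal(size=300)       # question embedding (assumed)

n_answers = 1000                   # answer vocabulary size (assumed)
W = rng.normal(scale=0.01, size=(n_answers, 4096 + 300))

merged = np.concatenate([img_feat, q_emb])  # merge the two modalities
scores = W @ merged                         # linear answer classifier
answer = int(scores.argmax())               # predicted answer index
assert 0 <= answer < n_answers
```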

39

Visual Question Answering

Noh, H., Seo, P. H., and Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. arXiv preprint arXiv:1603.01417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence

Local Region Feature Extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions

Visual feature embedding: a matrix W projects image features to the "q" (textual) space
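The local-region step described above can be sketched as follows: the 512x14x14 pooling output is reshaped into 196 region vectors of 512 dimensions, then each is projected to the textual space with a matrix W (the textual dimension and the random weights here are assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)
feats = rng.normal(size=(512, 14, 14))   # last pooling layer of VGG-19

# Each of the 14x14 spatial positions becomes one 512-d region vector.
regions = feats.reshape(512, 196).T
assert regions.shape == (196, 512)

q_dim = 100                              # textual space dim (assumed)
W = rng.normal(scale=0.01, size=(q_dim, 512))  # learned in practice
projected = regions @ W.T                # embed each region in "q" space
assert projected.shape == (196, q_dim)
```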

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded Question Answering in Images. CVPR 2016

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Results (accuracy, %): Humans 83.30; UC Berkeley & Sony 66.47; Baseline LSTM & CNN 54.06; Baseline Nearest neighbor 42.85; Baseline Prior per question type 37.47; Baseline All yes 29.88; I. Masuda-Mora 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering". Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

httpsimatgeupceduwebpeoplexavier-giro

DocXavi / ProfessorXavi

Page 5: Deep Learning for Computer Vision: Language and vision (UPC 2016)

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 6: Deep Learning for Computer Vision: Language and vision (UPC 2016)

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 7: Deep Learning for Computer Vision: Language and vision (UPC 2016)

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the "q" textual space.
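The region-extraction and embedding steps can be sketched with array reshapes. The 448x448 input, 512x14x14 feature map, and 196 region vectors come from the slide; the textual dimension d_q and the random matrix W are assumptions standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for VGG-19's last pooling output on a 448x448 input: 512 x 14 x 14.
feat_map = rng.normal(size=(512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of dimension 512.
regions = feat_map.reshape(512, 14 * 14).T   # shape (196, 512)

# Project each region into the textual ("q") space with a matrix W.
d_q = 300                                    # assumed textual dimension
W = 0.01 * rng.normal(size=(512, d_q))
regions_q = regions @ W                      # shape (196, 300)
print(regions.shape, regions_q.shape)
```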

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description, Retrieval, and Fill-in-the-Blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

VQA accuracy (%):

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline LSTM & CNN: 54.06

Baseline Nearest neighbor: 42.85

Baseline Prior per question type: 37.47

Baseline All yes: 29.88

Model from the thesis below: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 11: Deep Learning for Computer Vision: Language and vision (UPC 2016)

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan Zhongwen Xu Yi Yang Fei Wu Yueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden state at t = T

first chunk of data
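The hierarchical encoding sketched above (a first-layer LSTM runs over chunks of consecutive frames; a second-layer LSTM runs over the chunk-final hidden states at t = T) can be illustrated with plain RNN cells in NumPy; chunk size, dimensions, and weights are toy assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4                                          # toy feature/hidden dimension
W1, U1 = rng.normal(0, 0.3, (d, d)), rng.normal(0, 0.3, (d, d))  # layer 1
W2, U2 = rng.normal(0, 0.3, (d, d)), rng.normal(0, 0.3, (d, d))  # layer 2

def rnn(xs, W, U):
    """Plain RNN pass; return the hidden state at the last time step."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h

def hierarchical_encode(frames, chunk=3):
    # First layer: one RNN pass per chunk of consecutive frames.
    chunk_states = [rnn(frames[i:i + chunk], W1, U1)
                    for i in range(0, len(frames), chunk)]
    # Second layer: an RNN over the chunk-level states gives the video embedding.
    return rnn(chunk_states, W2, U2)

video = rng.normal(size=(9, d))                # 9 toy frame features
z = hierarchical_encode(video)
print(z.shape)
```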

37

Visual Question Answering

[z1, z2, …, zN] [y1, y2, …, yM]

“Is economic growth decreasing?”

“Yes”

Encode Encode

Decode

38

Extract visual features

Embedding

Merge

Predict answer

Question: What object is flying?

Answer: Kite

Visual Question Answering

Slide credit Issey Masuda
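A minimal sketch of this merge-style VQA pipeline, assuming a pointwise-product merge (concatenation is an equally common choice); all names and sizes are illustrative, and a trained model would learn these weights:

```python
import numpy as np

rng = np.random.default_rng(2)
V_q, d, feat_dim, n_answers = 20, 8, 16, 5   # toy sizes
E = rng.normal(0, 0.1, (V_q, d))             # question word embeddings
W, U = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
Wv = rng.normal(0, 0.1, (feat_dim, d))       # visual features -> question space
Wa = rng.normal(0, 0.1, (n_answers, d))      # answer classifier

def answer(question_ids, cnn_feature):
    h = np.zeros(d)
    for t in question_ids:                   # encode the question with an RNN
        h = np.tanh(E[t] @ W + h @ U)
    v = np.tanh(cnn_feature @ Wv)            # embed the image
    merged = h * v                           # merge (pointwise product here)
    scores = Wa @ merged                     # predict the answer class
    return int(scores.argmax())

a = answer([4, 7, 2], rng.normal(size=feat_dim))
print(a)
```

Treating the answer as a classification over a fixed set (rather than decoding free text) is the common simplification in these pipelines.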

39

Visual Question Answering

Noh H Seo P H & Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local Region Feature Extraction: CNN (VGG-19). (1) Rescale input to 448x448. (2) Take output from last pooling layer → D=512x14x14 → 196 512-d local region vectors

Visual feature embedding: W matrix to project image features to the “q” textual space
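The region-extraction step reduces to a reshape plus a learned projection; a sketch with a random tensor standing in for the VGG-19 pooling output (the textual dimension and the tanh are illustrative assumptions):

```python
import numpy as np

# Stand-in for the VGG-19 last-pooling output given a 448x448 input.
conv_out = np.random.default_rng(3).normal(size=(512, 14, 14))

# The 14x14 spatial grid becomes 196 local region vectors of dimension 512.
regions = conv_out.reshape(512, 14 * 14).T          # shape (196, 512)

d_text = 8                                          # toy textual dimension
W = np.random.default_rng(4).normal(size=(512, d_text))
regions_in_q_space = np.tanh(regions @ W)           # project into the "q" space

print(regions.shape, regions_in_q_space.shape)
```

Each of the 196 vectors can then be fed to the memory network as if it were one "sentence" of the visual input.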

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-Fei Visual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Accuracy on the Visual Question Answering challenge (out of 100.0):

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline LSTM&CNN: 54.06

Baseline Nearest neighbor: 42.85

Baseline Prior per question type: 37.47

Baseline All yes: 29.88

I. Masuda-Mora, “Open-Ended Visual Question-Answering”: 53.62. Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI’s image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 12: Deep Learning for Computer Vision: Language and vision (UPC 2016)

12

Encoder Projection to continuous space

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Sequence of continuous-space word representations

Sequence of words
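Multiplying a one-hot word vector by the learned matrix E just selects one column of E, which is why embeddings are implemented in practice as a table lookup; a tiny sketch (sizes are toy):

```python
import numpy as np

K, d = 6, 3                       # vocabulary size and embedding dimension (toy)
E = np.arange(K * d, dtype=float).reshape(d, K)   # "learned" projection matrix

def one_hot(i, K):
    v = np.zeros(K)
    v[i] = 1.0
    return v

w = one_hot(4, K)                 # one-hot word vector
s = E @ w                         # continuous representation s_i = E w_i

# The product equals column 4 of E: the matmul is just a lookup.
assert np.allclose(s, E[:, 4])
print(s)
```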

Page 13: Deep Learning for Computer Vision: Language and vision (UPC 2016)

13

Encoder Recurrence

Sequence

Figure: Christopher Olah, “Understanding LSTM Networks” (2015)
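The encoder recurrence is one weight-tied update per time step, e.g. h_t = tanh(W x_t + U h_{t-1}) in the simplest RNN cell (an LSTM replaces this update with a gated one); a toy NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_h = 4, 3                      # toy input and hidden dimensions
W = rng.normal(0, 0.5, (d_h, d_in))   # input-to-hidden weights
U = rng.normal(0, 0.5, (d_h, d_h))    # hidden-to-hidden weights

def encode(xs):
    h = np.zeros(d_h)                 # h_0
    states = []
    for x in xs:                      # the same W, U are reused at every step
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

states = encode(rng.normal(size=(5, d_in)))
print(len(states), states[-1].shape)
```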

Page 14: Deep Learning for Computer Vision: Language and vision (UPC 2016)

14

Encoder Recurrence

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Page 15: Deep Learning for Computer Vision: Language and vision (UPC 2016)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90°

Page 16: Deep Learning for Computer Vision: Language and vision (UPC 2016)

16

Encoder: Recurrence

[Figure: the unrolled encoder RNN, shown from the front and, rotated 90°, from the side. The final hidden state is the representation (embedding) of the sentence.]

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
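The 2-dimensional PCA used to visualize these clusters can be sketched in a few lines of numpy; the data below is synthetic and simply stands in for LSTM hidden states (two artificial "meaning" clusters):

```python
import numpy as np

def pca_2d(states):
    """Project row vectors onto their first two principal components."""
    centered = states - states.mean(axis=0)          # remove the mean embedding
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                       # one (x, y) point per sentence

# Toy "sentence embeddings": two meaning clusters in a 5-d state space.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(10, 5))
cluster_b = rng.normal(loc=2.0, scale=0.1, size=(10, 5))
coords = pca_2d(np.vstack([cluster_a, cluster_b]))   # shape (20, 2), ready to plot
```

Plotting `coords` would show the two clusters well separated along the first principal component, which is the effect the slide describes for real LSTM states.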

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and its previous internal state z_{i-1}.

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product:

e(k) = w_k · z_i

where z_i is the RNN internal state and w_k are the neuron weights for word k.

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

... and finally normalize to word probabilities with a softmax:

p(w_i = k | w_{<i}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the probability that the i-th word is word k is conditioned on the previous words and on the hidden state.
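The scoring and softmax steps can be sketched directly in numpy; the state size and vocabulary size below are arbitrary toy values:

```python
import numpy as np

def word_probabilities(z_i, W):
    """Score every vocabulary word against the decoder state and normalize.

    z_i : decoder internal state, shape (d,)
    W   : one weight vector per vocabulary word, shape (V, d)
    """
    scores = W @ z_i                 # e(k) = w_k . z_i, one dot product per word
    scores -= scores.max()           # stabilize the exponentials
    probs = np.exp(scores)
    return probs / probs.sum()       # softmax: probabilities sum to 1

rng = np.random.default_rng(1)
z = rng.normal(size=8)               # toy decoder state
W = rng.normal(size=(1000, 8))       # toy vocabulary of 1000 words
p = word_probabilities(z, W)         # distribution over the next word
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow, a standard trick when implementing softmax.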

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words of the decoded sentence are generated until an <EOS> (End of Sentence) "word" is predicted.
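The whole generation loop — update the state from the previous word, pick the most probable word, stop at <EOS> — can be sketched as a toy greedy decoder. This is a minimal illustration, not the paper's architecture: a plain tanh recurrence stands in for the real decoder cell, and all weight shapes are invented for the example:

```python
import numpy as np

def greedy_decode(h, A, E, C, W, eos=0, max_len=20):
    """Toy greedy decoder: update the state, emit the argmax word, stop at <EOS>.

    h: sentence embedding from the encoder.
    A: state-to-state weights; E: word-embedding table (one row per word);
    C: projects the sentence embedding into the state space; W: output weights.
    """
    z = np.tanh(C @ h)                     # initial state from the sentence embedding
    u = eos                                # start from a dummy previous word
    words = []
    for _ in range(max_len):
        z = np.tanh(A @ z + E[u] + C @ h)  # z_i from z_{i-1}, u_{i-1} and h
        u = int(np.argmax(W @ z))          # most probable next word
        if u == eos:                       # <EOS> ends the sentence
            break
        words.append(u)
    return words

rng = np.random.default_rng(0)
d, dh, V = 6, 4, 12                        # state size, embedding size, vocabulary size
A, E = rng.normal(size=(d, d)), rng.normal(size=(V, d))
C, W = rng.normal(size=(d, dh)), rng.normal(size=(V, d))
sentence = greedy_decode(rng.normal(size=dh), A, E, C, W)  # list of word indices
```

With untrained random weights the output is of course meaningless; the point is the control flow: generation is open-ended and terminates on the predicted <EOS> token or a length cap.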

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder: Training

Dataset of pairs of sentences in the two languages to translate.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning: Show & Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning: Show & Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning: LSTM for image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 (code).

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

[Figure: a first-layer LSTM runs over the video frames from t = 1 to t = T; its hidden state at t = T summarizes the first chunk of data and is fed to an LSTM unit in the 2nd layer, which runs over the chunk summaries through time.]
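The hierarchical idea — encode each chunk of frames, then encode the chunk summaries — can be sketched compactly. This is a toy illustration: a plain tanh RNN stands in for the LSTM units, and all sizes are invented:

```python
import numpy as np

def rnn_encode(xs, Wx, Wh):
    """Simple tanh RNN standing in for an LSTM: return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def hierarchical_encode(frames, chunk_len, Wx1, Wh1, Wx2, Wh2):
    """HRNE idea: summarize each chunk, then encode the sequence of summaries."""
    summaries = [rnn_encode(frames[i:i + chunk_len], Wx1, Wh1)
                 for i in range(0, len(frames), chunk_len)]
    return rnn_encode(summaries, Wx2, Wh2)       # video-level representation

rng = np.random.default_rng(2)
frames = rng.normal(size=(12, 5))                # 12 frame features of dimension 5
Wx1, Wh1 = rng.normal(size=(8, 5)), rng.normal(size=(8, 8))
Wx2, Wh2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
video_vec = hierarchical_encode(frames, 4, Wx1, Wh1, Wx2, Wh2)  # shape (8,)
```

The second layer sees one vector per chunk rather than one per frame, which is what shortens the temporal paths in the real HRNE.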

37

Visual Question Answering

[Figure: the image and the question "Is economic growth decreasing?" are each encoded, into [z1, z2, …, zN] and [y1, y2, …, yM]; the two encodings are then decoded together into the answer "Yes".]

38

Pipeline: extract visual features from the image, embed the question, merge the two representations, and predict the answer. Example question: "What object is flying?" Answer: "Kite".

Visual Question Answering

Slide credit Issey Masuda
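The merge-and-classify step of this pipeline can be sketched as follows. This is a generic illustration of the idea, not any specific paper's model: concatenation plus a dense layer is one common fusion choice, and all sizes, weights, and the answer list are invented for the example:

```python
import numpy as np

def vqa_predict(img_feat, q_embed, W_merge, W_ans, answers):
    """Merge image and question representations and classify over a fixed answer set."""
    merged = np.tanh(W_merge @ np.concatenate([img_feat, q_embed]))
    scores = W_ans @ merged
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                    # softmax over the answer vocabulary
    return answers[int(np.argmax(probs))]

rng = np.random.default_rng(3)
answers = ["kite", "yes", "no", "two"]      # toy closed answer set
img = rng.normal(size=16)                   # CNN visual features (toy size)
q = rng.normal(size=8)                      # question embedding (toy size)
W_merge = rng.normal(size=(12, 24))         # fuses the concatenated 16+8 vector
W_ans = rng.normal(size=(len(answers), 12))
pred = vqa_predict(img, q, W_merge, W_ans, answers)
```

Treating VQA as classification over a fixed answer set, rather than free-form decoding, is the design choice most of the baselines on the later slides share.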

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448×448; (2) take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the textual ("q") space.
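The reshape from the 512×14×14 pooling output to 196 region vectors, plus the projection by W, is a two-line numpy operation; the textual embedding size below (128) is an assumption for the sketch:

```python
import numpy as np

# Last VGG-19 pooling output for a 448x448 input: 512 channels on a 14x14 grid.
conv_features = np.random.default_rng(4).normal(size=(512, 14, 14))

# 196 local region vectors of 512 dimensions (one per spatial position).
regions = conv_features.reshape(512, 14 * 14).T        # shape (196, 512)

# Project each region into the textual ("q") space with a learned matrix W.
d_text = 128                                           # assumed embedding size
W = np.random.default_rng(5).normal(size=(512, d_text))
region_embeddings = regions @ W                        # shape (196, 128)
```

Each row of `region_embeddings` then plays the role of one "sentence" for the memory network.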

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Accuracy on the Visual Question Answering challenge (%):

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline (LSTM & CNN): 54.06

Baseline (Nearest neighbor): 42.85

Baseline (Prior per question type): 37.47

Baseline (All "yes"): 29.88

I. Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [Clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary:

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: how to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 17: Deep Learning for Computer Vision: Language and vision (UPC 2016)

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 18: Deep Learning for Computer Vision: Language and vision (UPC 2016)

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 19: Deep Learning for Computer Vision: Language and vision (UPC 2016)

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 20: Deep Learning for Computer Vision: Language and vision (UPC 2016)

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

(Figure labels: LSTM unit (2nd layer); Time; Image; t = 1 … t = T; hidden state at t = T; first chunk of data)

37

Visual Question Answering

[z1, z2, …, zN] [y1, y2, …, yM]

"Is economic growth decreasing?"

"Yes"

Encode / Encode

Decode

38

Extract visual features

Embedding

Merge → Predict answer

Question

What object is flying?

Answer: Kite

Visual Question Answering

Slide credit Issey Masuda
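The pipeline on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched with toy numpy weights; all layer sizes, the tanh merge, and the random initialization are illustrative assumptions, not the exact model:

```python
import numpy as np

# Sketch of a merge-style VQA model: project image features and the
# question embedding into a shared space, merge them, and classify
# over a fixed answer vocabulary with a softmax.
rng = np.random.default_rng(0)
D_IMG, D_TXT, D_MERGE, N_ANSWERS = 512, 300, 256, 1000  # assumed sizes

W_img = rng.standard_normal((D_IMG, D_MERGE)) * 0.01
W_txt = rng.standard_normal((D_TXT, D_MERGE)) * 0.01
W_out = rng.standard_normal((D_MERGE, N_ANSWERS)) * 0.01

def predict_answer(img_feat, question_emb):
    merged = np.tanh(img_feat @ W_img + question_emb @ W_txt)  # merge step
    scores = merged @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                       # softmax
    return int(np.argmax(probs)), probs

answer_id, probs = predict_answer(rng.standard_normal(D_IMG),
                                  rng.standard_normal(D_TXT))
print(answer_id)
```

With untrained random weights the prediction is meaningless; the point is the wiring: two modality-specific projections feeding one shared classifier.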

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the textual space of the question "q".
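The shape bookkeeping above can be checked directly; the embedding size `D_TEXT` and the random `W` are assumptions kept only for illustration:

```python
import numpy as np

# The last VGG-19 pooling layer yields a 512x14x14 tensor, which is
# flattened into 196 local region vectors of 512 dimensions and then
# projected into the textual space with a learned matrix W.
rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 14, 14))   # CNN output (C, H, W)
regions = feat.reshape(512, 14 * 14).T      # -> (196, 512) region vectors
D_TEXT = 300                                # assumed textual embedding size
W = rng.standard_normal((512, D_TEXT)) * 0.01
embedded = regions @ W                      # -> (196, 300) in "q"-space
print(regions.shape, embedded.shape)
```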

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description, Retrieval, and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

(Chart: accuracy on the Visual Question Answering challenge, in %)

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [Clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 26: Deep Learning for Computer Vision: Language and vision (UPC 2016)

26

Encoder-Decoder: Beyond text

27

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

28

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015

30

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015
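The Show & Tell recipe can be sketched as greedy decoding: a CNN encodes the image into a vector that conditions a recurrent language model, which emits the caption one word at a time until an end token. In this illustrative sketch, a hand-made next-word table (`TABLE`, `toy_next_word`) stands in for the trained CNN+LSTM; none of these names come from the paper.

```python
# Toy sketch of Show & Tell's decoding loop (stand-in "model", not the
# paper's LSTM): start from the image, repeatedly pick the most likely
# next word until <end>.

def decode_caption(image_vector, next_word, max_len=10):
    caption, state = [], ("<start>", tuple(image_vector))
    for _ in range(max_len):
        word = next_word(state)
        if word == "<end>":
            break
        caption.append(word)
        state = (word, state[1])  # carry the image conditioning along
    return " ".join(caption)

# Hypothetical next-word function for one image.
TABLE = {"<start>": "a", "a": "kite", "kite": "flying", "flying": "<end>"}
def toy_next_word(state):
    return TABLE[state[0]]

print(decode_caption([0.2, 0.7], toy_next_word))  # a kite flying
```

The real model replaces `toy_next_word` with a softmax over the vocabulary at each LSTM step; beam search is commonly used instead of this greedy loop.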

31

Captioning: LSTM for image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]

32

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

Captioning (+ Detection): DenseCap

33

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

34

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

36

Captioning: HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016

[Figure: a second-layer LSTM unit runs over time t = 1 … t = T; its hidden state at t = T summarizes the first chunk of image data]
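The hierarchical encoder idea can be sketched with stand-in reducers: a first-level encoder summarizes short chunks of the frame sequence, and a second-level encoder runs over those chunk summaries, shortening the path through time. Mean pooling below is only a placeholder for the paper's LSTM units.

```python
# Sketch of HRNE's two-level encoding (mean pooling stands in for LSTMs).

def chunk(seq, size):
    # Split the frame sequence into consecutive chunks.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def encode(vectors):
    # Stand-in for an LSTM encoder: per-dimension mean over the chunk.
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

frames = [[float(t)] for t in range(8)]          # 8 one-dimensional "frames"
level1 = [encode(c) for c in chunk(frames, 4)]   # one summary per chunk
video_code = encode(level1)                      # second-level summary
print(level1, video_code)  # [[1.5], [5.5]] [3.5]
```

With chunk size k, the second-level sequence is T/k long, so gradients travel far fewer recurrent steps than in a flat encoder over all T frames.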

37

Visual Question Answering

[Figure: the question "Is economic growth decreasing?" is encoded into [z1, z2, …, zN] and the image into [y1, y2, …, yM]; a decoder produces the answer "Yes"]

38

[Figure: VQA pipeline — extract visual features; embed the question ("What object is flying?"); merge both; predict the answer ("Kite")]

Visual Question Answering

Slide credit Issey Masuda
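The pipeline on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched end to end with toy components: a bag-of-words question embedding, pointwise-product fusion, and a linear answer classifier. All dimensions, weights, and the tiny vocabularies below are made up for illustration.

```python
# Toy VQA pipeline: visual features + question embedding -> merge -> answer.

def embed_question(question, vocab):
    # Bag-of-words question embedding: one slot per vocabulary word.
    words = question.lower().strip("?").split()
    return [1.0 if w in words else 0.0 for w in vocab]

def merge(img_feat, q_feat):
    # Pointwise-product fusion, a common merge choice in early VQA models.
    return [i * q for i, q in zip(img_feat, q_feat)]

def predict_answer(fused, answer_weights, answers):
    # Linear classifier over candidate answers.
    scores = [sum(w * f for w, f in zip(row, fused)) for row in answer_weights]
    return answers[scores.index(max(scores))]

vocab = ["what", "object", "is", "flying", "color"]
answers = ["kite", "red"]
# Hypothetical weights: the "kite" row responds to "flying"-related evidence.
answer_weights = [[0.0, 1.0, 0.0, 2.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0, 2.0]]
img_feat = [0.5, 0.9, 0.1, 0.8, 0.3]  # pretend CNN features

fused = merge(img_feat, embed_question("What object is flying?", vocab))
print(predict_answer(fused, answer_weights, answers))  # kite
```

Real systems replace each stand-in with a learned module (CNN features, an RNN question encoder, and a trained softmax classifier), but the merge-then-classify shape is the same.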

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)
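DPPnet's distinctive step is that a network conditioned on the question predicts the parameters of a "dynamic" fully-connected layer; because that layer is large, Noh et al. use a hashing trick so a small predicted vector can populate the full weight matrix. A minimal sketch of that hashing step follows; the md5-based hash and the dimensions are illustrative choices, not the paper's.

```python
# Sketch of DPPnet's hashed weight sharing: each weight W[i][j] of the
# dynamic layer is tied to one entry of a small question-predicted
# candidate-parameter vector, chosen by a fixed hash of (i, j).

import hashlib

def dynamic_weights(candidate_params, n_out, n_in):
    W = []
    for i in range(n_out):
        row = []
        for j in range(n_in):
            h = int(hashlib.md5(f"{i},{j}".encode()).hexdigest(), 16)
            row.append(candidate_params[h % len(candidate_params)])
        W.append(row)
    return W

# Pretend the question encoder predicted these 4 candidate parameters.
params = [0.1, -0.2, 0.3, 0.05]
W = dynamic_weights(params, n_out=3, n_in=5)
print(len(W), len(W[0]))  # 3 5
```

The hash is fixed, so gradients for every tied weight flow back into the same small predicted vector, keeping the parameter-prediction network tractable.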

40

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016)

41

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features to the "q" textual space.
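The feature-extraction steps above amount to index bookkeeping: flatten the 14×14 grid of the 512×14×14 VGG-19 map into 196 region vectors of dimension 512, then project each one linearly into the textual space. The dummy features and the 3-dimensional stand-in for the learned matrix W are illustrative only.

```python
# Turn a 512x14x14 feature map into 196 region vectors and project them.

D, H, Wd = 512, 14, 14   # channels, height, width of the VGG-19 pool5 map

def to_region_vectors(feat):
    # feat[c][y][x] -> list of H*W region vectors, each of length D.
    return [[feat[c][y][x] for c in range(D)]
            for y in range(H) for x in range(Wd)]

def project(region, W_embed):
    # Linear projection of one region into the textual embedding space.
    return [sum(w * r for w, r in zip(row, region)) for row in W_embed]

feat = [[[0.01] * Wd for _ in range(H)] for _ in range(D)]  # dummy features
regions = to_region_vectors(feat)
print(len(regions), len(regions[0]))  # 196 512

W_embed = [[0.001] * D for _ in range(3)]  # stand-in for learned W (3-d "q" space)
q_space = project(regions[0], W_embed)
print(len(q_space))  # 3
```

In the DMN+ itself the 196 projected vectors then play the role that encoded sentences play in the textual model, feeding the attention-based episodic memory.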

42

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016

43

Datasets: Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016)

44

Datasets: Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge: Storytelling

Storytelling

47

Challenge: Movie Description

Movie Description: Retrieval and Fill-in-the-blank

48

Challenges: Movie Question Answering

Movie Question Answering

49

Challenges: Visual Question Answering

Visual Question Answering

50

Accuracy (%) on the Visual Question Answering challenge:

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline, LSTM & CNN: 54.06

Baseline, Nearest neighbor: 42.85

Baseline, Prior per question type: 37.47

Baseline, All yes: 29.88

I. Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [clean code in Keras, perfect for beginners]

Challenges: Visual Question Answering

51

Summary:

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 27: Deep Learning for Computer Vision: Language and vision (UPC 2016)

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 28: Deep Learning for Computer Vision: Language and vision (UPC 2016)

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 29: Deep Learning for Computer Vision: Language and vision (UPC 2016)

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 30: Deep Learning for Computer Vision: Language and vision (UPC 2016)

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 31: Deep Learning for Computer Vision: Language and vision (UPC 2016)

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 32: Deep Learning for Computer Vision: Language and vision (UPC 2016)

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

VQA Challenge accuracy (%):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline (LSTM & CNN): 54.06
I. Masuda-Mora: 53.62
Baseline (nearest neighbor): 42.85
Baseline (prior per question type): 37.47
Baseline (all "yes"): 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi


Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 45: Deep Learning for Computer Vision: Language and vision (UPC 2016)

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 46: Deep Learning for Computer Vision: Language and vision (UPC 2016)

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 47: Deep Learning for Computer Vision: Language and vision (UPC 2016)

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 48: Deep Learning for Computer Vision: Language and vision (UPC 2016)

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 49: Deep Learning for Computer Vision: Language and vision (UPC 2016)

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 50: Deep Learning for Computer Vision: Language and vision (UPC 2016)

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 51: Deep Learning for Computer Vision: Language and vision (UPC 2016)

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 52: Deep Learning for Computer Vision: Language and vision (UPC 2016)

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 53: Deep Learning for Computer Vision: Language and vision (UPC 2016)

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 54: Deep Learning for Computer Vision: Language and vision (UPC 2016)

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi