Deep Learning for Computer Vision: Language and Vision (UPC 2016)

Day 4 Lecture 3: Language and Vision
Xavier Giró-i-Nieto

Posted 17-Jan-2017


2

Acknowledgments

Santi Pascual

3

In lecture D2L6 RNNs

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Ñeco, R.P. and Forcada, M.L. (1997, June). "Asynchronous translations with recurrent neural nets." In International Conference on Neural Networks 1997 (Vol. 4, pp. 2535-2540). IEEE.

6

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

Representation or Embedding

For clarity, let's study a Neural Machine Translation (NMT) case.

7

Encoder: One-hot encoding

One-hot encoding: a binary representation of the words in a vocabulary in which the only allowed codes have a single hot (1) bit, with all other bits cold (0).

Word    Binary    One-hot encoding

zero    00        0001

one     01        0010

two     10        0100

three   11        1000

8

Encoder: One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word        One-hot encoding

economic    000010

growth      001000

has         100000

slowed      000001

Encoder: One-hot encoding

One-hot is a very simple representation: every word is equidistant from every other word.

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
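The two tables above can be reproduced in a few lines of plain Python; the four-word vocabulary below is the slide's own example.

```python
def one_hot(word, vocab):
    # K-dimensional one-hot vector for `word`, with K = len(vocab)
    return [1.0 if w == word else 0.0 for w in vocab]

vocab = ["economic", "growth", "has", "slowed"]
print(one_hot("growth", vocab))   # [0.0, 1.0, 0.0, 0.0]

# Every pair of distinct words is equidistant (Euclidean distance sqrt(2)):
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print(dist(one_hot("economic", vocab), one_hot("slowed", vocab)))  # 1.4142...
```

This equidistance is exactly why one-hot vectors carry no notion of word similarity, motivating the continuous projection on the next slides.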

10

Encoder: Projection to continuous space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

s_i = E w_i

The K-dimensional one-hot vector w_i is linearly projected to a space of lower dimension M (typically 100-500) by a matrix E of learned weights.

11

Encoder: Projection to continuous space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

s_i = E w_i

The projection matrix E corresponds to a fully connected layer, so its parameters are learned during training.
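Because w_i is one-hot, multiplying by E simply selects one column of E, i.e. an embedding lookup. A small sketch; the 2x4 weight matrix below is made up for illustration.

```python
# Projecting a one-hot vector with a learned matrix E is just a lookup.
# Illustrative sizes: K=4 vocabulary words, M=2 continuous dimensions.
E = [[0.1, 0.5, -0.3, 0.9],   # M x K projection matrix (made-up weights)
     [0.7, -0.2, 0.4, 0.0]]

def project(E, w):
    # s = E w  (matrix-vector product)
    return [sum(e_j * w_j for e_j, w_j in zip(row, w)) for row in E]

w = [0.0, 1.0, 0.0, 0.0]      # one-hot vector for word index 1
s = project(E, w)
print(s)                       # [0.5, -0.2]  == column 1 of E
```

In a real framework the lookup is done directly by index (e.g. an embedding layer) instead of a full matrix product, but the result is identical.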

12

Encoder: Projection to continuous space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

The encoder maps a sequence of words to a sequence of continuous-space word representations.

13

Encoder: Recurrence

Sequence

Figure: Christopher Olah, "Understanding LSTM Networks" (2015)

14

Encoder: Recurrence

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

15

Encoder: Recurrence

Figure: the recurrence unrolled over time, shown both in front view and in side view (rotated 90°).

16

Encoder: Recurrence

The hidden state after the last word is the representation (embedding) of the whole sentence.
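The recurrence can be sketched with a plain (non-gated) RNN update h_t = tanh(W h_{t-1} + U s_t); the weights and word vectors below are made up, and the final h_t plays the role of the sentence embedding.

```python
import math

def rnn_encode(seq, W, U):
    # h_t = tanh(W h_{t-1} + U s_t); the final h is the sentence embedding.
    h = [0.0] * len(W)
    for s in seq:
        h = [math.tanh(sum(W[i][j] * h[j] for j in range(len(h)))
                       + sum(U[i][j] * s[j] for j in range(len(s))))
             for i in range(len(W))]
    return h

# Made-up 2x2 weights and a 3-word sequence of 2-d word vectors:
W = [[0.5, -0.1], [0.2, 0.3]]
U = [[1.0, 0.0], [0.0, 1.0]]
seq = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
embedding = rnn_encode(seq, W, U)
print(embedding)   # a 2-d sentence embedding
```

Real encoders use gated units (LSTM/GRU, as in the cited slides) for long sequences, but the state-update structure is the same.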

17

Sentence Embedding

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.

Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.

18

(Word Embeddings)

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product:

e(k) = w_k^T z_i

where z_i is the RNN internal state and w_k holds the neuron weights for word k.

21

Decoder

Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.

...and finally normalize the scores to word probabilities with a softmax:

p(w_i = k | w_{<i}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the probability that the i-th word is word k is conditioned on the previous words and the hidden state.
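The scoring and softmax steps can be sketched as follows; the 3-word vocabulary, state, and weights are made up for illustration.

```python
import math

def softmax(scores):
    # Normalize scores to probabilities: p(k) = exp(e_k) / sum_j exp(e_j)
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def word_scores(z_i, Wout):
    # Dot-product score for each vocabulary word k: e(k) = w_k . z_i
    return [sum(w * z for w, z in zip(w_k, z_i)) for w_k in Wout]

z_i = [0.2, -0.1, 0.4]                   # decoder internal state (made up)
Wout = [[1.0, 0.0, 0.0],                 # one weight row per vocabulary word
        [0.0, 1.0, 0.0],
        [0.5, 0.5, 0.5]]
probs = softmax(word_scores(z_i, Wout))
print(probs)                              # three probabilities summing to 1.0
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow for large scores.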

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words are generated for the decoded sentence until an <EOS> (End Of Sentence) "word" is predicted.

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder: Training

Training requires a dataset of pairs of sentences in the two languages to translate.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.

25

Encoder-Decoder: Seq2Seq

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.

26

Encoder-Decoder: Beyond text

27

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

28

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

The Multimodal Recurrent Neural Network only takes the image features into account in the first hidden state.

29

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

30

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

31

Captioning: LSTM for image & video

Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]

32

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

33

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

34

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

36

Captioning: HRNE

(Slides by Marc Bolaños) Pan, Pingbo, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

Figure: an LSTM unit in the 2nd layer reads, over time, the hidden state at t = T produced by the first-layer LSTM for each chunk of data (from t = 1 to t = T), starting with the first chunk.

37

Visual Question Answering

Both the image (encoded as [z1, z2, ..., zN]) and the question "Is economic growth decreasing?" (encoded as [y1, y2, ..., yM]) are encoded; the answer "Yes" is then decoded.

38

Pipeline: extract visual features from the image, embed the question, merge both representations, and predict the answer.

Question: "What object is flying?"  Answer: "Kite"

Visual Question Answering

Slide credit Issey Masuda
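A schematic of the merge step in the pipeline above; the feature vectors, the tiny answer vocabulary, and the concatenate-then-classify design are illustrative stand-ins, not the exact architecture from the slides.

```python
# Toy VQA merge: concatenate image features and question embedding,
# then score a small answer vocabulary with a single linear layer.
ANSWERS = ["kite", "dog", "car"]

def predict_answer(img_feat, q_emb, W):
    merged = img_feat + q_emb                 # concatenation of the two lists
    scores = [sum(w * x for w, x in zip(row, merged)) for row in W]
    return ANSWERS[scores.index(max(scores))]

img_feat = [0.9, 0.1]       # made-up visual features
q_emb = [0.2, 0.8]          # made-up question embedding
W = [[1.0, 0.0, 0.0, 1.0],  # one row of weights per candidate answer
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
print(predict_answer(img_feat, q_emb, W))   # kite
```

Treating VQA as classification over a fixed answer vocabulary, as here, is the common baseline setup; the papers that follow refine how the two modalities are merged.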

39

Visual Question Answering

Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

41

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and treat each region as equivalent to a sentence.

Local region feature extraction: a CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions each.

Visual feature embedding: a matrix W projects the image features into the "q" textual space.
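The 512x14x14 → 196x512 reorganization above is just a reshape plus transpose; a sketch with a tiny 2x3x3 stand-in tensor (the real VGG-19 output is 512x14x14).

```python
# Turn a DxHxW CNN feature map into H*W region vectors of dimension D.
def to_region_vectors(fmap):
    D, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[d][i][j] for d in range(D)]   # one D-dim vector per location
            for i in range(H) for j in range(W)]

# Tiny stand-in: D=2 channels over a 3x3 grid (VGG-19 gives 512 over 14x14).
fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        [[10, 20, 30], [40, 50, 60], [70, 80, 90]]]
regions = to_region_vectors(fmap)
print(len(regions), len(regions[0]))   # 9 2   (196 512 for VGG-19)
print(regions[0])                       # [1, 10]
```

Each resulting row is one local region vector, which the model then projects with W into the textual space, exactly as the slide describes.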

42

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets: Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets: Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge: Storytelling

Storytelling

47

Challenge: Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenge: Movie Question Answering

Movie Question Answering

49

Challenge: Visual Question Answering

Visual Question Answering

50

Results on the VQA challenge (accuracy, %):

Humans                              83.30

UC Berkeley & Sony                  66.47

Baseline: LSTM & CNN                54.06

I. Masuda-Mora                      53.62

Baseline: Nearest neighbor          42.85

Baseline: Prior per question type   37.47

Baseline: All "yes"                 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners!]

Challenge: Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers: a great topic for your thesis.

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: how to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi / ProfessorXavi

Page 2: Deep Learning for Computer Vision: Language and vision (UPC 2016)

2

Acknowledgments

Santi Pascual

3

In lecture D2L6 RNNs

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 3: Deep Learning for Computer Vision: Language and vision (UPC 2016)

3

In lecture D2L6 RNNs

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 4: Deep Learning for Computer Vision: Language and vision (UPC 2016)

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}
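The state update z_i = f(z_{i-1}, u_{i-1}, h_T) can be sketched as below. All sizes and weight matrices are assumed toy values, and a simple tanh combination stands in for the GRU used in the actual paper.

```python
import numpy as np

rng = np.random.default_rng(6)
h_dim, d = 5, 4                                  # state dim, word dim (assumed)
W = rng.normal(scale=0.1, size=(h_dim, h_dim))   # weights for z_{i-1}
U = rng.normal(scale=0.1, size=(h_dim, d))       # weights for u_{i-1}
C = rng.normal(scale=0.1, size=(h_dim, h_dim))   # weights for h_T

h_T = rng.normal(size=h_dim)     # sentence embedding from the encoder
u_prev = rng.normal(size=d)      # embedding of the previously emitted word
z_prev = np.zeros(h_dim)         # previous decoder state

# z_i combines all three sources of information:
z_i = np.tanh(W @ z_prev + U @ u_prev + C @ h_T)
assert z_i.shape == (h_dim,)
```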

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product:

e(k) = w_k · z_i

where z_i is the RNN internal state and w_k are the neuron weights for word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

...and finally normalize to word probabilities with a softmax:

p(u_i = k | u_1, ..., u_{i-1}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the left-hand side is the probability that the i-th word is word k, given the previous words and the hidden state
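The dot-product scoring and softmax normalization described above can be sketched as follows (toy sizes, random weights assumed): stacking all the w_k rows into one matrix scores every word at once, and the softmax turns the scores into a valid probability distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
K, h = 8, 5                        # vocabulary size, state dim (assumed)
W_out = rng.normal(size=(K, h))    # one weight row w_k per word k
z = rng.normal(size=h)             # current decoder internal state z_i

e = W_out @ z                      # e(k) = w_k . z_i, for all k at once
p = np.exp(e - e.max())            # softmax, shifted for numerical stability
p /= p.sum()

assert np.isclose(p.sum(), 1.0)    # probabilities over the vocabulary
```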

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted

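The generate-until-<EOS> loop can be sketched as below. The vocabulary, the random weights, and the simplified state update are all toy assumptions; a hard length cap guards against <EOS> never being predicted.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["<EOS>", "economic", "growth", "has", "slowed"]  # assumed vocabulary
K, h = len(vocab), 4
W_out = rng.normal(size=(K, h))          # output scoring weights
W_z = rng.normal(scale=0.1, size=(h, h)) # stand-in for the real state update

z = rng.normal(size=h)                   # initial decoder state
words, max_len = [], 10
for _ in range(max_len):                 # cap in case <EOS> never wins
    k = int((W_out @ z).argmax())        # greedy: pick highest-scoring word
    if vocab[k] == "<EOS>":
        break                            # stop decoding at <EOS>
    words.append(vocab[k])
    z = np.tanh(W_z @ z)                 # advance the decoder state
```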

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder Training

Dataset of pairs of sentences in the two languages to translate

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015 [code]

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. CVPR 2016

(Figure: over time, from t = 1 to t = T, a 2nd-layer LSTM unit reads the hidden state at t = T produced by the first layer for each chunk of image data.)

37

Visual Question Answering

[z_1, z_2, ..., z_N] [y_1, y_2, ..., y_M]

"Is economic growth decreasing?"

"Yes"

Encode / Encode

Decode

38

Extract visual features

Embedding

Merge / Predict answer

Question: "What object is flying?"

Answer: "Kite"

Visual Question Answering

Slide credit Issey Masuda
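The pipeline on the slide (extract visual features, embed the question, merge, predict an answer) can be sketched as below. All sizes, the random weights, and the merge-by-concatenation choice are assumptions for illustration, not the exact model from the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
img_feat = rng.normal(size=4096)   # e.g. a CNN fc-layer vector (assumed)
q_emb = rng.normal(size=300)       # question embedding (assumed)

n_answers = 1000                   # answer vocabulary size (assumed)
W = rng.normal(scale=0.01, size=(n_answers, 4096 + 300))

merged = np.concatenate([img_feat, q_emb])  # merge the two modalities
scores = W @ merged                         # linear answer classifier
answer = int(scores.argmax())               # predicted answer index
assert 0 <= answer < n_answers
```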

39

Visual Question Answering

Noh, H., Seo, P. H., and Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. arXiv preprint arXiv:1603.01417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence

Local Region Feature Extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions

Visual feature embedding: a matrix W projects image features to the "q" (textual) space
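The local-region step described above can be sketched as follows: the 512x14x14 pooling output is reshaped into 196 region vectors of 512 dimensions, then each is projected to the textual space with a matrix W (the textual dimension and the random weights here are assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)
feats = rng.normal(size=(512, 14, 14))   # last pooling layer of VGG-19

# Each of the 14x14 spatial positions becomes one 512-d region vector.
regions = feats.reshape(512, 196).T
assert regions.shape == (196, 512)

q_dim = 100                              # textual space dim (assumed)
W = rng.normal(scale=0.01, size=(q_dim, 512))  # learned in practice
projected = regions @ W.T                # embed each region in "q" space
assert projected.shape == (196, q_dim)
```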

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded Question Answering in Images. CVPR 2016

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Results (accuracy, %): Humans 83.30; UC Berkeley & Sony 66.47; Baseline LSTM & CNN 54.06; Baseline Nearest neighbor 42.85; Baseline Prior per question type 37.47; Baseline All yes 29.88; I. Masuda-Mora 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering". Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

httpsimatgeupceduwebpeoplexavier-giro

DocXavi / ProfessorXavi

Page 5: Deep Learning for Computer Vision: Language and vision (UPC 2016)

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 6: Deep Learning for Computer Vision: Language and vision (UPC 2016)

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 7: Deep Learning for Computer Vision: Language and vision (UPC 2016)

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the "q" textual space.
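The region-extraction and embedding steps can be sketched with array reshapes. The 448x448 input, 512x14x14 feature map, and 196 region vectors come from the slide; the textual dimension d_q and the random matrix W are assumptions standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for VGG-19's last pooling output on a 448x448 input: 512 x 14 x 14.
feat_map = rng.normal(size=(512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of dimension 512.
regions = feat_map.reshape(512, 14 * 14).T   # shape (196, 512)

# Project each region into the textual ("q") space with a matrix W.
d_q = 300                                    # assumed textual dimension
W = 0.01 * rng.normal(size=(512, d_q))
regions_q = regions @ W                      # shape (196, 300)
print(regions.shape, regions_q.shape)
```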

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description, Retrieval, and Fill-in-the-Blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

VQA accuracy (%):

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline LSTM & CNN: 54.06

Baseline Nearest neighbor: 42.85

Baseline Prior per question type: 37.47

Baseline All yes: 29.88

Model from the thesis below: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 11: Deep Learning for Computer Vision: Language and vision (UPC 2016)

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan Zhongwen Xu Yi Yang Fei Wu Yueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden state at t = T

first chunk of data
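The hierarchical encoding sketched above (a first-layer LSTM runs over chunks of consecutive frames; a second-layer LSTM runs over the chunk-final hidden states at t = T) can be illustrated with plain RNN cells in NumPy; chunk size, dimensions, and weights are toy assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4                                          # toy feature/hidden dimension
W1, U1 = rng.normal(0, 0.3, (d, d)), rng.normal(0, 0.3, (d, d))  # layer 1
W2, U2 = rng.normal(0, 0.3, (d, d)), rng.normal(0, 0.3, (d, d))  # layer 2

def rnn(xs, W, U):
    """Plain RNN pass; return the hidden state at the last time step."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h

def hierarchical_encode(frames, chunk=3):
    # First layer: one RNN pass per chunk of consecutive frames.
    chunk_states = [rnn(frames[i:i + chunk], W1, U1)
                    for i in range(0, len(frames), chunk)]
    # Second layer: an RNN over the chunk-level states gives the video embedding.
    return rnn(chunk_states, W2, U2)

video = rng.normal(size=(9, d))                # 9 toy frame features
z = hierarchical_encode(video)
print(z.shape)
```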

37

Visual Question Answering

[z1, z2, …, zN] [y1, y2, …, yM]

“Is economic growth decreasing?”

“Yes”

Encode Encode

Decode

38

Extract visual features

Embedding

Merge

Predict answer

Question: What object is flying?

Answer: Kite

Visual Question Answering

Slide credit Issey Masuda
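A minimal sketch of this merge-style VQA pipeline, assuming a pointwise-product merge (concatenation is an equally common choice); all names and sizes are illustrative, and a trained model would learn these weights:

```python
import numpy as np

rng = np.random.default_rng(2)
V_q, d, feat_dim, n_answers = 20, 8, 16, 5   # toy sizes
E = rng.normal(0, 0.1, (V_q, d))             # question word embeddings
W, U = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
Wv = rng.normal(0, 0.1, (feat_dim, d))       # visual features -> question space
Wa = rng.normal(0, 0.1, (n_answers, d))      # answer classifier

def answer(question_ids, cnn_feature):
    h = np.zeros(d)
    for t in question_ids:                   # encode the question with an RNN
        h = np.tanh(E[t] @ W + h @ U)
    v = np.tanh(cnn_feature @ Wv)            # embed the image
    merged = h * v                           # merge (pointwise product here)
    scores = Wa @ merged                     # predict the answer class
    return int(scores.argmax())

a = answer([4, 7, 2], rng.normal(size=feat_dim))
print(a)
```

Treating the answer as a classification over a fixed set (rather than decoding free text) is the common simplification in these pipelines.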

39

Visual Question Answering

Noh H Seo P H & Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local Region Feature Extraction: CNN (VGG-19). (1) Rescale input to 448x448. (2) Take output from last pooling layer → D=512x14x14 → 196 512-d local region vectors

Visual feature embedding: W matrix to project image features to the “q” textual space
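The region-extraction step reduces to a reshape plus a learned projection; a sketch with a random tensor standing in for the VGG-19 pooling output (the textual dimension and the tanh are illustrative assumptions):

```python
import numpy as np

# Stand-in for the VGG-19 last-pooling output given a 448x448 input.
conv_out = np.random.default_rng(3).normal(size=(512, 14, 14))

# The 14x14 spatial grid becomes 196 local region vectors of dimension 512.
regions = conv_out.reshape(512, 14 * 14).T          # shape (196, 512)

d_text = 8                                          # toy textual dimension
W = np.random.default_rng(4).normal(size=(512, d_text))
regions_in_q_space = np.tanh(regions @ W)           # project into the "q" space

print(regions.shape, regions_in_q_space.shape)
```

Each of the 196 vectors can then be fed to the memory network as if it were one "sentence" of the visual input.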

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-Fei Visual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Accuracy on the Visual Question Answering challenge (out of 100.0):

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline LSTM&CNN: 54.06

Baseline Nearest neighbor: 42.85

Baseline Prior per question type: 37.47

Baseline All yes: 29.88

I. Masuda-Mora, “Open-Ended Visual Question-Answering”: 53.62. Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI’s image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 12: Deep Learning for Computer Vision: Language and vision (UPC 2016)

12

Encoder Projection to continuous space

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Sequence of continuous-space word representations

Sequence of words
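Multiplying a one-hot word vector by the learned matrix E just selects one column of E, which is why embeddings are implemented in practice as a table lookup; a tiny sketch (sizes are toy):

```python
import numpy as np

K, d = 6, 3                       # vocabulary size and embedding dimension (toy)
E = np.arange(K * d, dtype=float).reshape(d, K)   # "learned" projection matrix

def one_hot(i, K):
    v = np.zeros(K)
    v[i] = 1.0
    return v

w = one_hot(4, K)                 # one-hot word vector
s = E @ w                         # continuous representation s_i = E w_i

# The product equals column 4 of E: the matmul is just a lookup.
assert np.allclose(s, E[:, 4])
print(s)
```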

Page 13: Deep Learning for Computer Vision: Language and vision (UPC 2016)

13

Encoder Recurrence

Sequence

Figure: Christopher Olah, “Understanding LSTM Networks” (2015)
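The encoder recurrence is one weight-tied update per time step, e.g. h_t = tanh(W x_t + U h_{t-1}) in the simplest RNN cell (an LSTM replaces this update with a gated one); a toy NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, d_h = 4, 3                      # toy input and hidden dimensions
W = rng.normal(0, 0.5, (d_h, d_in))   # input-to-hidden weights
U = rng.normal(0, 0.5, (d_h, d_h))    # hidden-to-hidden weights

def encode(xs):
    h = np.zeros(d_h)                 # h_0
    states = []
    for x in xs:                      # the same W, U are reused at every step
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

states = encode(rng.normal(size=(5, d_in)))
print(len(states), states[-1].shape)
```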

Page 14: Deep Learning for Computer Vision: Language and vision (UPC 2016)

14

Encoder Recurrence

Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)

Page 15: Deep Learning for Computer Vision: Language and vision (UPC 2016)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90°

Page 16: Deep Learning for Computer Vision: Language and vision (UPC 2016)

16

Encoder: Recurrence

[Figure: the unrolled encoder RNN, shown from the front and, rotated 90°, from the side. The final hidden state is the representation (embedding) of the sentence.]

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
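The 2-dimensional PCA used to visualize these clusters can be sketched in a few lines of numpy; the data below is synthetic and simply stands in for LSTM hidden states (two artificial "meaning" clusters):

```python
import numpy as np

def pca_2d(states):
    """Project row vectors onto their first two principal components."""
    centered = states - states.mean(axis=0)          # remove the mean embedding
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                       # one (x, y) point per sentence

# Toy "sentence embeddings": two meaning clusters in a 5-d state space.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(10, 5))
cluster_b = rng.normal(loc=2.0, scale=0.1, size=(10, 5))
coords = pca_2d(np.vstack([cluster_a, cluster_b]))   # shape (20, 2), ready to plot
```

Plotting `coords` would show the two clusters well separated along the first principal component, which is the effect the slide describes for real LSTM states.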

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and its previous internal state z_{i-1}.

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product:

e(k) = w_k · z_i

where z_i is the RNN internal state and w_k are the neuron weights for word k.

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

... and finally normalize to word probabilities with a softmax:

p(w_i = k | w_{<i}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the probability that the i-th word is word k is conditioned on the previous words and on the hidden state.
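The scoring and softmax steps can be sketched directly in numpy; the state size and vocabulary size below are arbitrary toy values:

```python
import numpy as np

def word_probabilities(z_i, W):
    """Score every vocabulary word against the decoder state and normalize.

    z_i : decoder internal state, shape (d,)
    W   : one weight vector per vocabulary word, shape (V, d)
    """
    scores = W @ z_i                 # e(k) = w_k . z_i, one dot product per word
    scores -= scores.max()           # stabilize the exponentials
    probs = np.exp(scores)
    return probs / probs.sum()       # softmax: probabilities sum to 1

rng = np.random.default_rng(1)
z = rng.normal(size=8)               # toy decoder state
W = rng.normal(size=(1000, 8))       # toy vocabulary of 1000 words
p = word_probabilities(z, W)         # distribution over the next word
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow, a standard trick when implementing softmax.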

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words of the decoded sentence are generated until an <EOS> (End of Sentence) "word" is predicted.
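The whole generation loop — update the state from the previous word, pick the most probable word, stop at <EOS> — can be sketched as a toy greedy decoder. This is a minimal illustration, not the paper's architecture: a plain tanh recurrence stands in for the real decoder cell, and all weight shapes are invented for the example:

```python
import numpy as np

def greedy_decode(h, A, E, C, W, eos=0, max_len=20):
    """Toy greedy decoder: update the state, emit the argmax word, stop at <EOS>.

    h: sentence embedding from the encoder.
    A: state-to-state weights; E: word-embedding table (one row per word);
    C: projects the sentence embedding into the state space; W: output weights.
    """
    z = np.tanh(C @ h)                     # initial state from the sentence embedding
    u = eos                                # start from a dummy previous word
    words = []
    for _ in range(max_len):
        z = np.tanh(A @ z + E[u] + C @ h)  # z_i from z_{i-1}, u_{i-1} and h
        u = int(np.argmax(W @ z))          # most probable next word
        if u == eos:                       # <EOS> ends the sentence
            break
        words.append(u)
    return words

rng = np.random.default_rng(0)
d, dh, V = 6, 4, 12                        # state size, embedding size, vocabulary size
A, E = rng.normal(size=(d, d)), rng.normal(size=(V, d))
C, W = rng.normal(size=(d, dh)), rng.normal(size=(V, d))
sentence = greedy_decode(rng.normal(size=dh), A, E, C, W)  # list of word indices
```

With untrained random weights the output is of course meaningless; the point is the control flow: generation is open-ended and terminates on the predicted <EOS> token or a length cap.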

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder: Training

Dataset of pairs of sentences in the two languages to translate.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning: Show & Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning: Show & Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning: LSTM for image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 (code).

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

[Figure: a first-layer LSTM runs over the video frames from t = 1 to t = T; its hidden state at t = T summarizes the first chunk of data and is fed to an LSTM unit in the 2nd layer, which runs over the chunk summaries through time.]
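The hierarchical idea — encode each chunk of frames, then encode the chunk summaries — can be sketched compactly. This is a toy illustration: a plain tanh RNN stands in for the LSTM units, and all sizes are invented:

```python
import numpy as np

def rnn_encode(xs, Wx, Wh):
    """Simple tanh RNN standing in for an LSTM: return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def hierarchical_encode(frames, chunk_len, Wx1, Wh1, Wx2, Wh2):
    """HRNE idea: summarize each chunk, then encode the sequence of summaries."""
    summaries = [rnn_encode(frames[i:i + chunk_len], Wx1, Wh1)
                 for i in range(0, len(frames), chunk_len)]
    return rnn_encode(summaries, Wx2, Wh2)       # video-level representation

rng = np.random.default_rng(2)
frames = rng.normal(size=(12, 5))                # 12 frame features of dimension 5
Wx1, Wh1 = rng.normal(size=(8, 5)), rng.normal(size=(8, 8))
Wx2, Wh2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
video_vec = hierarchical_encode(frames, 4, Wx1, Wh1, Wx2, Wh2)  # shape (8,)
```

The second layer sees one vector per chunk rather than one per frame, which is what shortens the temporal paths in the real HRNE.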

37

Visual Question Answering

[Figure: the image and the question "Is economic growth decreasing?" are each encoded, into [z1, z2, …, zN] and [y1, y2, …, yM]; the two encodings are then decoded together into the answer "Yes".]

38

Pipeline: extract visual features from the image, embed the question, merge the two representations, and predict the answer. Example question: "What object is flying?" Answer: "Kite".

Visual Question Answering

Slide credit Issey Masuda
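The merge-and-classify step of this pipeline can be sketched as follows. This is a generic illustration of the idea, not any specific paper's model: concatenation plus a dense layer is one common fusion choice, and all sizes, weights, and the answer list are invented for the example:

```python
import numpy as np

def vqa_predict(img_feat, q_embed, W_merge, W_ans, answers):
    """Merge image and question representations and classify over a fixed answer set."""
    merged = np.tanh(W_merge @ np.concatenate([img_feat, q_embed]))
    scores = W_ans @ merged
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                    # softmax over the answer vocabulary
    return answers[int(np.argmax(probs))]

rng = np.random.default_rng(3)
answers = ["kite", "yes", "no", "two"]      # toy closed answer set
img = rng.normal(size=16)                   # CNN visual features (toy size)
q = rng.normal(size=8)                      # question embedding (toy size)
W_merge = rng.normal(size=(12, 24))         # fuses the concatenated 16+8 vector
W_ans = rng.normal(size=(len(answers), 12))
pred = vqa_predict(img, q, W_merge, W_ans, answers)
```

Treating VQA as classification over a fixed answer set, rather than free-form decoding, is the design choice most of the baselines on the later slides share.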

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448×448; (2) take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the textual ("q") space.
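The reshape from the 512×14×14 pooling output to 196 region vectors, plus the projection by W, is a two-line numpy operation; the textual embedding size below (128) is an assumption for the sketch:

```python
import numpy as np

# Last VGG-19 pooling output for a 448x448 input: 512 channels on a 14x14 grid.
conv_features = np.random.default_rng(4).normal(size=(512, 14, 14))

# 196 local region vectors of 512 dimensions (one per spatial position).
regions = conv_features.reshape(512, 14 * 14).T        # shape (196, 512)

# Project each region into the textual ("q") space with a learned matrix W.
d_text = 128                                           # assumed embedding size
W = np.random.default_rng(5).normal(size=(512, d_text))
region_embeddings = regions @ W                        # shape (196, 128)
```

Each row of `region_embeddings` then plays the role of one "sentence" for the memory network.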

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Accuracy on the Visual Question Answering challenge (%):

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline (LSTM & CNN): 54.06

Baseline (Nearest neighbor): 42.85

Baseline (Prior per question type): 37.47

Baseline (All "yes"): 29.88

I. Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [Clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary:

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: how to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 17: Deep Learning for Computer Vision: Language and vision (UPC 2016)

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 18: Deep Learning for Computer Vision: Language and vision (UPC 2016)

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 19: Deep Learning for Computer Vision: Language and vision (UPC 2016)

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 20: Deep Learning for Computer Vision: Language and vision (UPC 2016)

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

(Figure labels: LSTM unit (2nd layer); Time; Image; t = 1 … t = T; hidden state at t = T; first chunk of data)

37

Visual Question Answering

[z1, z2, …, zN] [y1, y2, …, yM]

"Is economic growth decreasing?"

"Yes"

Encode / Encode

Decode

38

Extract visual features

Embedding

Merge → Predict answer

Question

What object is flying?

Answer: Kite

Visual Question Answering

Slide credit Issey Masuda
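The pipeline on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched with toy numpy weights; all layer sizes, the tanh merge, and the random initialization are illustrative assumptions, not the exact model:

```python
import numpy as np

# Sketch of a merge-style VQA model: project image features and the
# question embedding into a shared space, merge them, and classify
# over a fixed answer vocabulary with a softmax.
rng = np.random.default_rng(0)
D_IMG, D_TXT, D_MERGE, N_ANSWERS = 512, 300, 256, 1000  # assumed sizes

W_img = rng.standard_normal((D_IMG, D_MERGE)) * 0.01
W_txt = rng.standard_normal((D_TXT, D_MERGE)) * 0.01
W_out = rng.standard_normal((D_MERGE, N_ANSWERS)) * 0.01

def predict_answer(img_feat, question_emb):
    merged = np.tanh(img_feat @ W_img + question_emb @ W_txt)  # merge step
    scores = merged @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                       # softmax
    return int(np.argmax(probs)), probs

answer_id, probs = predict_answer(rng.standard_normal(D_IMG),
                                  rng.standard_normal(D_TXT))
print(answer_id)
```

With untrained random weights the prediction is meaningless; the point is the wiring: two modality-specific projections feeding one shared classifier.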

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the textual space of the question "q".
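The shape bookkeeping above can be checked directly; the embedding size `D_TEXT` and the random `W` are assumptions kept only for illustration:

```python
import numpy as np

# The last VGG-19 pooling layer yields a 512x14x14 tensor, which is
# flattened into 196 local region vectors of 512 dimensions and then
# projected into the textual space with a learned matrix W.
rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 14, 14))   # CNN output (C, H, W)
regions = feat.reshape(512, 14 * 14).T      # -> (196, 512) region vectors
D_TEXT = 300                                # assumed textual embedding size
W = rng.standard_normal((512, D_TEXT)) * 0.01
embedded = regions @ W                      # -> (196, 300) in "q"-space
print(regions.shape, embedded.shape)
```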

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description, Retrieval, and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

(Chart: accuracy on the Visual Question Answering challenge, in %)

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [Clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 26: Deep Learning for Computer Vision: Language and vision (UPC 2016)

26

Encoder-Decoder: Beyond text

27

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

28

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015

30

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015
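The Show & Tell recipe can be sketched as greedy decoding: a CNN encodes the image into a vector that conditions a recurrent language model, which emits the caption one word at a time until an end token. In this illustrative sketch, a hand-made next-word table (`TABLE`, `toy_next_word`) stands in for the trained CNN+LSTM; none of these names come from the paper.

```python
# Toy sketch of Show & Tell's decoding loop (stand-in "model", not the
# paper's LSTM): start from the image, repeatedly pick the most likely
# next word until <end>.

def decode_caption(image_vector, next_word, max_len=10):
    caption, state = [], ("<start>", tuple(image_vector))
    for _ in range(max_len):
        word = next_word(state)
        if word == "<end>":
            break
        caption.append(word)
        state = (word, state[1])  # carry the image conditioning along
    return " ".join(caption)

# Hypothetical next-word function for one image.
TABLE = {"<start>": "a", "a": "kite", "kite": "flying", "flying": "<end>"}
def toy_next_word(state):
    return TABLE[state[0]]

print(decode_caption([0.2, 0.7], toy_next_word))  # a kite flying
```

The real model replaces `toy_next_word` with a softmax over the vocabulary at each LSTM step; beam search is commonly used instead of this greedy loop.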

31

Captioning: LSTM for image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]

32

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

Captioning (+ Detection): DenseCap

33

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

34

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016

36

Captioning: HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016

[Figure: a second-layer LSTM unit runs over time t = 1 … t = T; its hidden state at t = T summarizes the first chunk of image data]
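The hierarchical encoder idea can be sketched with stand-in reducers: a first-level encoder summarizes short chunks of the frame sequence, and a second-level encoder runs over those chunk summaries, shortening the path through time. Mean pooling below is only a placeholder for the paper's LSTM units.

```python
# Sketch of HRNE's two-level encoding (mean pooling stands in for LSTMs).

def chunk(seq, size):
    # Split the frame sequence into consecutive chunks.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def encode(vectors):
    # Stand-in for an LSTM encoder: per-dimension mean over the chunk.
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

frames = [[float(t)] for t in range(8)]          # 8 one-dimensional "frames"
level1 = [encode(c) for c in chunk(frames, 4)]   # one summary per chunk
video_code = encode(level1)                      # second-level summary
print(level1, video_code)  # [[1.5], [5.5]] [3.5]
```

With chunk size k, the second-level sequence is T/k long, so gradients travel far fewer recurrent steps than in a flat encoder over all T frames.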

37

Visual Question Answering

[Figure: the question "Is economic growth decreasing?" is encoded into [z1, z2, …, zN] and the image into [y1, y2, …, yM]; a decoder produces the answer "Yes"]

38

[Figure: VQA pipeline — extract visual features; embed the question ("What object is flying?"); merge both; predict the answer ("Kite")]

Visual Question Answering

Slide credit Issey Masuda
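The pipeline on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched end to end with toy components: a bag-of-words question embedding, pointwise-product fusion, and a linear answer classifier. All dimensions, weights, and the tiny vocabularies below are made up for illustration.

```python
# Toy VQA pipeline: visual features + question embedding -> merge -> answer.

def embed_question(question, vocab):
    # Bag-of-words question embedding: one slot per vocabulary word.
    words = question.lower().strip("?").split()
    return [1.0 if w in words else 0.0 for w in vocab]

def merge(img_feat, q_feat):
    # Pointwise-product fusion, a common merge choice in early VQA models.
    return [i * q for i, q in zip(img_feat, q_feat)]

def predict_answer(fused, answer_weights, answers):
    # Linear classifier over candidate answers.
    scores = [sum(w * f for w, f in zip(row, fused)) for row in answer_weights]
    return answers[scores.index(max(scores))]

vocab = ["what", "object", "is", "flying", "color"]
answers = ["kite", "red"]
# Hypothetical weights: the "kite" row responds to "flying"-related evidence.
answer_weights = [[0.0, 1.0, 0.0, 2.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0, 2.0]]
img_feat = [0.5, 0.9, 0.1, 0.8, 0.3]  # pretend CNN features

fused = merge(img_feat, embed_question("What object is flying?", vocab))
print(predict_answer(fused, answer_weights, answers))  # kite
```

Real systems replace each stand-in with a learned module (CNN features, an RNN question encoder, and a trained softmax classifier), but the merge-then-classify shape is the same.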

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)
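DPPnet's distinctive step is that a network conditioned on the question predicts the parameters of a "dynamic" fully-connected layer; because that layer is large, Noh et al. use a hashing trick so a small predicted vector can populate the full weight matrix. A minimal sketch of that hashing step follows; the md5-based hash and the dimensions are illustrative choices, not the paper's.

```python
# Sketch of DPPnet's hashed weight sharing: each weight W[i][j] of the
# dynamic layer is tied to one entry of a small question-predicted
# candidate-parameter vector, chosen by a fixed hash of (i, j).

import hashlib

def dynamic_weights(candidate_params, n_out, n_in):
    W = []
    for i in range(n_out):
        row = []
        for j in range(n_in):
            h = int(hashlib.md5(f"{i},{j}".encode()).hexdigest(), 16)
            row.append(candidate_params[h % len(candidate_params)])
        W.append(row)
    return W

# Pretend the question encoder predicted these 4 candidate parameters.
params = [0.1, -0.2, 0.3, 0.05]
W = dynamic_weights(params, n_out=3, n_in=5)
print(len(W), len(W[0]))  # 3 5
```

The hash is fixed, so gradients for every tied weight flow back into the same small predicted vector, keeping the parameter-prediction network tractable.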

40

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016)

41

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features to the "q" textual space.
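The feature-extraction steps above amount to index bookkeeping: flatten the 14×14 grid of the 512×14×14 VGG-19 map into 196 region vectors of dimension 512, then project each one linearly into the textual space. The dummy features and the 3-dimensional stand-in for the learned matrix W are illustrative only.

```python
# Turn a 512x14x14 feature map into 196 region vectors and project them.

D, H, Wd = 512, 14, 14   # channels, height, width of the VGG-19 pool5 map

def to_region_vectors(feat):
    # feat[c][y][x] -> list of H*W region vectors, each of length D.
    return [[feat[c][y][x] for c in range(D)]
            for y in range(H) for x in range(Wd)]

def project(region, W_embed):
    # Linear projection of one region into the textual embedding space.
    return [sum(w * r for w, r in zip(row, region)) for row in W_embed]

feat = [[[0.01] * Wd for _ in range(H)] for _ in range(D)]  # dummy features
regions = to_region_vectors(feat)
print(len(regions), len(regions[0]))  # 196 512

W_embed = [[0.001] * D for _ in range(3)]  # stand-in for learned W (3-d "q" space)
q_space = project(regions[0], W_embed)
print(len(q_space))  # 3
```

In the DMN+ itself the 196 projected vectors then play the role that encoded sentences play in the textual model, feeding the attention-based episodic memory.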

42

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016

43

Datasets: Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016)

44

Datasets: Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge: Storytelling

Storytelling

47

Challenge: Movie Description

Movie Description: Retrieval and Fill-in-the-blank

48

Challenges: Movie Question Answering

Movie Question Answering

49

Challenges: Visual Question Answering

Visual Question Answering

50

Accuracy (%) on the Visual Question Answering challenge:

Humans: 83.30

UC Berkeley & Sony: 66.47

Baseline, LSTM & CNN: 54.06

Baseline, Nearest neighbor: 42.85

Baseline, Prior per question type: 37.47

Baseline, All yes: 29.88

I. Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [clean code in Keras, perfect for beginners]

Challenges: Visual Question Answering

51

Summary:

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 27: Deep Learning for Computer Vision: Language and vision (UPC 2016)

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 28: Deep Learning for Computer Vision: Language and vision (UPC 2016)

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 29: Deep Learning for Computer Vision: Language and vision (UPC 2016)

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 30: Deep Learning for Computer Vision: Language and vision (UPC 2016)

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 31: Deep Learning for Computer Vision: Language and vision (UPC 2016)

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 32: Deep Learning for Computer Vision: Language and vision (UPC 2016)

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft COCO

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

VQA Challenge accuracy (%):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline (LSTM & CNN): 54.06
I. Masuda-Mora: 53.62
Baseline (nearest neighbor): 42.85
Baseline (prior per question type): 37.47
Baseline (all "yes"): 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Learn more: Julia Hockenmaier

54

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi


Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 45: Deep Learning for Computer Vision: Language and vision (UPC 2016)

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 46: Deep Learning for Computer Vision: Language and vision (UPC 2016)

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 47: Deep Learning for Computer Vision: Language and vision (UPC 2016)

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 48: Deep Learning for Computer Vision: Language and vision (UPC 2016)

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 49: Deep Learning for Computer Vision: Language and vision (UPC 2016)

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 50: Deep Learning for Computer Vision: Language and vision (UPC 2016)

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 51: Deep Learning for Computer Vision: Language and vision (UPC 2016)

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 52: Deep Learning for Computer Vision: Language and vision (UPC 2016)

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 53: Deep Learning for Computer Vision: Language and vision (UPC 2016)

53

Learn moreJulia Hockenmeirer

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 54: Deep Learning for Computer Vision: Language and vision (UPC 2016)

54

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi