LSTMs Overview
Subhashini Venugopalan
Neural Networks
[Figure: a feed-forward neural network with an input layer, two hidden layers, and an output layer]
Why RNNs/LSTMs?
Limitations of vanilla neural networks:
● Accept only fixed-size input, e.g. 224x224 images.
● Perform a fixed number of computations (#layers).
● Output a fixed-size vector.
[Figure: the same feed-forward network as on the previous slide]
Can we operate over sequences of inputs?
Recurrent Neural Networks
Image Credit: Chris Olah
They are networks with loops. [Elman ‘90]
Unroll the Loop
Image Credit: Chris Olah
Recurrent Neural Network “unrolled in time”
● Each time step has a layer with the same weights.
● The repeating layer/module is a sigmoid or a tanh.
● Learns to model (ht | x1, …, xt-1)
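To make the recurrence concrete, here is a minimal NumPy sketch of the unrolled computation above; the dimensions and weight names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                       # toy input and hidden sizes
Wxh = rng.normal(0, 0.1, (H, D))   # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights, shared across time
b = np.zeros(H)

h = np.zeros(H)                    # h0
for x in rng.normal(size=(5, D)):  # "unroll" over a 5-step input sequence
    h = np.tanh(Wxh @ x + Whh @ h + b)   # same weights applied at every time step
print(h.shape)                     # (16,)
```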
Simple RNNs
Image Credit: Chris Olah
[Figure: the repeating module of a simple RNN contains a single sigmoid or tanh layer]
Problems with Simple RNNs
● Can’t seem to handle “long-term dependencies” in practice.
● Gradients shrink through the many layers (vanishing gradients).
[Hochreiter ‘91] [Bengio et al. ‘94]
Image Credit: Chris Olah
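A rough numeric illustration of the issue (my example, not from the slides): backpropagating through T time steps multiplies roughly T recurrent factors, so a factor below 1 shrinks the gradient exponentially while a factor above 1 blows it up:

```python
# Gradient magnitude through T steps scales roughly like factor**T.
for T in (10, 50, 100):
    print(T, 0.9 ** T, 1.1 ** T)
# T=100: 0.9**100 ≈ 2.7e-05 (vanished), 1.1**100 ≈ 1.4e+04 (exploded)
```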
Long Short-Term Memory (LSTM)
Image Credit: Chris Olah
[Hochreiter and Schmidhuber ‘97]
LSTM Unit
[Figure: LSTM unit with inputs xt and ht-1 and output ht; a memory cell fed through input, forget, output, and input modulation gates]
● Memory cell: core of the LSTM unit; encodes all inputs observed so far.
● Gates: input, output, and forget; each is a sigmoid in [0,1].
● Updating the cell state through the gates is what lets the unit learn long-term dependencies.
[Hochreiter and Schmidhuber ‘97] [Graves ‘13]
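As a sketch of how the pieces above combine, here is one LSTM step in NumPy; the packed-weight layout and names are my assumptions, not the slides’:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W is (4H, D+H), b is (4H,)."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates in [0,1]
    g = np.tanh(g)                                 # input modulation gate
    c_t = f * c_prev + i * g                       # update the memory cell
    h_t = o * np.tanh(c_t)                         # gated output
    return h_t, c_t
```

The additive cell update (c_t = f · c_prev + i · g) is what mitigates the gradient shrinkage seen in the simple RNN.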
Can Model Sequences
● Can handle longer-term dependencies.
● Overcomes the vanishing gradients problem.
● GRUs (Gated Recurrent Units) are a much simpler variant that also overcomes these issues. [Cho et al. ‘14]
[Figure: a chain of LSTM units unrolled over time]
Putting Things Together
Image Credit: Sutskever et al.
Encode a sequence of inputs to a vector: (ht | x1, …, xt-1)
Decode from the vector to a sequence of outputs: Pr(xt | x1, …, xt-1)
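A toy sketch of this encode-then-decode loop (a plain tanh RNN cell stands in for the LSTM; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V, T_out = 8, 16, 10, 5            # toy sizes: input, hidden, vocab, output length
Wenc = rng.normal(0, 0.1, (H, D + H))    # encoder cell weights
Wdec = rng.normal(0, 0.1, (H, V + H))    # decoder cell weights
Wout = rng.normal(0, 0.1, (V, H))        # hidden state -> vocabulary logits

def rnn_step(x, h, W):
    return np.tanh(W @ np.concatenate([x, h]))

# Encode: fold the whole input sequence into a single vector h.
h = np.zeros(H)
for x in rng.normal(size=(7, D)):
    h = rnn_step(x, h, Wenc)

# Decode: emit one symbol at a time, feeding each prediction back in.
y = np.zeros(V)                          # start token (all zeros, an assumption)
for _ in range(T_out):
    h = rnn_step(y, h, Wdec)
    token = int(np.argmax(Wout @ h))     # greedy choice from Pr(xt | x1, …, xt-1)
    y = np.eye(V)[token]                 # one-hot feedback
    print(token)
```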
Solve a Wider Range of Problems
● Image Captioning: Vinyals et al. ‘15, Donahue et al. ‘15
● Activity Recognition: Donahue et al. ‘15
● Sequence to Sequence (Machine Translation: Sutskever et al. ‘14, Cho et al. ‘14; Speech Recognition: Graves & Jaitly ‘14; Video Description: Venugopalan et al. ‘15, Li et al. ‘15; VQA, POS tagging, ...)
Image Credit: Andrej Karpathy
3 of 4 papers to be discussed this class.
Resources
● Graves’ paper: LSTM explanation. “Generating Sequences with Recurrent Neural Networks.” Applications to handwriting and speech recognition.
● Chris Olah’s blog: LSTM unit explanation.
● Karpathy’s blog: applications.
● TensorFlow and Caffe: code examples.
Sequence to Sequence Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Objective
A monkey is pulling a dog’s tail and is chased by the dog.
Encode
Recurrent Neural Networks (RNNs) can map a vector to a sequence.
● English Sentence -> RNN encoder -> RNN decoder -> French Sentence [Sutskever et al. NIPS’14]
● Image -> Encode -> RNN decoder -> Sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]
● Video -> Encode -> RNN decoder -> Sentence [Venugopalan et al. NAACL’15]
● Video -> RNN encoder -> RNN decoder -> Sentence [Venugopalan et al. ICCV’15] (this work)
S2VT Overview
[Figure: a stack of two LSTM layers unrolled in time; CNN features of the video frames are fed in during the encoding stage, then the words “A man is talking” are emitted during the decoding stage]
Now decode it to a sentence!
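A toy sketch of the S2VT scheduling above: one shared two-layer stack first consumes frame features with padded word inputs, then consumes padded frame inputs while emitting words. A plain tanh cell stands in for the LSTMs, and all names, sizes, and the <BOS> convention are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
F, E, H, V = 32, 12, 16, 50          # toy sizes: frame feature, word embed, hidden, vocab

def cell(x, h, W):                   # a plain tanh cell standing in for an LSTM
    return np.tanh(W @ np.concatenate([x, h]))

W1 = rng.normal(0, 0.1, (H, F + H))      # layer 1 consumes frame features
W2 = rng.normal(0, 0.1, (H, E + H + H))  # layer 2 consumes word embedding + layer-1 state
Wout = rng.normal(0, 0.1, (V, H))        # layer-2 state -> vocabulary logits
embed = rng.normal(0, 0.1, (V, E))       # word embeddings

h1 = np.zeros(H); h2 = np.zeros(H)
# Encoding stage: feed CNN frame features; the word input is padded with zeros.
for feat in rng.normal(size=(10, F)):
    h1 = cell(feat, h1, W1)
    h2 = cell(np.concatenate([np.zeros(E), h1]), h2, W2)
# Decoding stage: frame input is padded with zeros; layer 2 now emits words.
word = 0                                 # assume index 0 is a <BOS> token
for _ in range(6):
    h1 = cell(np.zeros(F), h1, W1)
    h2 = cell(np.concatenate([embed[word], h1]), h2, W2)
    word = int(np.argmax(Wout @ h2))     # greedy decoding; training feeds ground truth
    print(word)
```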
Frames: RGB
1. Train a CNN on ImageNet (1000 categories).
2. Forward propagate each frame and take the activations from the layer before classification: the “fc7” features, a 4096-dimension feature vector.
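A sketch of step 2 using torchvision’s ImageNet-pretrained AlexNet as a stand-in for the slides’ Caffe model; the weight tag, preprocessing constants, and layer slicing here are my assumptions:

```python
import torch, torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
model.classifier = model.classifier[:-1]   # drop fc8; keep activations after fc7

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

frame = Image.new("RGB", (320, 240))       # stand-in for a real video frame
with torch.no_grad():
    fc7 = model(preprocess(frame).unsqueeze(0))
print(fc7.shape)                           # torch.Size([1, 4096])
```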
Frames: Flow
1. Train a CNN (modified AlexNet) on activity classes (UCF-101: 101 action classes).
2. Use optical flow to extract flow images. [T. Brox et al. ECCV ‘04]
3. Forward propagate and take the activations from the layer before classification: the “fc7” features, a 4096-dimension feature vector.
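The paper uses Brox flow; as an assumed stand-in, here is a sketch with OpenCV’s Farneback flow, packed into a 3-channel “flow image” (x-flow, y-flow, magnitude) so a CNN can consume it like an RGB frame. The scaling and 128-centering are my choices for illustration:

```python
import cv2
import numpy as np

prev = np.zeros((240, 320), np.uint8)   # stand-ins for two consecutive gray frames
curr = np.zeros((240, 320), np.uint8)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)

# Pack into a 3-channel image, centered at 128 so zero motion is mid-gray.
to_u8 = lambda a: np.clip(a * 16 + 128, 0, 255).astype(np.uint8)
mag = np.linalg.norm(flow, axis=2)
flow_image = np.dstack([to_u8(flow[..., 0]), to_u8(flow[..., 1]),
                        np.clip(mag * 16, 0, 255).astype(np.uint8)])
print(flow_image.shape)                  # (240, 320, 3)
```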
Dataset: YouTube
● ~2,000 clips
● Avg. length: 11s per clip
● ~40 sentences per clip
● ~81,000 sentences
Example descriptions for one clip:
● A man is walking on a rope.
● A man is walking across a rope.
● A man is balancing on a rope.
● A man is balancing on a rope at the beach.
● A man walks on a tightrope at the beach.
● A man is balancing on a volleyball net.
● A man is walking on a rope held by poles.
● A man balanced on a wire.
● The man is balancing on the wire.
● A man is walking on a rope.
● A man is standing in the sea shore.
Results (YouTube)
METEOR scores (%):
● Mean-Pool (VGG): 27.7
● S2VT (randomized): 28.2
● S2VT (RGB): 29.2
● S2VT (RGB+Flow): 29.8
METEOR: an MT metric; considers alignment, paraphrases, and similarity.
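A sketch of scoring a caption with NLTK’s METEOR implementation (assumes nltk plus its wordnet data are installed; this implementation can differ slightly from the official METEOR tool the paper used):

```python
# Requires: pip install nltk, then nltk.download("wordnet") for synonym matching.
from nltk.translate.meteor_score import meteor_score

references = ["a man is walking on a rope".split(),
              "a man is balancing on a wire".split()]
hypothesis = "a man walks on a tightrope".split()
print(meteor_score(references, hypothesis))   # in [0, 1]; higher is better
```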
Evaluation: Movie Corpora
MPII-MD
● MPII, Germany
● DVS alignment: semi-automated and crowdsourced
● 94 movies
● 68,000 clips
● Avg. length: 3.9s per clip
● ~1 sentence per clip
● 68,375 sentences
M-VAD
● Univ. of Montreal
● DVS alignment: automated speech extraction
● 92 movies
● 46,009 clips
● Avg. length: 6.2s per clip
● 1-2 sentences per clip
● 56,634 sentences
Movie Corpus - DVS (Descriptive Video Service)
Processed: Looking troubled, someone descends the stairs.
Someone rushes into the courtyard. She then puts a head scarf on ...
Results (MPII-MD Movie Corpus)
METEOR scores (%):
● Best Prior Work [Rohrbach et al. CVPR’15]: 5.6
● Mean-Pool: 6.7
● S2VT (RGB): 7.1
Results (M-VAD Movie Corpus)
METEOR scores (%):
● Best Prior Work [Yao et al. ICCV’15]: 4.3
● Mean-Pool: 6.1
● S2VT (RGB): 6.7
M-VAD: https://youtu.be/pER0mjzSYaM
Discussion
● What are the advantages/drawbacks of this approach?
  ○ End-to-end, annotations
● Detaching recognition and generation.
● Why only METEOR (not BLEU or other metrics)?
● Domain adaptation, re-using RNNs (YouTube -> movies, activity recognition).
● Languages other than English.
● Features apart from optical flow and RGB; temporal representation.
Code and more examples: http://vsubhashini.github.io/s2vt.html