Structured Attention Networks
Yoon Kim∗ Carl Denton∗ Luong Hoang Alexander M. Rush
HarvardNLP
1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
Computational Challenges
Structured Attention In Practice
4 Conclusion and Future Work
Pure Encoder-Decoder Network
Input (sentence, image, etc.)
Fixed-Size Encoder (MLP, RNN, CNN)
Encoder(input) ∈ R^D
Decoder
Decoder(Encoder(input))
Example: Neural Machine Translation (Sutskever et al., 2014)
Encoder-Decoder with Attention
Machine Translation (Bahdanau et al., 2015; Luong et al., 2015)
Question Answering (Hermann et al., 2015; Sukhbaatar et al., 2015)
Natural Language Inference (Rocktaschel et al., 2016; Parikh et al., 2016)
Algorithm Learning (Graves et al., 2014, 2016; Vinyals et al., 2015a)
Parsing (Vinyals et al., 2015b)
Speech Recognition (Chorowski et al., 2015; Chan et al., 2015)
Summarization (Rush et al., 2015)
Caption Generation (Xu et al., 2015)
and more...
Neural Attention
Input (sentence, image, etc.)
Memory-Bank Encoder (MLP, RNN, CNN)
Encoder(input) = x1, x2, . . . , xT
Attention Distribution Context Vector
“where” “what”
Decoder
Attention-based Neural Machine Translation (Bahdanau et al., 2015)
Question Answering (Sukhbaatar et al., 2015)
Other Applications: Image Captioning (Xu et al., 2015)
Other Applications: Speech Recognition (Chan et al., 2015)
Applications From HarvardNLP: Summarization (Rush et al., 2015)
Applications From HarvardNLP: Image-to-Latex (Deng et al., 2016)
Applications From HarvardNLP: OpenNMT
1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
Computational Challenges
Structured Attention In Practice
4 Conclusion and Future Work
Attention Networks: Notation
x1, . . . , xT Memory bank
q Query
z Memory selection (random variable)
p(z |x, q; θ) Attention distribution (“where”)
f(x, z) Annotation function (“what”)
c = E_{z | x, q}[f(x, z)] Context Vector
End-to-End Requirements:
1 Need to compute attention p(z = i | x, q; θ)
2 Need to backpropagate to learn parameters θ
Attention Networks: Machine Translation
x1, . . . , xT Memory bank Source RNN hidden states
q Query Decoder hidden state
z Memory selection Source position {1, . . . , T}
p(z = i | x, q; θ) Attention distribution softmax(x_i^⊤ q)
f(x, z) Annotation function Memory at time z, i.e. x_z
c = E[f(x, z)] Context Vector
End-to-End Requirements:
1 Need to compute attention p(z = i | x, q; θ)
=⇒ softmax function
2 Need to backpropagate to learn parameters θ
=⇒ Backprop through softmax function
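For concreteness, here is a minimal NumPy sketch of this standard attention step; the function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over memory positions.
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def simple_attention(X, q):
    """Dot-product attention.

    X : (T, D) memory bank x_1, ..., x_T (e.g. source RNN hidden states)
    q : (D,)   query (e.g. decoder hidden state)
    Returns p(z = i | x, q) and the context vector c = E_z[f(x, z)]
    with f(x, z) = x_z.
    """
    p = softmax(X @ q)   # p(z = i | x, q) = softmax(x_i^T q)
    c = p @ X            # c = sum_i p(z = i) x_i
    return p, c

# Toy usage: T = 5 memory slots of dimension D = 4.
X, q = np.random.randn(5, 4), np.random.randn(4)
p, c = simple_attention(X, q)
assert np.isclose(p.sum(), 1.0)
```

Since the whole computation is a softmax followed by a weighted sum, backpropagating to θ is ordinary backprop through these two operations.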
1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
Computational Challenges
Structured Attention In Practice
4 Conclusion and Future Work
Structured Attention Networks
Replace simple attention with a distribution over a combinatorial set of structures
Attention distribution represented with a graphical model over multiple latent variables
Compute attention using embedded inference.
New Model
p(z |x, q; θ) Attention distribution over structures z
Structured Attention Networks for Neural Machine Translation
Motivation: Structured Output Prediction
Modeling the structured output (i.e. a graphical model on top of a neural net) has improved performance (LeCun et al., 1998; Lafferty et al., 2001; Collobert et al., 2011)
Given a sequence x = x1, . . . , xT
Factored potentials θ_{i,i+1}(z_i, z_{i+1}; x)

p(z | x; θ) = softmax( ∑_{i=1}^{T−1} θ_{i,i+1}(z_i, z_{i+1}; x) ) = (1/Z) exp( ∑_{i=1}^{T−1} θ_{i,i+1}(z_i, z_{i+1}; x) )
Neural CRF for Sequence Tagging (Collobert et al., 2011)
Factored potentials θ come from a neural network.
Inference in Linear-Chain CRF
Fast algorithms for computing the marginals p(z_i | x)
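To make these marginals concrete, here is a hedged brute-force sketch that enumerates every label sequence of a toy linear-chain CRF with the pairwise potentials above; the forward-backward algorithm on the following slides computes the same p(z_i | x) in O(T·K²) time instead of O(K^T). The array shapes and potentials are illustrative assumptions.

```python
import itertools
import numpy as np

def brute_force_marginals(theta):
    """Marginals p(z_i = k | x) of a linear-chain CRF by enumeration.

    theta : (T-1, K, K) pairwise log-potentials theta_{i,i+1}(z_i, z_{i+1}; x).
    Only feasible for tiny T and K; it defines what forward-backward computes.
    """
    Tm1, K, _ = theta.shape
    T = Tm1 + 1
    scores = {z: sum(theta[i, z[i], z[i + 1]] for i in range(Tm1))
              for z in itertools.product(range(K), repeat=T)}
    logZ = np.logaddexp.reduce(np.array(list(scores.values())))
    marg = np.zeros((T, K))
    for z, s in scores.items():
        for i, k in enumerate(z):
            marg[i, k] += np.exp(s - logZ)
    return marg

theta = np.random.randn(3, 2, 2)      # T = 4 positions, K = 2 labels
print(brute_force_marginals(theta))   # each row sums to 1
```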
1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
Computational Challenges
Structured Attention In Practice
4 Conclusion and Future Work
Structured Attention Networks: Notation
x1, . . . , xT Memory bank -
q Query -
z1, . . . , zT Memory selection Selection over structures
p(zi |x, q; θ) Attention distribution Marginal distributions
f(x, z) Annotation function Neural representation
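The reason marginals are the right object: assuming the annotation function decomposes over the individual selections, f(x, z) = Σ_i f_i(x, z_i) (as it does for the models below), the context vector only needs per-part marginals. A sketch of this step under that assumption:

```latex
c = \mathbb{E}_{z \sim p(z \mid x, q)}\left[f(x, z)\right]
  = \sum_{i=1}^{T} \mathbb{E}_{z_i \sim p(z_i \mid x, q)}\left[f_i(x, z_i)\right]
  \qquad \text{when } f(x, z) = \sum_{i=1}^{T} f_i(x, z_i).
```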
Challenge: End-to-End Training
Requirements:
1 Compute attention distribution (marginals) p(z_i | x, q; θ)
=⇒ Forward-backward algorithm
2 Gradients wrt attention distribution parameters θ.
=⇒ Backpropagation through forward-backward algorithm
Review: Forward-Backward Algorithm in Practice
θ: input potentials (e.g. from NN)
α, β: dynamic programming tables
procedure StructAttention(θ)
  Forward:
  for i = 1, . . . , n; z_i do
    α[i, z_i] ← ∑_{z_{i−1}} α[i−1, z_{i−1}] × exp(θ_{i−1,i}(z_{i−1}, z_i))
  Backward:
  for i = n, . . . , 1; z_i do
    β[i, z_i] ← ∑_{z_{i+1}} β[i+1, z_{i+1}] × exp(θ_{i,i+1}(z_i, z_{i+1}))
Forward-Backward Algorithm (Log-Space Semiring Trick)
θ: input potentials (e.g. from MLP or parameters)
x ⊕ y = log(exp(x) + exp(y))
x ⊗ y = x + y
procedure StructAttention(θ)
  Forward:
  for i = 1, . . . , n; z_i do
    α[i, z_i] ← ⊕_{z_{i−1}} α[i−1, z_{i−1}] ⊗ θ_{i−1,i}(z_{i−1}, z_i)
  Backward:
  for i = n, . . . , 1; z_i do
    β[i, z_i] ← ⊕_{z_{i+1}} β[i+1, z_{i+1}] ⊗ θ_{i,i+1}(z_i, z_{i+1})
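A minimal NumPy sketch of this log-space forward-backward pass, returning the node marginals p(z_i | x) that serve as the structured attention weights; the shapes and toy potentials are illustrative assumptions, not the released Torch implementation.

```python
import numpy as np

def forward_backward(theta):
    """Log-space forward-backward for a linear-chain CRF.

    theta : (T-1, K, K) log-potentials theta_{i,i+1}(z_i, z_{i+1}).
    Returns marg[i, k] = p(z_i = k | x).
    """
    Tm1, K, _ = theta.shape
    T = Tm1 + 1
    alpha = np.zeros((T, K))                # alpha[0, :] = 0 = log 1
    beta = np.zeros((T, K))                 # beta[T-1, :] = 0 = log 1
    for i in range(1, T):                   # alpha[i] = ⊕_{z_{i-1}} alpha[i-1] ⊗ theta
        alpha[i] = np.logaddexp.reduce(alpha[i - 1][:, None] + theta[i - 1], axis=0)
    for i in range(T - 2, -1, -1):          # beta[i] = ⊕_{z_{i+1}} beta[i+1] ⊗ theta
        beta[i] = np.logaddexp.reduce(theta[i] + beta[i + 1][None, :], axis=1)
    logZ = np.logaddexp.reduce(alpha[-1])   # log partition function A
    return np.exp(alpha + beta - logZ)

theta = np.random.randn(3, 2, 2)
marg = forward_backward(theta)
print(marg.sum(axis=1))   # each row sums to 1; matches the brute-force sketch above
```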
Structured Attention Networks for Neural Machine Translation
Backpropagating through Forward-Backward
∇^L_p : Gradient of an arbitrary loss L with respect to the marginals p
(α̂, β̂ : backpropagated dynamic programming tables)
procedure BackpropStructAtten(θ, p, ∇^L_α, ∇^L_β)
  Backprop Backward:
  for i = n, . . . , 1; z_i do
    β̂[i, z_i] ← ∇^L_α[i, z_i] ⊕ ⊕_{z_{i+1}} θ_{i,i+1}(z_i, z_{i+1}) ⊗ β̂[i+1, z_{i+1}]
  Backprop Forward:
  for i = 1, . . . , n; z_i do
    α̂[i, z_i] ← ∇^L_β[i, z_i] ⊕ ⊕_{z_{i−1}} θ_{i−1,i}(z_{i−1}, z_i) ⊗ α̂[i−1, z_{i−1}]
  Potential Gradients:
  for i = 1, . . . , n; z_i, z_{i+1} do
    ∇^L_{θ_{i,i+1}}(z_i, z_{i+1}) ← signexp(α̂[i, z_i] ⊗ β[i+1, z_{i+1}] ⊕ α[i, z_i] ⊗ β̂[i+1, z_{i+1}] ⊕ α[i, z_i] ⊗ β[i+1, z_{i+1}] ⊗ −A)
Interesting Issue: Negative Gradients Through Attention
∇^L_p : Gradients can be negative, but we are working in log space!
Signed Log-space semifield Trick (Li and Eisner, 2009)
Use tuples (la, sa) where la = log |a| and sa = sign(a)
| s_a | s_b | l_{a+b} | s_{a+b} |
|---|---|---|---|
| + | + | l_a + log(1 + d) | + |
| + | − | l_a + log(1 − d) | + |
| − | + | l_a + log(1 − d) | − |
| − | − | l_a + log(1 + d) | − |

(here d = exp(l_b − l_a), with operands ordered so that |a| ≥ |b|)
(Similar rules for ⊗)
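A small Python sketch of the signed log-space representation and the ⊕ rule in the table above; the helper names are illustrative, and d is assumed to be exp(l_b − l_a) with the operands ordered so that |a| ≥ |b|.

```python
import math

def signed_log(a):
    """Represent a real number a as (log|a|, sign(a))."""
    if a == 0.0:
        return (float("-inf"), 1.0)
    return (math.log(abs(a)), math.copysign(1.0, a))

def sl_add(x, y):
    """⊕ in the signed log-space semifield: returns the tuple for a + b."""
    (la, sa), (lb, sb) = x, y
    if lb > la:                        # order so that |a| >= |b|
        (la, sa), (lb, sb) = (lb, sb), (la, sa)
    if lb == float("-inf"):            # adding zero
        return (la, sa)
    d = math.exp(lb - la)              # d = exp(l_b - l_a) <= 1
    if sa == sb:                       # same sign: magnitudes add
        return (la + math.log1p(d), sa)
    if d == 1.0:                       # equal magnitude, opposite sign: exact zero
        return (float("-inf"), sa)
    return (la + math.log1p(-d), sa)   # opposite signs: magnitudes subtract

def sl_mul(x, y):
    """⊗: add log-magnitudes, multiply signs."""
    return (x[0] + y[0], x[1] * y[1])

# Toy check: 3 + (-5) = -2.
l, s = sl_add(signed_log(3.0), signed_log(-5.0))
print(s * math.exp(l))   # approximately -2.0
```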
1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
Computational Challenges
Structured Attention In Practice
4 Conclusion and Future Work
Implementation
(http://github.com/harvardnlp/struct-attn)
General-purpose structured attention unit.
All dynamic programming is GPU optimized for speed.
Additionally supports pairwise potentials and marginals.
NLP Experiments
Machine Translation
Question Answering
Natural Language Inference
Segmental-Attention for Neural Machine Translation
Use a segmentation CRF for attention, i.e. binary selection vectors of length T
p(z1, . . . , zT |x, q) parameterized with a linear-chain CRF.
Neural “phrase-based” translation.
Unary potentials (Encoder RNN):
θ_i(1) = x_i W q,   θ_i(0) = 0
Pairwise potentials (Simple Parameters):
4 additional binary parameters (i.e., b_{0,0}, b_{0,1}, b_{1,0}, b_{1,1})
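A hedged sketch of how this segmental attention layer could be assembled from the pieces above: the bilinear unary score, the way the four pairwise parameters are folded into edge potentials, and the use of p(z_i = 1 | x, q) to weight each memory are illustrative assumptions, not the released implementation.

```python
import numpy as np

def linear_chain_marginals(theta):
    """Log-space forward-backward (same recursion as the earlier sketch)."""
    T, K = theta.shape[0] + 1, theta.shape[1]
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    for i in range(1, T):
        alpha[i] = np.logaddexp.reduce(alpha[i - 1][:, None] + theta[i - 1], axis=0)
    for i in range(T - 2, -1, -1):
        beta[i] = np.logaddexp.reduce(theta[i] + beta[i + 1][None, :], axis=1)
    logZ = np.logaddexp.reduce(alpha[-1])
    return np.exp(alpha + beta - logZ)

def segmental_attention(X, q, W, b):
    """Segment-level ("phrase") attention via a binary linear-chain CRF.

    X : (T, D) encoder states, q : (D,) decoder query,
    W : (D, D) bilinear parameter, b : (2, 2) pairwise potentials b_{k,k'}.
    Returns p[i] = p(z_i = 1 | x, q) and a context vector c = sum_i p[i] x_i.
    """
    T = X.shape[0]
    unary = np.zeros((T, 2))
    unary[:, 1] = X @ W @ q               # theta_i(1) = x_i W q, theta_i(0) = 0
    theta = np.zeros((T - 1, 2, 2))
    for i in range(T - 1):
        theta[i] = unary[i][:, None] + b  # fold the left-endpoint unary into each edge
    theta[-1] += unary[-1][None, :]       # attach the final unary exactly once
    p = linear_chain_marginals(theta)[:, 1]
    return p, p @ X

T, D = 6, 8
X, q = np.random.randn(T, D), np.random.randn(D)
W, b = np.random.randn(D, D), np.random.randn(2, 2)
p, c = segmental_attention(X, q, W, b)
print(np.round(p, 3))   # per-position selection probabilities in [0, 1]
```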
Neural Machine Translation Experiments
Data:
Japanese → English (from WAT 2015)
Traditionally, word segmentation is applied as a preprocessing step
Use structured attention to learn an implicit segmentation model
Experiments:
Japanese characters → English words
Japanese words → English words
Neural Machine Translation Experiments
| | Simple | Sigmoid | Structured |
|---|---|---|---|
| Char → Word | 12.6 | 13.1 | 14.6 |
| Word → Word | 14.1 | 13.8 | 14.3 |

BLEU scores on test set (higher is better).
Models:
Simple softmax attention
Sigmoid attention
Structured attention
Attention Visualization: Ground Truth
Attention Visualization: Simple Attention
Attention Visualization: Sigmoid Attention
Attention Visualization: Structured Attention
Simple Non-Factoid Question Answering
Simple attention: Greedy soft-selection of K supporting facts
Structured Attention Networks for Question Answering
Structured attention: Consider all possible sequences
Structured Attention Networks for Question Answering
bAbI tasks (Weston et al., 2015): 1k questions per task
| Task | K | Ans % (Simple) | Fact % (Simple) | Ans % (Structured) | Fact % (Structured) |
|---|---|---|---|---|---|
| Task 02 | 2 | 87.3 | 46.8 | 84.7 | 81.8 |
| Task 03 | 3 | 52.6 | 1.4 | 40.5 | 0.1 |
| Task 11 | 2 | 97.8 | 38.2 | 97.7 | 80.8 |
| Task 13 | 2 | 95.6 | 14.8 | 97.0 | 36.4 |
| Task 14 | 2 | 99.9 | 77.6 | 99.7 | 98.2 |
| Task 15 | 2 | 100.0 | 59.3 | 100.0 | 89.5 |
| Task 16 | 3 | 97.1 | 91.0 | 97.9 | 85.6 |
| Task 17 | 2 | 61.1 | 23.9 | 60.6 | 49.6 |
| Task 18 | 2 | 86.4 | 3.3 | 92.2 | 3.9 |
| Task 19 | 2 | 21.3 | 10.2 | 24.4 | 11.5 |
| Average | − | 81.4 | 39.6 | 81.0 | 53.7 |
Visualization of Structured Attention
Natural Language Inference
Given a premise (P) and a hypothesis (H), predict the relationship:
Entailment (E), Contradiction (C), Neutral (N)
$ A boy is running outside .
Many existing models run parsing as a preprocessing step and attend over parse trees.
Neural CRF Parsing (Durrett and Klein, 2015; Kiperwasser and Goldberg, 2016)
Syntactic Attention Network
1 Attention distribution (probability of a parse tree)
=⇒ Inside/outside algorithm
2 Gradients wrt attention distribution parameters: ∂L/∂θ
=⇒ Backpropagation through inside/outside algorithm
Forward/backward pass on the inside-outside version of Eisner's algorithm (Eisner, 1996) takes O(T³) time.
Backpropagation through Inside-Outside Algorithm
Structured Attention Networks with a Parser (“Syntactic Attention”)
Structured Attention Networks for Natural Language Inference
Dataset: Stanford Natural Language Inference (Bowman et al., 2015)
| Model | Accuracy % |
|---|---|
| No Attention | 85.8 |
| Hard parent | 86.1 |
| Simple Attention | 86.2 |
| Structured Attention | 86.8 |
No attention: word embeddings only
“Hard” parent from a pipelined dependency parser
Simple attention (simple softmax instead of syntactic attention)
Structured attention (soft parents from syntactic attention)
Structured Attention Networks for Natural Language Inference
Run Viterbi algorithm on the parsing layer to get the MAP parse:
ẑ = argmax_z p(z | x, q)
$ The men are fighting outside a deli .
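The MAP computation is the same dynamic program with ⊕ replaced by max. A minimal sketch for the linear-chain case follows; the NLI model itself runs the analogous Viterbi variant of the parsing algorithm over dependency trees, so this is only an illustration of the principle, and the names are assumptions.

```python
import numpy as np

def viterbi(theta):
    """argmax_z sum_i theta_{i,i+1}(z_i, z_{i+1}) for a linear chain.

    Same recursion as the forward pass above, with max in place of logsumexp.
    """
    Tm1, K, _ = theta.shape
    T = Tm1 + 1
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    for i in range(1, T):
        scores = delta[i - 1][:, None] + theta[i - 1]   # indexed [z_{i-1}, z_i]
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0)
    z = [int(delta[-1].argmax())]
    for i in range(T - 1, 0, -1):                       # trace back the best path
        z.append(int(back[i, z[-1]]))
    return list(reversed(z))

theta = np.random.randn(3, 2, 2)
print(viterbi(theta))   # e.g. [1, 0, 1, 1]
```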
1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
Computational Challenges
Structured Attention In Practice
4 Conclusion and Future Work
Structured Attention Networks
Generalize attention to incorporate latent structure
Exact inference through dynamic programming
Training remains end-to-end.
Future work
Approximate differentiable inference in neural networks
Incorporate other probabilistic models into deep learning.
Compare further to methods using EM or hard structures.
Other Work: Lie-Access Neural Memory (Yang and Rush, 2017)
References I
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation
by Jointly Learning to Align and Translate. In Proceedings of ICLR.
Bowman, S. R., Manning, C. D., and Potts, C. (2015). Tree-Structured
Composition in Neural Networks without Tree-Structured Architectures. In
Proceedings of the NIPS workshop on Cognitive Computation: Integrating
Neural and Symbolic Approaches.
Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2015). Listen, Attend and Spell.
arXiv:1508.01211.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015).
Attention-Based Models for Speech Recognition. In Proceedings of NIPS.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and
Kuksa, P. (2011). Natural Language Processing (almost) from Scratch.
Journal of Machine Learning Research, 12:2493–2537.
References II
Deng, Y., Kanervisto, A., and Rush, A. M. (2016). What You Get Is What
You See: A Visual Markup Decompiler. arXiv:1609.04938.
Durrett, G. and Klein, D. (2015). Neural CRF Parsing. In Proceedings of ACL.
Eisner, J. M. (1996). Three New Probabilistic Models for Dependency Parsing:
An Exploration. In Proceedings of ACL.
Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing Machines.
arXiv:1410.5401.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I.,
Grabska-Barwinska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T.,
Agapiou, J., Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain,
A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis,
D. (2016). Hybrid Computing Using a Neural Network with Dynamic
External Memory. Nature.
References III
Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W.,
Suleyman, M., and Blunsom, P. (2015). Teaching Machines to Read and
Comprehend. In Proceedings of NIPS.
Kiperwasser, E. and Goldberg, Y. (2016). Simple and Accurate Dependency
Parsing using Bidirectional LSTM Feature Representations. In TACL.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data. In
Proceedings of ICML.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based
Learning Applied to Document Recognition. In Proceedings of IEEE.
Li, Z. and Eisner, J. (2009). First- and Second-Order Expectation Semirings
with Applications to Minimum-Risk Training on Translation Forests. In
Proceedings of EMNLP 2009.
References IV
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective Approaches to
Attention-based Neural Machine Translation. In Proceedings of EMNLP.
Parikh, A. P., Tackstrom, O., Das, D., and Uszkoreit, J. (2016). A
Decomposable Attention Model for Natural Language Inference. In
Proceedings of EMNLP.
Rocktaschel, T., Grefenstette, E., Hermann, K. M., Kocisky, T., and Blunsom,
P. (2016). Reasoning about Entailment with Neural Attention. In
Proceedings of ICLR.
Rush, A. M., Chopra, S., and Weston, J. (2015). A Neural Attention Model
for Abstractive Sentence Summarization. In Proceedings of EMNLP.
Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). End-To-End
Memory Networks. In Proceedings of NIPS.
Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to Sequence Learning
with Neural Networks. In Proceedings of NIPS.
References V
Vinyals, O., Fortunato, M., and Jaitly, N. (2015a). Pointer Networks. In
Proceedings of NIPS.
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G.
(2015b). Grammar as a Foreign Language. In Proceedings of NIPS.
Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merrienboer, B., Joulin,
A., and Mikolov, T. (2015). Towards AI-Complete Question Answering: A
Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R.,
and Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption
Generation with Visual Attention. In Proceedings of ICML.