Deep Learning for Natural Language Processing
Jindřich Helcl
April 14, 2020
NPFL124 Introduction to Natural Language Processing
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Outline
• Neural Networks Basics
• Representing Words
• Representing Sequences
  • Recurrent Networks
  • Convolutional Networks
  • Self-attentive Networks
• Classification and Labeling
• Generating Sequences
• Pre-training Representations
  • Word2Vec
  • ELMo
  • BERT
Deep Learning for Natural Language Processing 1/90
Deep Learning in NLP

• learning NLP tasks end-to-end with deep learning is the number-one approach in current research
• state of the art in POS tagging, parsing, named-entity recognition, machine translation, …
• good news: training requires almost no linguistic insight
• bad news: requires an enormous amount of training data and very large computational power
What is deep learning?

• buzzword for machine learning with many-layered neural networks trained by back-propagation
• learning a real-valued function with millions of parameters that solves a particular problem
• learning more and more abstract representations of the input data until we reach a representation suitable for our problem
Neural Networks Basics
Single Neuron

A single neuron takes inputs 𝑥1, … , 𝑥𝑛 with weights 𝑤1, … , 𝑤𝑛, computes the weighted sum ∑𝑖 𝑥𝑖 ⋅ 𝑤𝑖, and applies an activation function, here a simple threshold:

output = 1 if ∑𝑖 𝑥𝑖 ⋅ 𝑤𝑖 > 0, otherwise 0
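As a minimal sketch (plain Python; the function name and example weights are illustrative, not from the slides), the threshold neuron above can be written as:

```python
def neuron(inputs, weights):
    """Single neuron: weighted sum of the inputs, then a threshold activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > 0 else 0

# With weights (1, 1), the neuron fires whenever any {0, 1} input is active,
# i.e. it implements logical OR.
print(neuron([1, 0], [1.0, 1.0]))  # 1
print(neuron([0, 0], [1.0, 1.0]))  # 0
```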
Neural Network

Forward pass (stacked layers with hidden activation 𝑓 and output activation 𝑔):

ℎ1 = 𝑓(𝑊1𝑥 + 𝑏1)
ℎ2 = 𝑓(𝑊2ℎ1 + 𝑏2)
⋮
ℎ𝑛 = 𝑓(𝑊𝑛ℎ𝑛−1 + 𝑏𝑛)
𝑜 = 𝑔(𝑊𝑜ℎ𝑛 + 𝑏𝑜)
𝐸 = 𝑒(𝑜, 𝑡)

Backward pass (error back-propagation via the chain rule, e.g. for the output weights):

∂𝐸/∂𝑊𝑜 = ∂𝐸/∂𝑜 ⋅ ∂𝑜/∂𝑊𝑜
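The forward pass can be sketched in plain Python (tanh as the hidden activation 𝑓, identity as 𝑔; the tiny weight matrices are illustrative):

```python
import math

def dense(x, W, b, f):
    """One fully connected layer: f(Wx + b) for a vector x (W given as rows)."""
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def forward(x, layers):
    """Apply a list of (W, b, activation) layers in sequence."""
    h = x
    for W, b, f in layers:
        h = dense(h, W, b, f)
    return h

# Two tiny layers: 2 -> 2 with tanh, then 2 -> 1 with the identity.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1], math.tanh),
    ([[1.0, 1.0]], [0.0], lambda v: v),
]
print(forward([1.0, 2.0], layers))
```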
Implementation

Logistic regression:

𝑦 = 𝜎 (𝑊𝑥 + 𝑏) (1)

Computation graph: the forward graph computes ℎ = 𝑊 × 𝑥, then 𝑜 = ℎ + 𝑏, applies 𝜎, and compares the result with the target 𝑦∗ in the loss; the backward graph mirrors it in reverse, producing the gradients 𝑜′, ℎ′, 𝑏′, and 𝑊′.
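The forward and backward graphs can be sketched for the scalar case (plain Python with a hand-derived chain rule and a squared-error loss as an illustrative choice, not any particular framework's API):

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def forward_backward(x, w, b, y_star):
    """Forward pass of y = sigmoid(w*x + b) with squared-error loss,
    then backward pass computing dE/dw and dE/db by the chain rule."""
    h = w * x                        # forward: h = W * x
    o = h + b                        # forward: o = h + b
    y = sigmoid(o)                   # forward: y = sigma(o)
    loss = 0.5 * (y - y_star) ** 2   # loss against the target y*

    dy = y - y_star                  # dE/dy
    do = dy * y * (1.0 - y)          # sigma'(o) = y * (1 - y)
    db = do                          # o = h + b  ->  dE/db = dE/do
    dh = do
    dw = dh * x                      # h = w * x  ->  dE/dw = dE/dh * x
    return loss, dw, db

loss, dw, db = forward_backward(x=2.0, w=0.5, b=0.0, y_star=1.0)
```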
Frameworks for Deep Learning

research and prototyping in Python

TensorFlow (static computation graphs):
• graph statically constructed, symbolic computation
• computation happens in a session
• allows graph export and running as a binary

PyTorch (dynamic computation graphs):
• computations written dynamically as normal procedural code
• easy debugging: inspecting variables at any time of the computation
Representing Words
Language Modeling

• estimate the probability of the next word in a text:

P(𝑤𝑖 | 𝑤𝑖−1, 𝑤𝑖−2, … , 𝑤1)

• standard approach: 𝑛-gram models with the Markov assumption, smoothed by interpolating count-based estimates:

P(𝑤𝑖 | 𝑤𝑖−1, … , 𝑤1) ≈ P(𝑤𝑖 | 𝑤𝑖−1, … , 𝑤𝑖−𝑛) ≈ ∑𝑗=0…𝑛 𝜆𝑗 ⋅ 𝑐(𝑤𝑖−𝑗, … , 𝑤𝑖) / 𝑐(𝑤𝑖−𝑗, … , 𝑤𝑖−1)

where 𝑐(⋅) are counts in the training data

• let's simulate it with a neural network:

P(𝑤𝑖 | 𝑤𝑖−1, … , 𝑤𝑖−𝑛) ≈ 𝐹(𝑤𝑖−1, … , 𝑤𝑖−𝑛 | 𝜃),

where 𝜃 is a set of trainable parameters.
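An interpolated count-based estimate of this kind can be sketched as follows (plain Python; the toy corpus and λ weights are illustrative):

```python
from collections import Counter

def train_counts(words, max_n):
    """Count all k-grams for k = 1 .. max_n + 1 in the training text."""
    counts = Counter()
    for k in range(1, max_n + 2):
        for i in range(len(words) - k + 1):
            counts[tuple(words[i:i + k])] += 1
    return counts

def interpolated_prob(counts, history, word, lambdas):
    """P(word | history) as a lambda-weighted sum of count ratios
    c(w_{i-j}, ..., w_i) / c(w_{i-j}, ..., w_{i-1}) for growing context sizes j."""
    total = 0.0
    for j, lam in enumerate(lambdas):
        context = tuple(history[len(history) - j:]) if j > 0 else ()
        numerator = counts[context + (word,)]
        # j = 0 is the unigram estimate: normalize by the corpus size.
        denominator = (counts[context] if j > 0
                       else sum(c for k, c in counts.items() if len(k) == 1))
        if denominator:
            total += lam * numerator / denominator
    return total

counts = train_counts("a b a b a".split(), max_n=1)
print(interpolated_prob(counts, ["a"], "b", [0.5, 0.5]))
```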
Simple Neural Language Model

Each of the previous words 𝑤𝑛−3, 𝑤𝑛−2, 𝑤𝑛−1 is encoded as a one-hot vector 1𝑤 and projected by a shared embedding matrix 𝑊𝑒; the projected embeddings pass through a tanh hidden layer (⋅𝑉1, ⋅𝑉2, ⋅𝑉3, + 𝑏ℎ) and a softmax output layer (⋅𝑊 + 𝑏), producing P(𝑤𝑛 | 𝑤𝑛−1, 𝑤𝑛−2, 𝑤𝑛−3).

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3(Feb):1137–1155, 2003. ISSN 1532-4435
Neural LM: Word Representation

• limited vocabulary (hundreds of thousands of words): an indexed set of words
• words are initially represented as one-hot vectors 1𝑤 = (0, … , 0, 1, 0, … , 0)
• the projection 1𝑤 ⋅ 𝑉 corresponds to selecting one row from the matrix 𝑉
• 𝑉 is a table of learned word vector representations, the so-called word embeddings
• dimension typically 100–300

The first hidden layer is then the concatenation of the context words' embeddings:

ℎ1 = 𝑉𝑤𝑖−𝑛 ⊕ 𝑉𝑤𝑖−𝑛+1 ⊕ … ⊕ 𝑉𝑤𝑖−1

The matrix 𝑉 is shared for all words.
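The row-selection view of the projection can be checked directly (plain Python; the toy embedding table is illustrative):

```python
def one_hot(index, size):
    """One-hot vector 1_w with a single 1 at the word's index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(v, M):
    """Row vector times matrix: (v . M)_j = sum_i v_i * M_ij."""
    return [sum(v_i * row[j] for v_i, row in zip(v, M))
            for j in range(len(M[0]))]

# Toy embedding table V: 4 words, embedding dimension 3.
V = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

# Multiplying a one-hot vector by V simply selects the word's row.
print(vec_mat(one_hot(2, 4), V))  # same as V[2]
```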
Neural LM: Next Word Estimation

• optionally add an extra hidden layer:

ℎ2 = 𝑓(ℎ1𝑊1 + 𝑏1)

• last layer: a probability distribution over the vocabulary

𝑦 = softmax(ℎ2𝑊2 + 𝑏2) = exp(ℎ2𝑊2 + 𝑏2) / ∑ exp(ℎ2𝑊2 + 𝑏2)

• training objective: cross-entropy between the true (i.e., one-hot) distribution and the estimated distribution

𝐸 = − ∑𝑖 𝑝true(𝑤𝑖) log 𝑦(𝑤𝑖) = ∑𝑖 − log 𝑦(𝑤𝑖)

• learned by error back-propagation
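The softmax and its cross-entropy against a one-hot target can be sketched as (plain Python; the logit values are illustrative):

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max is a standard
    numerical-stability trick and does not change the result."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """With a one-hot true distribution, E reduces to -log y(w_target)."""
    return -math.log(probs[target_index])

y = softmax([2.0, 1.0, 0.1])
print(sum(y))                # 1.0: a valid probability distribution
print(cross_entropy(y, 0))   # small loss: the target word is the most probable
```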
Learned Representations

• word embeddings from LMs have interesting properties
• they cluster according to POS and meaning similarity

Table taken from Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12(Aug):2493–2537, 2011. ISSN 1533-7928

• in IR: query expansion by nearest neighbors
• in deep learning models: initializing with pre-trained embeddings speeds up training / allows a more complex model with less data
Implementation in PyTorch I

import torch
import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_layer = nn.Linear(3 * embedding_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, word_1, word_2, word_3, target=None):
        embedded_1 = self.embedding(word_1)
        embedded_2 = self.embedding(word_2)
        embedded_3 = self.embedding(word_3)
Implementation in PyTorch II

        hidden = torch.tanh(self.hidden_layer(
            torch.cat([embedded_1, embedded_2, embedded_3], dim=1)))
        logits = self.output_layer(hidden)

        loss = None
        if target is not None:
            loss = self.loss_function(logits, target)

        return logits, loss
Implementation in TensorFlow I

import tensorflow as tf

input_words = [tf.placeholder(tf.int32, shape=[None]) for _ in range(3)]
target_word = tf.placeholder(tf.int32, shape=[None])

embeddings = tf.get_variable("embeddings", shape=[vocab_size, emb_dim], dtype=tf.float32)
embedded_words = tf.concat(
    [tf.nn.embedding_lookup(embeddings, w) for w in input_words], axis=1)

hidden_layer = tf.layers.dense(embedded_words, hidden_size, activation=tf.tanh)
output_layer = tf.layers.dense(hidden_layer, vocab_size, activation=None)
output_probabilities = tf.nn.softmax(output_layer)

loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_word, logits=output_layer))

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)
Implementation in TensorFlow II

session = tf.Session()
session.run(tf.global_variables_initializer())  # initialize variables

Training, given a batch:

_, loss_value = session.run([train_op, loss], feed_dict={
    input_words[0]: ..., input_words[1]: ..., input_words[2]: ...,
    target_word: ...
})

Inference, given a batch:

probs = session.run(output_probabilities, feed_dict={
    input_words[0]: ..., input_words[1]: ..., input_words[2]: ...,
})
Representing Sequences

Recurrent Networks
Recurrent Networks (RNNs)

…the default choice for sequence labeling

• inputs: 𝑥1, … , 𝑥𝑇
• initial state ℎ0: a zero vector, a result of previous computation, or a trainable parameter
• recurrent computation: ℎ𝑡 = 𝐴(ℎ𝑡−1, 𝑥𝑡)
RNN as Imperative Code

def rnn(initial_state, inputs):
    prev_state = initial_state
    for x in inputs:
        new_state, output = rnn_cell(x, prev_state)
        prev_state = new_state
        yield output
RNN as a Fancy Image

(diagram of the RNN cell unrolled over the sequence omitted)
Vanilla RNN
ℎ𝑡 = tanh (𝑊[ℎ𝑡−1; 𝑥𝑡] + 𝑏)
• cannot propagate long-distance relations
• vanishing gradient problem
Vanishing Gradient Problem (1)

tanh 𝑥 = (1 − e^(−2𝑥)) / (1 + e^(−2𝑥))

d tanh 𝑥 / d𝑥 = 1 − tanh² 𝑥 ∈ (0, 1]

(plots of tanh and its derivative omitted)

Weights are initialized ∼ 𝒩(0, 1) to keep gradients further from zero.
Vanishing Gradient Problem (2)

∂𝐸𝑡+1/∂𝑏 = ∂𝐸𝑡+1/∂ℎ𝑡+1 ⋅ ∂ℎ𝑡+1/∂𝑏 (chain rule)
Vanishing Gradient Problem (3)

Write the activation as 𝑧𝑡 = 𝑊ℎℎ𝑡−1 + 𝑊𝑥𝑥𝑡 + 𝑏 and let tanh′ denote the derivative of tanh:

∂ℎ𝑡/∂𝑏 = ∂ tanh(𝑊ℎℎ𝑡−1 + 𝑊𝑥𝑥𝑡 + 𝑏) / ∂𝑏
  = tanh′(𝑧𝑡) ⋅ (∂(𝑊ℎℎ𝑡−1)/∂𝑏 + ∂(𝑊𝑥𝑥𝑡)/∂𝑏 + ∂𝑏/∂𝑏)    (the middle term is 0, the last term is 1)
  = 𝑊ℎ tanh′(𝑧𝑡) ⋅ ∂ℎ𝑡−1/∂𝑏 + tanh′(𝑧𝑡)    (𝑊ℎ ∼ 𝒩(0, 1), tanh′(𝑧𝑡) ∈ (0, 1])
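Unrolling this recurrence, the gradient contribution from 𝑘 steps back is a product of 𝑘 factors 𝑊ℎ tanh′(𝑧𝑡), each typically smaller than one in magnitude. A quick numeric sketch of that product (random illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)

# product of per-step factors W_h * tanh'(z_t) over 50 time steps
grad = 1.0
for t in range(50):
    W_h = rng.normal()           # weight drawn from N(0, 1)
    z_t = rng.normal(scale=2.0)  # hypothetical pre-activation value
    grad *= W_h * (1.0 - np.tanh(z_t) ** 2)  # tanh'(z) = 1 - tanh^2(z)

# the magnitude shrinks roughly geometrically with the distance
print(abs(grad))
```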
Long Short-Term Memory Networks
LSTM = Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667
Control the gradient flow by explicitly gating:
• what to use from the input,
• what to use from the hidden state,
• what to put on the output
LSTM: Hidden State
• two types of hidden state:
  • ℎ𝑡 — “public” hidden state, used as the output
  • 𝑐𝑡 — “private” memory, with no non-linearities on the way
• direct flow of gradients (without multiplying by derivatives ≤ 1)
LSTM: Forget Gate
𝑓𝑡 = 𝜎 (𝑊𝑓[ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑓)
• based on the input and the previous state, decide what to forget from the memory
LSTM: Input Gate
𝑖𝑡 = 𝜎 (𝑊𝑖 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑖)
𝐶̃𝑡 = tanh (𝑊𝐶 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝐶)

• 𝐶̃𝑡 — candidate for what we may want to add to the memory
• 𝑖𝑡 — decides how much of that information we want to store
LSTM: Cell State Update
𝐶𝑡 = 𝑓𝑡 ⊙ 𝐶𝑡−1 + 𝑖𝑡 ⊙ 𝐶̃𝑡
LSTM: Output Gate
𝑜𝑡 = 𝜎 (𝑊𝑜 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑜)
ℎ𝑡 = 𝑜𝑡 ⊙ tanh 𝐶𝑡
Here we are, LSTM!
𝑓𝑡 = 𝜎 (𝑊𝑓 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑓)
𝑖𝑡 = 𝜎 (𝑊𝑖 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑖)
𝑜𝑡 = 𝜎 (𝑊𝑜 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝑜)
𝐶̃𝑡 = tanh (𝑊𝐶 ⋅ [ℎ𝑡−1; 𝑥𝑡] + 𝑏𝐶)
𝐶𝑡 = 𝑓𝑡 ⊙ 𝐶𝑡−1 + 𝑖𝑡 ⊙ 𝐶̃𝑡
ℎ𝑡 = 𝑜𝑡 ⊙ tanh 𝐶𝑡

Question: How would you implement it efficiently?
Compute all gates in a single matrix multiplication.
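To illustrate the answer, a NumPy sketch of one LSTM step in which all four gate pre-activations come out of a single matrix multiplication and are then split (the dimensions and the weight layout are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, C_prev, W, b):
    """One LSTM step; all four gate pre-activations come from a single
    matrix multiplication, then the result is split into f, i, o, C~."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b  # shape (4 * d,)
    f = sigmoid(z[0 * d:1 * d])              # forget gate
    i = sigmoid(z[1 * d:2 * d])              # input gate
    o = sigmoid(z[2 * d:3 * d])              # output gate
    C_tilde = np.tanh(z[3 * d:4 * d])        # candidate memory
    C = f * C_prev + i * C_tilde             # cell state update
    h = o * np.tanh(C)                       # public hidden state
    return h, C

rng = np.random.default_rng(0)
d, e = 4, 3                                  # hidden and input sizes
W = rng.normal(size=(4 * d, d + e))
b = np.zeros(4 * d)
h, C = lstm_cell(rng.normal(size=e), np.zeros(d), np.zeros(d), W, b)
```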
Gated Recurrent Units

update gate: 𝑧𝑡 = 𝜎(𝑥𝑡𝑊𝑧 + ℎ𝑡−1𝑈𝑧 + 𝑏𝑧) ∈ (0, 1)
reset gate: 𝑟𝑡 = 𝜎(𝑥𝑡𝑊𝑟 + ℎ𝑡−1𝑈𝑟 + 𝑏𝑟) ∈ (0, 1)
candidate hidden state: ℎ̃𝑡 = tanh (𝑥𝑡𝑊ℎ + (𝑟𝑡 ⊙ ℎ𝑡−1)𝑈ℎ) ∈ (−1, 1)
hidden state: ℎ𝑡 = (1 − 𝑧𝑡) ⊙ ℎ𝑡−1 + 𝑧𝑡 ⊙ ℎ̃𝑡
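The GRU equations as a one-step NumPy sketch (weight shapes are illustrative assumptions; 𝑥𝑡 and ℎ𝑡−1 are treated as row vectors, as in the notation above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh):
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)         # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur + br)         # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde          # interpolated state

rng = np.random.default_rng(1)
d, e = 4, 3                                        # hidden and input sizes
p = lambda *s: rng.normal(size=s)                  # random parameter helper
h = gru_cell(p(e), np.zeros(d), p(e, d), p(d, d), np.zeros(d),
             p(e, d), p(d, d), np.zeros(d), p(e, d), p(d, d))
```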
LSTM vs. GRU
• GRU is smaller and therefore faster
• performance is similar, task-dependent
• theoretical limitation: a GRU accepts regular languages, while an LSTM can simulate a counter machine

Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ISSN 2331-8422
RNN in PyTorch
rnn = nn.LSTM(input_size=input_dim, hidden_size=512, num_layers=1,
              bidirectional=True, dropout=0.8)
output, (hidden, cell) = rnn(x)
https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
RNN in TensorFlow
inputs = ...   # float tf.Tensor of shape [batch, length, dim]
lengths = ...  # int tf.Tensor of shape [batch]

# Cell objects are templates
fw_cell = tf.nn.rnn_cell.LSTMCell(512, name="fw_cell")
bw_cell = tf.nn.rnn_cell.LSTMCell(512, name="bw_cell")

outputs, states = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, inputs, sequence_length=lengths)
https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn
Bidirectional Networks
• simple trick to improve performance
• run one RNN forward, a second one backward, and concatenate their outputs
Image from: http://colah.github.io/posts/2015-09-NN-Types-FP/
• state of the art in tagging, crucial for neural machine translation
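The trick as a minimal NumPy sketch, reusing a vanilla tanh RNN (a simplification; real implementations use LSTM/GRU cells and handle batching and padding):

```python
import numpy as np

def run_rnn(xs, W, b, h0):
    """Unidirectional vanilla RNN returning all hidden states."""
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]) + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional(xs, W_fw, b_fw, W_bw, b_bw, h0):
    fw = run_rnn(xs, W_fw, b_fw, h0)               # left-to-right states
    bw = run_rnn(xs[::-1], W_bw, b_bw, h0)[::-1]   # right-to-left, re-aligned
    return np.concatenate([fw, bw], axis=-1)       # concatenated per position

rng = np.random.default_rng(2)
d, e, T = 4, 3, 6
W1, W2 = rng.normal(size=(d, d + e)), rng.normal(size=(d, d + e))
out = bidirectional(rng.normal(size=(T, e)), W1, np.zeros(d),
                    W2, np.zeros(d), np.zeros(d))
```

Each position ends up with a 2𝑑-dimensional state that has seen both its left and its right context.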
Representing Sequences
Convolutional Networks
1-D Convolution
≈ sliding window over the sequence
embeddings x = (𝑥1, … , 𝑥𝑁), zero-padded with 𝑥0 = 0 and 𝑥𝑁+1 = 0

ℎ1 = 𝑓 (𝑊[𝑥0; 𝑥1; 𝑥2] + 𝑏)
ℎ𝑖 = 𝑓 (𝑊[𝑥𝑖−1; 𝑥𝑖; 𝑥𝑖+1] + 𝑏)

• pad with 0s if we want to keep the sequence length
1-D Convolution: Pseudocode
xs = ...  # input sequence, shape [batch, length, dim]

kernel_size = 3  # window size
filters = 300    # output dimensions
strides = 1      # step size

W = trained_parameter(xs.shape[2] * kernel_size, filters)
b = trained_parameter(filters)
window = kernel_size // 2

outputs = []
for i in range(window, xs.shape[1] - window):
    h = np.matmul(xs[:, i - window:i + window + 1].reshape(len(xs), -1), W) + b
    outputs.append(h)
outputs = np.array(outputs)
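A runnable, unbatched version of this pseudocode with zero padding, so the output keeps the sequence length (the ReLU non-linearity and the concrete shapes are illustrative choices):

```python
import numpy as np

def conv1d_same(xs, W, b):
    """1-D convolution with kernel size 3 and zero 'same' padding.

    xs: [length, dim], W: [3 * dim, filters], b: [filters]
    """
    dim = xs.shape[1]
    padded = np.concatenate([np.zeros((1, dim)), xs, np.zeros((1, dim))])
    outputs = []
    for i in range(1, padded.shape[0] - 1):
        window = padded[i - 1:i + 2].reshape(-1)     # [x_{i-1}; x_i; x_{i+1}]
        outputs.append(np.maximum(0.0, window @ W + b))  # ReLU activation
    return np.array(outputs)

rng = np.random.default_rng(3)
xs = rng.normal(size=(5, 4))       # sequence length 5, embedding dim 4
W = rng.normal(size=(3 * 4, 300))  # kernel_size=3, filters=300
h = conv1d_same(xs, W, np.zeros(300))
```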
1-D Convolution: Frameworks
TensorFlow
h = tf.layers.conv1d(x, filters=300, kernel_size=3,
                     strides=1, padding='same')
https://www.tensorflow.org/api_docs/python/tf/layers/conv1d
PyTorch
conv = nn.Conv1d(in_channels, out_channels=300, kernel_size=3, stride=1,
                 padding=0, dilation=1, groups=1, bias=True)
h = conv(x)
https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d
Rectified Linear Units
ReLU(𝑥) = max(0, 𝑥); its derivative is 1 for 𝑥 > 0 and 0 for 𝑥 < 0 (plots omitted)

ReLU is faster and suffers less from the vanishing gradient.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference onMachine Learning, pages 807–814, Haifa, Israel, June 2010. JMLR.org
Residual Connections
embeddings x = (𝑥1, … , 𝑥𝑁), zero-padded (diagram with per-position additions ⊕ omitted)

ℎ𝑖 = 𝑓 (𝑊[𝑥𝑖−1; 𝑥𝑖; 𝑥𝑖+1] + 𝑏) + 𝑥𝑖

Allows training deeper networks.

Why do you think it helps?
Better gradient flow – the same as in RNNs.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and PatternRecognition (CVPR), pages 770–778, Las Vegas, NV, USA, June 2016. IEEE Computer Society. ISBN 9781467388511
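The residual idea in a few lines of NumPy (the tanh layer stands in for any transformation 𝑓; shapes are illustrative):

```python
import numpy as np

def layer(x, W, b):
    return np.tanh(W @ x + b)   # any per-position transformation f

def residual_layer(x, W, b):
    return layer(x, W, b) + x   # output = f(x) + x

rng = np.random.default_rng(4)
d = 8
x = rng.normal(size=d)
W, b = rng.normal(size=(d, d)), np.zeros(d)

# the gradient of the output w.r.t. x gains an identity term from "+ x",
# so deep stacks still pass gradients through even when f saturates
y = residual_layer(x, W, b)
```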
Residual Connections: Numerical Stability
Residual connections are numerically unstable; we need the activations to stay on a similar scale ⇒ layer normalization. The activation before the non-linearity is normalized:

𝑎̂𝑖 = (𝑔𝑖 / 𝜎) (𝑎𝑖 − 𝜇)

…where 𝑔 is a trainable parameter and 𝜇, 𝜎 are estimated from the data:

𝜇 = (1/𝐻) ∑𝑖=1…𝐻 𝑎𝑖        𝜎 = √( (1/𝐻) ∑𝑖=1…𝐻 (𝑎𝑖 − 𝜇)² )
Lei Jimmy Ba, Ryan Kiros, and Geoffrey E Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. ISSN 2331-8422
Deep Learning for Natural Language Processing 44/90
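A minimal NumPy sketch of the normalization above (the small eps is an assumption for numerical safety, not on the slide):

```python
import numpy as np

def layer_norm(a, g, eps=1e-6):
    """Normalize a hidden vector a (H,) to zero mean and unit variance,
    then rescale by the trainable gain g."""
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean())
    return (g / (sigma + eps)) * (a - mu)

a = np.array([1.0, 2.0, 3.0, 4.0])
normed = layer_norm(a, g=1.0)
# normed has (approximately) zero mean and unit standard deviation
```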
![Page 104: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/104.jpg)
Receptive Field
embeddings x = (𝑥1, … , 𝑥𝑁), zero-padded: 𝑥0 = 0, 𝑥𝑁+1 = 0
The receptive field can be enlarged by dilated convolutions.
Deep Learning for Natural Language Processing 45/90
![Page 106: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/106.jpg)
Convolutional architectures
+
• extremely computationally efficient

–
• limited context
• by default not aware of 𝑛-gram order

• max-pooling over the hidden states = element-wise maximum over the sequence
• can be understood as an ∃ operator over the feature extractors
Deep Learning for Natural Language Processing 46/90
![Page 108: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/108.jpg)
Representing Sequences
Self-attentive Networks
![Page 109: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/109.jpg)
Self-attentive Networks
• In some layers: states are linear combinations of the previous layer's states
• Originally introduced in the Transformer model for machine translation

• similarity matrix between all pairs of states
• 𝑂(𝑛²) memory, 𝑂(1) time (when parallelized)
• next layer: sum by rows
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc
Deep Learning for Natural Language Processing 47/90
![Page 112: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/112.jpg)
Multi-head scaled dot-product attention
Single-head setup

Attn(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾⊤ / √𝑑) 𝑉

ℎ(𝑖+1) = softmax(ℎ(𝑖)ℎ(𝑖)⊤ / √𝑑) ℎ(𝑖)

Multi-head setup

Multihead(𝑄, 𝑉) = (𝐻1 ⊕ ⋯ ⊕ 𝐻ℎ) 𝑊𝑂
𝐻𝑖 = Attn(𝑄𝑊𝑖𝑄, 𝑉𝑊𝑖𝐾, 𝑉𝑊𝑖𝑉)

(diagram: queries and keys & values pass through per-head linear projections, are split into heads, each head runs scaled dot-product attention, and the results are concatenated)
Deep Learning for Natural Language Processing 48/90
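The two setups can be sketched in a few lines of NumPy (shapes and weight layout are illustrative assumptions, not the original implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multihead(Q, V, W_Q, W_K, W_V, W_O):
    """Each head projects the queries Q and the key/value sequence V,
    attends, and the concatenated heads are mixed by W_O.
    W_Q/W_K/W_V are per-head projection lists (d_model -> d_head)."""
    heads = [attn(Q @ wq, V @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 8, 4, 2
Q = rng.normal(size=(3, d_model))   # 3 query positions
V = rng.normal(size=(5, d_model))   # 5 key/value positions
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
out = multihead(Q, V, W_Q, W_K, W_V, W_O)   # shape (3, d_model)
```

Note that the same sequence 𝑉 provides both keys and values, as on the slide.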
![Page 115: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/115.jpg)
Dot-Product Attention in PyTorch
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    p_attn = F.softmax(scores, dim=-1)  # (masking omitted here)
    return torch.matmul(p_attn, value), p_attn
Deep Learning for Natural Language Processing 49/90
![Page 116: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/116.jpg)
Dot-Product Attention in TensorFlow
import tensorflow as tf

def scaled_dot_product(self, queries, keys, values):
    dim = queries.shape[-1]  # key dimension (left undefined on the slide)
    o1 = tf.matmul(queries, keys, transpose_b=True)
    o2 = o1 / (dim ** 0.5)
    o3 = tf.nn.softmax(o2)
    return tf.matmul(o3, values)
Deep Learning for Natural Language Processing 50/90
![Page 117: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/117.jpg)
Position Encoding
The model is otherwise not aware of positions in the sequence, so a position encoding is added to the embeddings:

pos(𝑖) = sin(𝑡 / 10^(4𝑖/𝑑))        if 𝑖 mod 2 = 0
pos(𝑖) = cos(𝑡 / 10^(4(𝑖−1)/𝑑))    otherwise

(𝑡 is the position in the sequence, 𝑑 the model dimension)
Deep Learning for Natural Language Processing 51/90
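The sinusoidal encoding above can be computed directly (a NumPy sketch following the slide's formula):

```python
import numpy as np

def position_encoding(t, d):
    """Sinusoidal position encoding for position t and model dimension d:
    sin(t / 10^(4i/d)) on even dimensions i, cos with i-1 on odd ones."""
    pe = np.empty(d)
    for i in range(d):
        if i % 2 == 0:
            pe[i] = np.sin(t / 10 ** (4 * i / d))
        else:
            pe[i] = np.cos(t / 10 ** (4 * (i - 1) / d))
    return pe

pe0 = position_encoding(0, 8)
# at t = 0: even dimensions are sin(0) = 0, odd dimensions are cos(0) = 1
```

Each dimension oscillates at a different wavelength, so every position gets a distinct, bounded vector that the model can add to the token embeddings.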
![Page 118: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/118.jpg)
Stacking self-attentive Layers
(diagram: 𝑁× stacked blocks; input embeddings ⊕ position encoding feed a self-attentive sub-layer — multi-head attention over keys & values and queries, residual ⊕, layer normalization — followed by a feed-forward sub-layer — non-linear layer, linear layer, residual ⊕, layer normalization)
• several layers (original paper 6)
• each layer: 2 sub-layers: self-attention andfeed-forward layer
• everything inter-connected with residualconnections
Deep Learning for Natural Language Processing 52/90
![Page 121: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/121.jpg)
Architectures Comparison
                 computation      sequential operations   memory
Recurrent        𝑂(𝑛 ⋅ 𝑑²)        𝑂(𝑛)                    𝑂(𝑛 ⋅ 𝑑)
Convolutional    𝑂(𝑘 ⋅ 𝑛 ⋅ 𝑑²)    𝑂(1)                    𝑂(𝑛 ⋅ 𝑑)
Self-attentive   𝑂(𝑛² ⋅ 𝑑)        𝑂(1)                    𝑂(𝑛² ⋅ 𝑑)
𝑑 model dimension, 𝑛 sequence length, 𝑘 convolutional kernel
Deep Learning for Natural Language Processing 53/90
![Page 122: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/122.jpg)
Classification and Labeling
![Page 124: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/124.jpg)
Sequence Classification
• tasks like sentiment analysis, genre classification
• need to get one vector from sequence → average or max pooling• optionally hidden layers, at the end softmax for probability distribution over classes
Deep Learning for Natural Language Processing 55/90
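The recipe above — pool the sequence into one vector, then softmax over classes — fits in a few lines (a NumPy sketch with illustrative shapes; the hidden layers are omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_sequence(states, W, b, pooling="mean"):
    """Pool a sequence of hidden states (N, d) into one vector by average
    or max pooling, then apply a softmax output layer
    (W: (d, n_classes), b: (n_classes,))."""
    pooled = states.mean(axis=0) if pooling == "mean" else states.max(axis=0)
    return softmax(pooled @ W + b)

rng = np.random.default_rng(1)
states = rng.normal(size=(6, 4))          # e.g. encoder states for 6 tokens
W, b = rng.normal(size=(4, 3)), np.zeros(3)
probs = classify_sequence(states, W, b)   # distribution over 3 classes
```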
![Page 127: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/127.jpg)
Softmax & Cross-Entropy
Output layer with softmax (with parameters 𝑊 , 𝑏):
𝑃𝑦 = softmax(x⊤𝑊 + 𝑏),   P(𝑦 = 𝑗 ∣ x) = exp(x⊤𝑊 + 𝑏)𝑗 / ∑𝑘 exp(x⊤𝑊 + 𝑏)𝑘

Network error = cross-entropy between the estimated distribution and the one-hot ground-truth distribution 𝑇 = 1(𝑦∗):

𝐿(𝑃𝑦, 𝑦∗) = 𝐻(𝑇, 𝑃) = −𝔼𝑖∼𝑇 log 𝑃(𝑖) = −∑𝑖 𝑇(𝑖) log 𝑃(𝑖) = −log 𝑃(𝑦∗)
Deep Learning for Natural Language Processing 56/90
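In code, the one-hot structure of 𝑇 means the cross-entropy collapses to a single term, −log 𝑃(𝑦∗) (a NumPy sketch):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(probs, y_true):
    """Cross-entropy against a one-hot ground truth: -log P(y*)."""
    return -np.log(probs[y_true])

logits = np.array([2.0, 1.0, 0.1])
P = softmax(logits)
loss = cross_entropy(P, 0)
# identical to the full sum -sum_i T(i) log P(i) with T one-hot at class 0
```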
![Page 129: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/129.jpg)
Derivative of Cross-Entropy
Let 𝒍 = x⊤𝑊 + 𝑏 be the logits, with 𝑙𝑦∗ the one corresponding to the correct class.

∂𝐿(𝑃𝑦, 𝑦∗)/∂𝒍 = −∂/∂𝒍 log( exp 𝑙𝑦∗ / ∑𝑗 exp 𝑙𝑗 )
             = −∂/∂𝒍 ( 𝑙𝑦∗ − log ∑𝑗 exp 𝑙𝑗 )
             = −1𝑦∗ + exp 𝒍 / ∑𝑗 exp 𝑙𝑗
             = 𝑃𝑦 − 1𝑦∗

…so the update direction is −∂𝐿/∂𝒍 = 1𝑦∗ − 𝑃𝑦.
Interpretation: Reinforce the correct logit, suppress the rest.
Deep Learning for Natural Language Processing 57/90
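A quick finite-difference sanity check of this result (a NumPy sketch, not from the lecture): the gradient of the loss with respect to the logits is 𝑃𝑦 − 1𝑦∗, so the update direction 1𝑦∗ − 𝑃𝑦 reinforces the correct logit.

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

def loss(l, y):
    return -np.log(softmax(l)[y])

l, y = np.array([0.5, -1.2, 2.0]), 2

# analytic gradient from the derivation: dL/dl = P_y - 1_{y*}
analytic = softmax(l) - np.eye(3)[y]

# numerical gradient by central finite differences
eps = 1e-6
numeric = np.array([
    (loss(l + eps * np.eye(3)[k], y) - loss(l - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
# analytic and numeric gradients agree
```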
![Page 130: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/130.jpg)
Sequence Labeling
• assign value / probability distribution to every token in a sequence
• morphological tagging, named-entity recognition, LM with unlimited history, answerspan selection
• every state is classified independently with a classifier
• during training, errors backpropagate from all classifiers
Lab next time: i/y spelling as sequence labeling
Deep Learning for Natural Language Processing 58/90
![Page 134: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/134.jpg)
Generating Sequences
![Page 135: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/135.jpg)
Sequence-to-sequence Learning
• target sequence is of different length than source
• non-trivial (= not monotonic) correspondence of source and target• tasks like: machine translation, text summarization, image captioning
Deep Learning for Natural Language Processing 59/90
![Page 138: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/138.jpg)
Neural Language Model
(diagram: at each step, the one-hot vector of the input symbol goes through an embedding lookup into an RNN cell (possibly more layers); the RNN state ℎ𝑡 is softmax-normalized into a distribution for the next symbol: <s> → 𝑃(𝑤1|<s>), 𝑤1 → 𝑃(𝑤2|…), 𝑤2 → 𝑃(𝑤3|…), …)
• estimate probability of a sentence using the chain rule
• output distributions can be used for sampling
Deep Learning for Natural Language Processing 60/90
![Page 140: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/140.jpg)
Sampling from a LM
(diagram: starting from <s>, each step embeds the previous output symbol, updates the RNN state, applies softmax, and takes the argmax as the next input symbol)
when conditioned on input → autoregressive decoder
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27,pages 3104–3112, Montreal, Canada, December 2014. Curran Associates, Inc
Deep Learning for Natural Language Processing 61/90
![Page 142: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/142.jpg)
Autoregressive Decoding: Pseudocode
last_w = "<s>"
while last_w != "</s>":
    last_w_embedding = target_embeddings[last_w]
    state, dec_output = dec_cell(state, last_w_embedding)
    logits = output_projection(dec_output)
    last_w = np.argmax(logits)
    yield last_w
Deep Learning for Natural Language Processing 62/90
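To make the pseudocode concrete, here is a runnable toy version with stand-ins for the model pieces (the vocabulary, weights, and cell below are all hypothetical; a length limit is added since an untrained model may never emit </s>):

```python
import numpy as np

# Toy stand-ins: a 3-word vocabulary, random embeddings, a trivial
# tanh "RNN cell" and a linear output projection.
vocab = ["<s>", "hello", "</s>"]
rng = np.random.default_rng(0)
target_embeddings = {w: rng.normal(size=4) for w in vocab}
W_cell = rng.normal(size=(8, 4))
W_out = rng.normal(size=(4, 3))

def dec_cell(state, emb):
    new_state = np.tanh(np.concatenate([state, emb]) @ W_cell)
    return new_state, new_state

def output_projection(dec_output):
    return dec_output @ W_out

def greedy_decode(max_len=10):
    state, last_w = np.zeros(4), "<s>"
    while last_w != "</s>" and max_len > 0:
        state, dec_output = dec_cell(state, target_embeddings[last_w])
        logits = output_projection(dec_output)
        last_w = vocab[int(np.argmax(logits))]   # greedy choice
        max_len -= 1
        yield last_w

decoded = list(greedy_decode())
```

The structure — feed the previous output back in, take the argmax — is exactly the autoregressive loop from the slide.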
![Page 143: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/143.jpg)
Architectures in the Decoder
• RNN – original sequence-to-sequence learning (2015)
• principle known since 2014 (University of Montreal)
• made usable in 2016 (University of Edinburgh)
• CNN – convolutional sequence-to-sequence by Facebook (2017)
• Self-attention (the so-called Transformer) by Google (2017)
More on the topic in the MT class.
Deep Learning for Natural Language Processing 63/90
![Page 148: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/148.jpg)
Implementation: Runtime vs. training
runtime: ỹ𝑗 (decoded) × training: 𝑦𝑗 (ground truth)

(diagram: at runtime the decoder feeds its own outputs ỹ1 … ỹ5 back in, starting from <s>; at training time the ground-truth symbols 𝑦1 … 𝑦4 are fed in and the loss is computed against them)
Deep Learning for Natural Language Processing 64/90
![Page 149: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/149.jpg)
Attention Model
(diagram: encoder states ℎ0 … ℎ4 over the input <s>, 𝑥1 … 𝑥4 are weighted by attention weights 𝛼0 … 𝛼4 and summed into a context vector that feeds the decoder states 𝑠𝑖−1, 𝑠𝑖, 𝑠𝑖+1 producing outputs ỹ𝑖, ỹ𝑖+1)
Deep Learning for Natural Language Processing 65/90
![Page 150: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/150.jpg)
Attention Model in Equations (1)

Inputs: decoder state 𝑠𝑖, encoder states ℎ𝑗 = [ℎ⃗𝑗; ℎ⃖𝑗] (forward and backward states), ∀𝑗 = 1 … 𝑇𝑥

Attention energies:

𝑒𝑖𝑗 = 𝑣𝑎⊤ tanh(𝑊𝑎𝑠𝑖−1 + 𝑈𝑎ℎ𝑗 + 𝑏𝑎)

Attention distribution:

𝛼𝑖𝑗 = exp(𝑒𝑖𝑗) / ∑𝑘=1…𝑇𝑥 exp(𝑒𝑖𝑘)

Context vector:

𝑐𝑖 = ∑𝑗=1…𝑇𝑥 𝛼𝑖𝑗ℎ𝑗
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.ISSN 2331-8422
Deep Learning for Natural Language Processing 66/90
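The three equations map directly onto a few lines of NumPy (a sketch of Bahdanau-style additive attention; parameter shapes are illustrative):

```python
import numpy as np

def bahdanau_attention(s_prev, H, W_a, U_a, b_a, v_a):
    """Additive attention as in the equations above.
    s_prev: previous decoder state (d_s,); H: encoder states (T_x, d_h)."""
    e = np.tanh(s_prev @ W_a + H @ U_a + b_a) @ v_a   # energies e_ij, (T_x,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                              # attention distribution
    c = alpha @ H                                     # context vector, (d_h,)
    return alpha, c

rng = np.random.default_rng(2)
d_s, d_h, d_a, T_x = 4, 6, 5, 7
s_prev = rng.normal(size=d_s)
H = rng.normal(size=(T_x, d_h))
W_a, U_a = rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a))
b_a, v_a = np.zeros(d_a), rng.normal(size=d_a)
alpha, c = bahdanau_attention(s_prev, H, W_a, U_a, b_a, v_a)
```

The distribution 𝛼𝑖 sums to one over the 𝑇𝑥 encoder positions, and the context 𝑐𝑖 is their 𝛼-weighted average.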
![Page 154: Deep Learning for Natural Language Processingufal.mff.cuni.cz/.../slides/lect13-deep-learning-nlp.pdf · 2020-04-15 · Deep Learning for Natural Language Processing Jindřich Helcl](https://reader034.vdocuments.mx/reader034/viewer/2022052518/5f07eb8c7e708231d41f6bce/html5/thumbnails/154.jpg)
Attention Model in Equations (2)

Output projection:

$$t_i = \mathrm{MLP}\left(U_o s_{i-1} + V_o E y_{i-1} + C_o c_i + b_o\right)$$

…attention is mixed with the hidden state

Output distribution:

$$p(y_i = k \mid s_i, y_{i-1}, c_i) \propto \exp\left((W_o t_i)_k + b_k\right)$$
Deep Learning for Natural Language Processing 67/90
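The output projection and distribution can be sketched as follows; a single tanh layer stands in for the MLP here, which is an assumption of this sketch, and all names are illustrative:

```python
import numpy as np

def output_distribution(s_prev, y_prev_emb, c, U_o, V_o, C_o, b_o, W_o, b):
    """Output distribution of the attention decoder.

    t_i = MLP(U_o s_{i-1} + V_o E y_{i-1} + C_o c_i + b_o); a single tanh
    layer replaces the MLP in this sketch. Then
    p(y_i = k | ...) is proportional to exp((W_o t_i)_k + b_k).
    """
    t = np.tanh(U_o @ s_prev + V_o @ y_prev_emb + C_o @ c + b_o)
    logits = W_o @ t + b
    probs = np.exp(logits - logits.max())   # softmax over the vocabulary
    return probs / probs.sum()
```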
Transformer Decoder

[Diagram: N× stacked decoder blocks over input embeddings ⊕ position encoding. Each block: a self-attentive sublayer (multi-head attention over the decoder history, residual ⊕ layer normalization), a cross-attention sublayer (multi-head attention with keys & values from the encoder and queries from the decoder, residual ⊕ layer normalization), and a feed-forward sublayer (non-linear layer, linear layer, residual ⊕ layer normalization). A final linear layer + softmax yields output symbol probabilities.]

• similar to the encoder, with an additional layer attending to the encoder
• in every step, self-attention over the complete history ⇒ O(n²) complexity

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 6000–6010, Long Beach, CA, USA, December 2017. Curran Associates, Inc.
Deep Learning for Natural Language Processing 68/90
Transformer Decoder: Non-autoregressive Training

[Diagram: attention mask matrix with queries q_1 … q_N on one axis and values v_1 … v_M on the other; entries above the diagonal are set to −∞.]

• analogous to the encoder
• the target is known at training time: no need to wait until it is generated
• self-attention can be parallelized via matrix multiplication
• attending to the future is prevented by a mask

Question 1: What if the mask matrix were diagonal?
Question 2: What would such a matrix look like for a convolutional architecture?
Deep Learning for Natural Language Processing 69/90
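The masked, parallelized self-attention described in the bullets above can be sketched in NumPy. This is a minimal illustration (single head, no learned projections), with made-up names:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product self-attention with a causal (future-blocking) mask.

    Q, K, V : (N, d) matrices of queries, keys and values. Position i may
    only attend to positions j <= i; future entries are set to -inf.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (N, N) attention energies
    # Mask out the strict upper triangle (the "future") before the softmax
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because the first position can attend only to itself, the first output row is exactly the first value vector, which is an easy sanity check on the mask.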
Pre-training Representations
Pre-training Representations
Neural Networks Basics
Representing Words
Representing Sequences
  Recurrent Networks
  Convolutional Networks
  Self-attentive Networks
Classification and Labeling
Generating Sequences
Pre-training Representations
  Word2Vec
  ELMo
  BERT
Deep Learning for Natural Language Processing 70/90
Pre-trained Representations
• representations that emerge in models seem to carry a lot of information about the language
• representations pre-trained on large data can be re-used on tasks with smaller training data
Deep Learning for Natural Language Processing 71/90
Pre-training Representations
Word2Vec
Word2Vec

• a way to learn word embeddings without training a complete LM

[Diagram: CBOW predicts the middle word w_3 from the summed context words w_1, w_2, w_4, w_5; Skip-gram predicts the context words w_1, w_2, w_4, w_5 from the middle word w_3.]

• CBOW: minimize cross-entropy of the middle word of a sliding window
• skip-gram: minimize cross-entropy of a bag of words around a word (a LM the other way round)

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics.
Deep Learning for Natural Language Processing 72/90
Word2Vec: Sampling

1. All human beings are born free and equal in dignity … → (All, human), (All, beings)
2. All human beings are born free and equal in dignity … → (human, All), (human, beings), (human, are)
3. All human beings are born free and equal in dignity … → (beings, All), (beings, human), (beings, are), (beings, born)
4. All human beings are born free and equal in dignity … → (are, human), (are, beings), (are, born), (are, free)
Deep Learning for Natural Language Processing 73/90
Word2Vec: Formulas

• Training objective:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j \sim (-c, c)} \log p(w_{t+j} \mid w_t)$$

• Probability estimation:

$$p(w_O \mid w_I) = \frac{\exp\left(V'^\top_{w_O} V_{w_I}\right)}{\sum_{w} \exp\left(V'^\top_{w} V_{w_I}\right)}$$

where $V$ is the input (embedding) matrix and $V'$ the output matrix.

Equations 1, 2. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics.
Deep Learning for Natural Language Processing 74/90
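The probability estimation above is a softmax over dot products of input and output embeddings. A minimal sketch (names and shapes are made up):

```python
import numpy as np

def word2vec_prob(V, V_out, w_I, w_O):
    """Full-softmax estimate of p(w_O | w_I).

    V      : (vocab, d) input/embedding matrix
    V_out  : (vocab, d) output matrix V'
    w_I, w_O : integer word indices
    """
    scores = V_out @ V[w_I]                    # V'^T_w V_{w_I} for every w
    scores = scores - scores.max()             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[w_O]
```

The `scores.sum()` in the denominator runs over the entire vocabulary, which is exactly the cost that negative sampling (next slide) avoids.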
Word2Vec: Training Using Negative Sampling

The summation in the denominator is slow; use noise contrastive estimation:

$$\log \sigma\left(V'^\top_{w_O} V_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-V'^\top_{w_i} V_{w_I}\right)\right]$$

Main idea: independently classify, by logistic regression, the positive example and a few sampled negative examples.

Equations 1, 3. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics.
Deep Learning for Natural Language Processing 75/90
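The negative-sampling objective above, for one training pair and a set of already-sampled noise words, can be sketched as (illustrative names, not the original implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(v_in, v_out_pos, v_out_negs):
    """Negative-sampling objective for one (input, context) pair.

    v_in       : input embedding V_{w_I}, shape (d,)
    v_out_pos  : output embedding V'_{w_O} of the observed context word
    v_out_negs : output embeddings of k sampled noise words, shape (k, d)
    """
    # Positive example: push the observed pair's score up
    positive = np.log(sigmoid(v_out_pos @ v_in))
    # Each noise word is pushed away from the input word independently
    negative = np.log(sigmoid(-(v_out_negs @ v_in))).sum()
    return positive + negative  # maximized during training
```

Only k + 1 dot products are needed per example, instead of a sum over the whole vocabulary.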
Word2Vec: Vector Arithmetics

[Figure: 2D projections of word embeddings showing parallel offset vectors between the pairs man–woman, uncle–aunt, king–queen, and kings–queens.]

Image originally from Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, GA, USA, June 2013. Association for Computational Linguistics.
Deep Learning for Natural Language Processing 76/90
Few More Notes on Embeddings

• many methods for pre-trained word embeddings (most popular: GloVe)
• embeddings capturing character-level properties
• multilingual embeddings
Deep Learning for Natural Language Processing 77/90
Training models

FastText – a Word2Vec model implementation by Facebook
https://github.com/facebookresearch/fastText

./fasttext skipgram -input data.txt -output model
Deep Learning for Natural Language Processing 78/90
Pre-training Representations
ELMo
What is ELMo?

• a pre-trained large language model
• “nothing special” – combines all known tricks, trained on extremely large data
• improves almost all NLP tasks
• published in June 2018

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, LA, USA, June 2018. Association for Computational Linguistics.
Deep Learning for Natural Language Processing 79/90
ELMo Architecture: Input

[Diagram: character embeddings of size 16 → 1D convolution to 2,048 dimensions + max-pooling → 2× highway layer (2,048 dimensions) → linear projection to 512 dimensions. Convolution filters by window size: 1→32, 2→32, 3→64, 4→128, 5→256, 6→512, 7→1024.]

• input is tokenized, treated on the character level
• 2,048 n-gram filters + max-pooling (∼ a soft search for learned n-grams)
• 2 highway layers:

$$g_{l+1} = \sigma\left(W_g h_l + b_g\right)$$
$$h_{l+1} = (1 - g_{l+1}) \odot h_l + g_{l+1} \odot \mathrm{ReLU}\left(W h_l + b\right)$$

which contain gates that control whether a projection is needed
Deep Learning for Natural Language Processing 80/90
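The highway-layer equations above can be sketched directly in NumPy; when the gate is (near) zero, the input passes through unchanged. An illustrative sketch with made-up names:

```python
import numpy as np

def highway_layer(h, W_g, b_g, W, b):
    """Highway layer: a gate decides between projecting and copying the input.

    g = sigmoid(W_g h + b_g); output = (1 - g) * h + g * relu(W h + b)
    """
    g = 1.0 / (1.0 + np.exp(-(W_g @ h + b_g)))   # transform gate in (0, 1)
    proj = np.maximum(0.0, W @ h + b)            # ReLU projection
    return (1.0 - g) * h + g * proj
```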
ELMo Architecture: Language Models

• token representations are input to 2 language models: forward and backward
• both LMs have 2 layers with 4,096 dimensions, with layer normalization and residual connections
• the output classifier is shared (only used in training, does not have to be good)

Learned layer combination for downstream tasks:

$$\mathrm{ELMo}^{\mathrm{task}}_k = \gamma^{\mathrm{task}} \sum_{L} s^{\mathrm{task}}_L \, h^{(L)}_k$$

with $\gamma^{\mathrm{task}}$ and $s^{\mathrm{task}}_L$ trainable parameters.
Deep Learning for Natural Language Processing 81/90
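The learned layer combination above can be sketched as follows. Softmax-normalizing the layer weights s is how the ELMo paper defines them; the function name and shapes here are made up for illustration:

```python
import numpy as np

def elmo_combination(layer_states, s_task, gamma_task):
    """Task-specific combination of LM layer states for one token.

    layer_states : (L, d) hidden states h_k^(l), one row per LM layer
    s_task       : (L,) trainable layer weights, softmax-normalized
    gamma_task   : trainable scalar scale
    """
    weights = np.exp(s_task - s_task.max())
    weights /= weights.sum()                  # softmax over layers
    return gamma_task * (weights @ layer_states)
```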
Tasks where ELMo Helps

Answer Span Selection: find an answer to a question in an unstructured text.

Semantic Role Labeling: detect who did what to whom in sentences.

Natural Language Inference: decide whether two sentences are in agreement, contradict each other, or have nothing to do with each other.

Named Entity Recognition: detect and classify names of people, locations, organizations, numbers with units, email addresses, URLs, phone numbers, …

Coreference Resolution: detect what entities pronouns refer to.

Semantic Similarity: measure how similar in meaning two sentences are. (Think of clustering similar questions on StackOverflow or detecting plagiarism.)
Deep Learning for Natural Language Processing 82/90
Improvements by ELMo
Deep Learning for Natural Language Processing 83/90
How to use it
• implemented in the AllenNLP framework (uses PyTorch)
• pre-trained English models available
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = ...
weight_file = ...

elmo = Elmo(options_file, weight_file, 2, dropout=0)

sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)
https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
Pre-training Representations
BERT
What is BERT
• another way of pre-training sentence representations
• uses the Transformer architecture and a slightly different training objective
• even better than ELMo
• done by Google, published in November 2018
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. ISSN 2331-8422
Architecture Comparison
Masked Language Model
All human beings are born free and equal in dignity and rights

1. Randomly sample a word → free
2. With 80% chance, replace it with the special MASK token.
3. With 10% chance, replace it with a random token → hairy
4. With 10% chance, keep it as is → free

A classifier should then predict the missing/replaced word free.
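The 80/10/10 rule can be sketched as follows. The vocabulary, the sentence, and the helper `mask_token` are hypothetical toy choices for illustration, not BERT's actual implementation (which works over WordPiece tokens and masks 15% of positions per sequence).

```python
import random

def mask_token(token, vocab, rng):
    """Apply the 80/10/10 rule to one sampled token.
    Returns (input_token, target); the target is always the original token."""
    r = rng.random()
    if r < 0.8:                      # 80%: replace with the special MASK token
        return "[MASK]", token
    elif r < 0.9:                    # 10%: replace with a random token
        return rng.choice(vocab), token
    else:                            # 10%: keep the token as is
        return token, token

rng = random.Random(0)
vocab = ["hairy", "free", "born", "equal"]
sentence = "All human beings are born free and equal".split()
pos = rng.randrange(len(sentence))   # 1. randomly sample a position
inp, target = mask_token(sentence[pos], vocab, rng)
# The classifier sees `inp` at position `pos` and must predict `target`.
```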
Additional Objective: Next Sentence Prediction
• trained in the multi-task learning setup
• secondary objective: next sentence prediction
• decide for a pair of sentences whether they are consecutive in the text
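A minimal sketch of how such training pairs might be constructed: with probability 0.5 the second sentence really follows the first (label IsNext), otherwise it is replaced by a random sentence (label NotNext). The helper `make_nsp_pairs` and the toy document are assumptions; BERT samples negative sentences from other documents rather than from the same one as here.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build next-sentence-prediction pairs from one toy document."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # positive pair: the true next sentence
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # negative pair: a randomly sampled sentence
            j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], "NotNext"))
    return pairs

rng = random.Random(0)
doc = ["First sentence .", "Second sentence .", "Third sentence .", "Fourth sentence ."]
pairs = make_nsp_pairs(doc, rng)
```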
Performance of BERT
Tables 1 and 2. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. ISSN 2331-8422
Deep Learning for Natural Language Processing
Summary
1. Discrete symbols → continuous representations with trained embeddings
2. Architectures to get a suitable representation: recurrent, convolutional, self-attentive
3. Output: classification, sequence labeling, autoregressive decoding
4. Representations pre-trained on large data help on downstream tasks
http://ufal.cz/courses/npfl124