Transformer
miulab/s108-adl/doc/200407_Transformer.pdf
Sequence Encoding: Basic Attention
Representations of Variable Length Data
◉ Input: word sequence, image pixels, audio signal, click logs
◉ Properties: continuity, temporal structure, importance distribution
◉ Example
✓ Basic combination: average, sum
✓ Neural combination: network architectures should consider input domain properties
− CNN (convolutional neural network)
− RNN (recurrent neural network): temporal information
Recurrent Neural Networks
◉ Learning variable-length representations
✓ Fit for sentences and sequences of values
◉ Sequential computation makes parallelization difficult
◉ No explicit modeling of long- and short-range dependencies
[Figure: two stacked RNN layers processing the tokens "have", "a", "nice" one step at a time.]
Convolutional Neural Networks
◉ Easy to parallelize
◉ Exploit local dependencies
✓ Long-distance dependencies require many layers
[Figure: a convolution layer plus position-wise FFNN over the tokens "have", "a", "nice"; each filter covers only a local window.]
Attention
◉ Encoder-decoder model is important in NMT
◉ RNNs need an attention mechanism to handle long-range dependencies
◉ Attention allows us to access any state
[Figure: an RNN encoder reads 深度學習 ("deep learning") into hidden states ℎ1–ℎ4; the final state ℎ4, which carries information about the whole sentence, is fed to the RNN decoder, which outputs "deep", "learning", <END>.]
Machine Translation with Attention
[Figure: attention for MT — the decoder state 𝑧0 (query) is matched against each encoder hidden state ℎ1–ℎ4 (keys); the match scores 𝛼01–𝛼04 are normalized with a softmax (e.g., 0.5, 0.5, 0.0, 0.0) and used to take a weighted sum of the encoder states (values), giving the context vector 𝑐0.]
Dot-Product Attention
◉ Input: a query 𝑞 and a set of key-value (𝑘-𝑣) pairs, mapped to an output
◉ Output: weighted sum of values
✓ Query 𝑞 is a 𝑑𝑘-dim vector
✓ Key 𝑘 is a 𝑑𝑘-dim vector
✓ Value 𝑣 is a 𝑑𝑣-dim vector
$$A(q, K, V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp(q \cdot k_j)}\, v_i$$

(the weight on each value is the inner product of the query and the corresponding key, normalized by a softmax)
Dot-Product Attention in Matrix
◉ Input: multiple queries stacked into a matrix 𝑄, and a set of key-value (𝐾-𝑉) pairs
◉ Output: a set of weighted sums of values, one per query
$$A(Q, K, V) = \mathrm{softmax}(QK^\top)\, V$$

(the softmax is applied row-wise)
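As a concrete illustration (not from the slides), a minimal PyTorch sketch of this matrix form; the function name and shapes are for exposition only:

```python
import torch

def dot_product_attention(Q, K, V):
    """Unscaled dot-product attention in matrix form.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    Returns: (n_queries, d_v) — one weighted sum of values per query.
    """
    scores = Q @ K.transpose(-2, -1)         # (n_queries, n_keys): q · k for every pair
    weights = torch.softmax(scores, dim=-1)  # softmax applied row-wise over the keys
    return weights @ V                       # each output row is a weighted sum of values
```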
Sequence Encoding: Self-Attention
Attention (Recap)
◉ Idea: use attention to replace recurrent architectures
Self-Attention
◉ Constant “path length” between two positions
◉ Easy to parallelize
[Figure: self-attention over "have", "a", "nice", "day" — every position is compared against all positions, the resulting weights combine the values, and a position-wise FFNN is applied on top.]
Transformer Idea
[Figure: Transformer idea — stacks of self-attention + feed-forward layers form both the encoder and the decoder; the decoder additionally uses encoder-decoder attention over the encoder outputs, and a final softmax produces the output distribution.]
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Encoder Self-Attention (Vaswani+, 2017)
[Figure: encoder self-attention — the encoder states are projected into queries, keys, and values (MatMul Q, MatMul K, MatMul V); query-key dot products go through a softmax, and the resulting weights combine the values before the position-wise FFNN.]
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Decoder Self-Attention (Vaswani+, 2017)
[Figure: decoder self-attention — the same query/key/value projections as in the encoder, but each position may only attend to earlier positions, keeping decoding auto-regressive.]
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
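The mask is not spelled out in the extracted slide text; as a hedged sketch, future positions are typically removed from the score matrix before the softmax, e.g.:

```python
import torch

def causal_attention_weights(scores):
    """Turn raw decoder self-attention scores into weights where position i
    can only attend to positions <= i (the auto-regressive mask).

    scores: (seq_len, seq_len) query-by-key score matrix.
    """
    seq_len = scores.size(-1)
    # True above the diagonal marks "future" keys that must be hidden.
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))  # softmax gives these weight 0
    return torch.softmax(scores, dim=-1)
```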
Sequence Encoding: Multi-Head Attention
Convolutions
[Figure: a convolution centered on "kicked" in "I kicked the ball" — different relative positions in the window capture "who", "did what", and "to whom".]
Self-Attention
[Figure: a single self-attention distribution for the query "kicked" over "I kicked the ball" — one weighted average over all positions.]
Attention Head: who
[Figure: an attention head for "kicked" that focuses on "I" — the "who".]
Attention Head: did what
[Figure: an attention head for "kicked" that captures the "did what".]
Attention Head: to whom
[Figure: an attention head for "kicked" that focuses on "the ball" — the "to whom".]
Multi-Head Attention
[Figure: multiple attention heads for "kicked" operating in parallel, jointly capturing "who", "did what", and "to whom".]
Comparison
◉ Convolution: different linear transformations by relative positions
◉ Attention: a weighted average
◉ Multi-Head Attention: parallel attention layers with different linear transformations on the input/output
[Figure: side-by-side view of convolution, single-head attention, and multi-head attention over "I kicked the ball".]
Sequence Encoding: Transformer
Transformer Overview
◉ Non-recurrent encoder-decoder for MT
◉ PyTorch explanation by Sasha Rush
http://nlp.seas.harvard.edu/2018/04/03/attention.html
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Transformer Overview (cont.)
◉ The architecture uses multi-head attention in three places: self-attention in the encoder, masked self-attention in the decoder, and encoder-decoder attention over the encoder outputs
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Multi-Head Attention
◉ Idea: allow words to interact with one another
◉ Model
− Map 𝑉, 𝐾, 𝑄 to lower-dimensional spaces
− Apply attention in each space, then concatenate the outputs
− Apply a final linear transformation
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
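A minimal PyTorch sketch of these three steps (a hand-rolled illustration rather than the paper's reference code; dimensions follow the base configuration of 𝑑model = 512 with 8 heads):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head attention: project Q/K/V into h lower-dimensional heads,
    attend in each head, concatenate, then apply a final linear layer."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        # query/key/value: (batch, seq_len, d_model)
        batch = query.size(0)

        def split_heads(x, proj):
            # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
            return proj(x).view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)

        # Scaled dot-product attention within each head
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                                   # (batch, heads, seq_len, d_k)

        # Concatenate the heads and apply the output projection
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_k)
        return self.w_o(concat)
```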
Scaled Dot-Product Attention
◉ Problem: when 𝑑𝑘 gets large, the variance of 𝑞ᵀ𝑘 increases
→ some values inside the softmax get large
→ the softmax becomes very peaked
→ hence its gradient gets smaller
◉ Solution: scale the scores by the length of the query/key vectors, i.e., divide by √𝑑𝑘
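The resulting scaled dot-product attention from the cited paper:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$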
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Transformer Overview (cont.)
◉ Each encoder layer pairs a multi-head attention sub-layer with a feed-forward sub-layer, each followed by Add & Norm
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Transformer Encoder Block
◉ Each block has
− multi-head attention
− 2-layer feed-forward NN (w/ ReLU)
◉ Both parts contain
− Residual connection
− Layer normalization (LayerNorm)
Change the input to have mean 0 and variance 1, per layer and per training point
→ each sub-layer computes LayerNorm(x + sublayer(x))
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
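Putting the pieces together, a minimal PyTorch sketch of one encoder block (assuming a recent PyTorch with nn.MultiheadAttention; hyper-parameters are the paper's base defaults):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention plus a
    2-layer position-wise feed-forward NN (w/ ReLU), each sub-layer wrapped
    with a residual connection and LayerNorm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)           # LayerNorm(x + sublayer(x))
        x = self.norm2(x + self.ffn(x))
        return x
```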
Encoder Input
◉ Problem: temporal information is missing
◉ Solution: positional encoding allows words at different locations to have different embeddings with fixed dimensions
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
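A minimal sketch of the sinusoidal positional encoding from Vaswani et al. (2017), which is added to the input embeddings (an illustrative implementation, not code from the slides):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)           # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                     # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # (max_len, d_model), added to the word embeddings
```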
Multi-Head Attention Details
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-中文版-ef2ddf8597a4
Training Tips
◉ Byte-pair encodings (BPE)
◉ Checkpoint averaging
◉ Adam optimizer with a varying learning rate (warm-up, then decay; see the sketch below)
◉ Dropout during training at every layer just before adding residual
◉ Label smoothing
◉ Auto-regressive decoding with beam search and length penalties
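As an illustration of the learning-rate changes, the schedule from Vaswani et al. (2017) warms up linearly for a fixed number of steps and then decays with the inverse square root of the step number; a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given optimizer step: linear warm-up followed by
    inverse-square-root decay, scaled by d_model ** -0.5."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```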
MT Experiments
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Parsing Experiments
Vaswani et al., “Attention Is All You Need”, in NIPS, 2017.
Concluding Remarks
◉ Non-recurrent models are easy to parallelize
◉ Multi-head attention captures different aspects by letting words interact with one another
◉ Positional encoding captures location information
◉ Each Transformer block can be applied to diverse tasks