Deep Learning for Natural Language Processing
The Transformer model
Richard Johansson
drawbacks of recurrent models
▶ even with GRUs and LSTMs, it is difficult for RNNs to preserve information over long distances
▶ we introduced attention as a way to deal with this problem
▶ can we skip the RNN and just use attention?
attention models: recap
▶ first, compute an "energy" $e_i$ for each state $h_i$
▶ for the attention weights, we apply the softmax:
    $\alpha_i = \frac{\exp e_i}{\sum_{j=1}^{n} \exp e_j}$
▶ finally, the "summary" is computed as a weighted sum:
    $s = \sum_{i=1}^{n} \alpha_i h_i$
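To make the recap concrete, here is a minimal NumPy sketch of this computation. The slide leaves the energy function unspecified; the dot product with a hypothetical query vector q used below is an assumption for illustration only.

```python
import numpy as np

def attention_summary(H, q):
    """Attention recap: energies -> softmax weights -> weighted sum.

    H: array of shape (n, d), the states h_1 ... h_n
    q: array of shape (d,), a query vector (assumed energy function
       e_i = h_i . q; the slide does not fix a particular choice)
    """
    e = H @ q                        # energies e_i
    alpha = np.exp(e - e.max())      # softmax, numerically stable
    alpha = alpha / alpha.sum()      # attention weights alpha_i
    s = alpha @ H                    # summary: sum_i alpha_i h_i
    return s, alpha

# usage: 5 states of dimension 4
H = np.random.randn(5, 4)
q = np.random.randn(4)
s, alpha = attention_summary(H, q)
print(alpha.sum())  # ≈ 1.0
```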
the Transformer
▶ the Transformer (Vaswani et al., 2017) is an architecture that uses attention for information flow: "Attention is all you need"
▶ it was originally designed for machine translation and has two parts:
    ▶ an encoder that "summarizes" an input sentence
    ▶ a decoder (a conditional LM) that generates an output, based on the input
▶ let's consider the encoder
multi-head attention
▶ in each layer, the Transformer applies several attention models ("heads") in parallel
▶ intuitively, the heads are "looking" for different types of information
▶ each attention head computes a scaled dot-product attention (see the sketch below):
    $e_{ij} = \frac{1}{\sqrt{d}}\, q_i \cdot k_j$
    $\alpha_i = \mathrm{softmax}(e_i)$
  where $q_i$ and $k_j$ are linear transformations of the input at positions $i$ and $j$
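As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention applied by several heads in parallel. The head count, dimensions, and random projection matrices are placeholder assumptions, not the exact parameterization of Vaswani et al. (2017).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """e_ij = (1/sqrt(d)) q_i . k_j, softmax over j, then weighted sum of values."""
    d = Q.shape[-1]
    e = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (heads, n, n) energies
    alpha = softmax(e, axis=-1)                 # attention weights per query position
    return alpha @ V                            # (heads, n, d_head)

def multi_head_attention(X, num_heads=4, d_head=16, seed=0):
    """Each head applies its own linear maps to get queries, keys, values.

    X: (n, d_model). The projection matrices here are random placeholders;
    in a real model they are learned parameters.
    """
    rng = np.random.default_rng(seed)
    n, d_model = X.shape
    Wq = rng.standard_normal((num_heads, d_model, d_head))
    Wk = rng.standard_normal((num_heads, d_model, d_head))
    Wv = rng.standard_normal((num_heads, d_model, d_head))
    Wo = rng.standard_normal((num_heads * d_head, d_model))

    Q = np.einsum('nd,hdk->hnk', X, Wq)              # queries, one set per head
    K = np.einsum('nd,hdk->hnk', X, Wk)              # keys
    V = np.einsum('nd,hdk->hnk', X, Wv)              # values
    heads = scaled_dot_product_attention(Q, K, V)    # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, -1) # concatenate the heads
    return concat @ Wo                               # project back to d_model

# usage: a "sentence" of 6 positions with model dimension 32
X = np.random.randn(6, 32)
print(multi_head_attention(X).shape)  # (6, 32)
```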
a layer in the Transformer encoder
▶ after each application of multi-head attention, a 2-layer feedforward model (with ReLU activation) is applied
▶ residual connections ("shortcuts") and layer normalization (Ba et al., 2016) are added for robustness and to facilitate training
▶ the Transformer encoder consists of a stack of this type of block (see the sketch below)
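Putting the pieces together, here is a minimal NumPy sketch of one encoder block: multi-head attention, then a 2-layer ReLU feedforward net, each wrapped with a residual connection and layer normalization. It assumes the multi_head_attention function sketched above is in scope and uses random placeholder weights, so it illustrates the data flow rather than a trained model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization (Ba et al., 2016): normalize each position's features."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """2-layer feedforward network with ReLU, applied at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_block(X, d_ff=64, seed=0):
    """One Transformer encoder block: attention and feedforward sub-layers,
    each followed by a residual connection and layer normalization.

    Weights are random placeholders; in a real model they are learned.
    Relies on the multi_head_attention sketch above.
    """
    rng = np.random.default_rng(seed)
    n, d_model = X.shape
    W1 = rng.standard_normal((d_model, d_ff))
    b1 = np.zeros(d_ff)
    W2 = rng.standard_normal((d_ff, d_model))
    b2 = np.zeros(d_model)

    # attention sub-layer with residual connection ("shortcut") + layer norm
    h = layer_norm(X + multi_head_attention(X))
    # feedforward sub-layer with residual connection + layer norm
    return layer_norm(h + feed_forward(h, W1, b1, W2, b2))

# usage: stack a few blocks, as in the Transformer encoder
X = np.random.randn(6, 32)
for i in range(3):
    X = encoder_block(X, seed=i)
print(X.shape)  # (6, 32)
```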
the road ahead
▶ the full Transformer is an effective model for machine translation
▶ we'll return to it when we discuss encoder–decoder architectures
▶ for now, let's use it simply as a pre-trained representation
references
J. Ba, J. Kiros, and G. Hinton. 2016. Layer normalization. arXiv:1607.06450.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In NIPS 30.
J. Vig. 2019. Visualizing attention in transformer-based language representation models. arXiv:1904.02679.