Attention Is All You Need (Vaswani et al. 2017)
• Popularized self-attention
• Created the general-purpose Transformer architecture for sequence modeling
• Demonstrated computational savings over recurrent and convolutional models
Transformers: High-Level
• Sequence-to-sequence model with encoder and decoder
[Diagram: encoder and decoder blocks]
Attention as Representations
• Attention generally used to score existing encoder representations
• Why not use them as representations?
[Example: "This movie rocks !"]
Self-Attention
• Every element sees itself in its context
• Attention weight corresponds to an "importance" signal
[Diagram: each token in "This movie rocks !" attends to every token in the sentence, including itself]
Self-Attention: Formalized
• Score the energy between a query Q and key K → scalar
• Use softmaxed energy to take a weighted average of the value V → scalar * vector
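A minimal NumPy sketch of this score-then-average step (scaled dot-product attention; the dimensions and random vectors below are illustrative stand-ins for learned representations):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score each query against every key, softmax the scores,
    and take the weighted average of the values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_queries, n_keys) energies
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

# Toy self-attention: 4 tokens ("This movie rocks !"), model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # stand-in token representations
out, attn = scaled_dot_product_attention(X, X, X)    # self-attention: Q = K = V = X
print(attn.round(2))                                 # each row sums to 1
```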
Self-Attention: Example
• Score “she” (Q) against “Susan” and “the” (both K and V) in “Susan dropped the plate. She is clumsy”
[Diagram: "she" attends to "the" with weight 0.3 and to "Susan" with weight 0.7]
Masked Self-Attention
• Modeling temporality requires enforcing causal relationships
• Mask out illegal connections in self-attention map
[Diagram: masked self-attention map over "she went to the store"; each token attends only to itself and earlier tokens]
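A small NumPy sketch of applying such a causal mask before the softmax (the tokens come from the slide's example; the scores are random stand-ins for real energies):

```python
import numpy as np

tokens = ["she", "went", "to", "the", "store"]
n = len(tokens)

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # raw attention energies (illustrative)

# A token may not attend to future positions: mask the upper triangle.
illegal = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(illegal, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # masked positions get zero weight
print(weights.round(2))                          # lower-triangular attention map
```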
Multi-Head Self-Attention
• Problem: Self-attention is just a weighted average; how do we model complex relationships?
[Diagram: a single self-attention head (one set of Q, K, V) over "We went to the store at 7pm"]
Multi-Head Self-Attention
• Solution: Use multiple self-attention heads!
[Diagram: three parallel self-attention heads, each with its own Q, K, V]
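A hedged sketch of running several heads in parallel, each with its own Q, K, V projections (head count, dimensions, and random weights are illustrative, not the paper's hyperparameters):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_self_attention(X, n_heads, rng):
    """Project X into per-head Q, K, V, attend in each head, then concatenate and mix."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Random matrices stand in for the learned projection weights.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    W_o = rng.normal(size=(d_model, d_model))        # output projection
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))                         # 7 tokens: "We went to the store at 7pm"
print(multi_head_self_attention(X, n_heads=4, rng=rng).shape)   # (7, 16)
```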
Position-wise FFNN
• A feed-forward network mixes the multi-head self-attention outputs by operating on each position independently
[Diagram: for each position ("we", "went"), the outputs of Head 1, Head 2, and Head 3 are concatenated and passed through the same FFNN to produce a hidden vector; the result is sequence length x hidden dim]
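A minimal sketch of this per-position feed-forward step (a two-layer ReLU MLP applied identically at every position; sizes and random weights are illustrative):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer ReLU network to every position independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # (seq_len, d_ff)
    return hidden @ W2 + b2                 # (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64          # illustrative sizes
X = rng.normal(size=(seq_len, d_model))     # output of multi-head self-attention
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (5, 16): one vector per position
```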
Positional Embeddings
• No convolutions or recurrence; use sinusoids to inject positional information into the model
• Embedding is a function of position and dimension
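A sketch of the sinusoidal encoding from the paper, where each (position, dimension) pair gets a fixed sine or cosine value (max_len and d_model below are illustrative):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i / d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]               # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_embeddings(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): added to the token embeddings before the first layer
```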
Transformer: Full Model
Vaswani et al. (2017)
Results: MT
Vaswani et al. (2017)
Results: Constituency Parsing
Vaswani et al. (2017)
Why Transformers?
• Self-attention is flexible
• Highly modular and extensible
• Demonstrated empirical performance
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)
• Deep bidirectional Transformer architecture for (masked) language modeling
• Advances SOTA on 11 NLP tasks including GLUE, MNLI, and SQuAD
Background: ELMo
• Revitalized research in pretraining: creating unsupervised tasks from large unlabeled corpora (e.g., word2vec)
[Diagram: ELMo over "this movie rocks !": character CNNs produce word embeddings, which pass through forward and backward LSTMs to yield contextual embeddings]
BERT
• Deeply bidirectional, as opposed to ELMo (only a shallow concatenation of LMs)
• Introduces two pretraining tasks:
  • Masked Language Modeling
  • Next Sentence Prediction
Pretraining: Masked Language Modeling
• Problem: bidirectional language modeling not possible as each token “sees” itself in context
• Solution: introduce a cloze-style task where the model tries to predict the missing word ([MASK])
[Diagram: inputs "[MASK] went [MASK] the store"; outputs recover the original "we went to the store"]
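A toy sketch of building one cloze-style training pair (word-level masking and a fixed number of masked positions are simplifications of the paper's actual recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "we went to the store".split()

# Hide a couple of tokens; the model must recover the originals from context.
masked_positions = rng.choice(len(tokens), size=2, replace=False)
inputs = ["[MASK]" if i in masked_positions else tok for i, tok in enumerate(tokens)]
targets = {int(i): tokens[i] for i in masked_positions}

print(inputs)    # e.g. ['[MASK]', 'went', '[MASK]', 'the', 'store']
print(targets)   # e.g. {0: 'we', 2: 'to'}
```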
Pretraining: Next Sentence Prediction
• To learn inter-sentential relationships, determine whether sentence B follows sentence A; negative pairs are created by randomly sampling sentence B
[Examples: sentence A "I went to the store at 7pm." followed by sentence B "The store had lots of fruit!" (positive); sentence A "Selena Gomez is an American singer." with randomly sampled sentence B "Variational autoencoders are cool." (negative)]
BERT: Inputs
Devlin et al. (2019)
BERT: Pretraining
[Diagram: "[CLS] Segment A [SEP] Segment B [SEP]" fed into BERT; the NSP prediction is made from [CLS], and masked tokens are predicted at each [MASK] position]
Sentence representations are stored in [CLS]
Bidirectional representations are used to predict [MASK]
BERT: Fine-Tuning
[Diagram: "[CLS] Premise [SEP] Hypothesis [SEP]" fed into BERT; the features pass through an MLP to produce class probabilities, e.g. Entailment 0.8, Contradiction 0.05, Neutral 0.15]
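A hedged sketch of such a fine-tuning head: the [CLS] features pass through a small softmax classifier (the random weights and 768-dimensional vector below are illustrative stand-ins for BERT-base's learned parameters):

```python
import numpy as np

def classify_from_cls(cls_vector, W, b):
    """Map the [CLS] representation to probabilities over entailment / contradiction / neutral."""
    logits = cls_vector @ W + b
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
cls_vector = rng.normal(size=768)             # stand-in for BERT-base's [CLS] output
W, b = rng.normal(size=(768, 3)), np.zeros(3)
print(classify_from_cls(cls_vector, W, b).round(2))   # three probabilities summing to 1
```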
Results: GLUE
Devlin et al. (2019)
Results: SQuAD
Devlin et al. (2019)
Ablation: Pretraining Tasks
• No NSP: BERT trained without next sentence prediction
• LTR & No NSP: regular left-to-right LM without next sentence prediction
Ablation: Model Size
• Increasing model capacity consistently improves performance; also consistent with later work (e.g., GPT-2, RoBERTa, etc.)
2019: The Year of Pretraining
GPT-2 XLM XLNet RoBERTa
ELECTRA ALBERT T5 BART