Attention Is All You Need (Vaswani et al. 2017)
• Popularized self-attention
• Created the general-purpose Transformer architecture for sequence modeling
• Demonstrated computational savings over recurrent and convolutional models
Transformers: High-Level
• Sequence-to-sequence model with encoder and decoder
[Diagram: encoder and decoder blocks]
Attention as Representations
• Attention generally used to score existing encoder representations
• Why not use them as representations?
[Example: "This movie rocks !"]
Self-Attention
• Every element sees itself in its context
• Attention weight corresponds to an "importance" signal
[Diagram: each token in "This movie rocks !" attends to every token in the sentence, including itself]
Self-Attention: Formalized
• Score the energy between a query Q and key K → scalar
• Use softmaxed energy to take a weighted average of the value V → scalar * vector
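A minimal NumPy sketch of this score-then-average step (scaled dot-product attention; the dimensions and random vectors below are illustrative stand-ins for learned representations):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score each query against every key, softmax the scores,
    and take the weighted average of the values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_queries, n_keys) energies
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

# Toy self-attention: 4 tokens ("This movie rocks !"), model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # stand-in token representations
out, attn = scaled_dot_product_attention(X, X, X)    # self-attention: Q = K = V = X
print(attn.round(2))                                 # each row sums to 1
```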
Self-Attention: Example
• Score “she” (Q) against “Susan” and “the” (both K and V) in “Susan dropped the plate. She is clumsy”
[Diagram: "she" attends to "the" with weight 0.3 and to "Susan" with weight 0.7]
Masked Self-Attention
• Modeling temporality requires enforcing causal relationships
• Mask out illegal connections in self-attention map
[Diagram: masked self-attention map over "she went to the store"; each token attends only to itself and earlier tokens]
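A small NumPy sketch of applying such a causal mask before the softmax (the tokens come from the slide's example; the scores are random stand-ins for real energies):

```python
import numpy as np

tokens = ["she", "went", "to", "the", "store"]
n = len(tokens)

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                 # raw attention energies (illustrative)

# A token may not attend to future positions: mask the upper triangle.
illegal = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(illegal, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # masked positions get zero weight
print(weights.round(2))                          # lower-triangular attention map
```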
Multi-Head Self-Attention
• Problem: Self-attention is just a weighted average; how do we model complex relationships?
[Diagram: a single self-attention head (one set of Q, K, V) over "We went to the store at 7pm"]
Multi-Head Self-Attention
• Solution: Use multiple self-attention heads!
[Diagram: three parallel self-attention heads, each with its own Q, K, V]
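A hedged sketch of running several heads in parallel, each with its own Q, K, V projections (head count, dimensions, and random weights are illustrative, not the paper's hyperparameters):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_self_attention(X, n_heads, rng):
    """Project X into per-head Q, K, V, attend in each head, then concatenate and mix."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Random matrices stand in for the learned projection weights.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    W_o = rng.normal(size=(d_model, d_model))        # output projection
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))                         # 7 tokens: "We went to the store at 7pm"
print(multi_head_self_attention(X, n_heads=4, rng=rng).shape)   # (7, 16)
```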
Position-wise FFNN
• A feed-forward network mixes the multi-head self-attention outputs by operating on each position independently
[Diagram: for each position ("we", "went"), the outputs of Head 1, Head 2, and Head 3 are concatenated and passed through the same FFNN to produce a hidden vector; the result is sequence length x hidden dim]
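A minimal sketch of this per-position feed-forward step (a two-layer ReLU MLP applied identically at every position; sizes and random weights are illustrative):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer ReLU network to every position independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # (seq_len, d_ff)
    return hidden @ W2 + b2                 # (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64          # illustrative sizes
X = rng.normal(size=(seq_len, d_model))     # output of multi-head self-attention
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (5, 16): one vector per position
```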
Positional Embeddings
• No convolutions or recurrence; use sinusoids to inject positional information into the model
• Embedding is a function of position and dimension
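A sketch of the sinusoidal encoding from the paper, where each (position, dimension) pair gets a fixed sine or cosine value (max_len and d_model below are illustrative):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i / d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]               # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_embeddings(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): added to the token embeddings before the first layer
```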
Transformer: Full Model
Vaswani et al. (2017)
Results: MT
Vaswani et al. (2017)
Results: Constituency Parsing
Vaswani et al. (2017)
Why Transformers?
• Self-attention is flexible
• Highly modular and extensible
• Demonstrated empirical performance
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)
• Deep bidirectional Transformer architecture for (masked) language modeling
• Advances SOTA on 11 NLP tasks including GLUE, MNLI, and SQuAD
Background: ELMo
• Revitalized research in pretraining: creating unsupervised tasks from large unlabeled corpora (e.g., word2vec)
[Diagram: ELMo over "this movie rocks !": character CNNs produce word embeddings, which pass through forward and backward LSTMs to yield contextual embeddings]
BERT
• Deeply bidirectional, as opposed to ELMo (only a shallow concatenation of LMs)
• Introduces two pretraining tasks:
  • Masked Language Modeling
  • Next Sentence Prediction
Pretraining: Masked Language Modeling
• Problem: bidirectional language modeling not possible as each token “sees” itself in context
• Solution: introduce a cloze-style task where the model tries to predict the missing word ([MASK])
[Diagram: inputs "[MASK] went [MASK] the store"; outputs recover the original "we went to the store"]
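A toy sketch of building one cloze-style training pair (word-level masking and a fixed number of masked positions are simplifications of the paper's actual recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "we went to the store".split()

# Hide a couple of tokens; the model must recover the originals from context.
masked_positions = rng.choice(len(tokens), size=2, replace=False)
inputs = ["[MASK]" if i in masked_positions else tok for i, tok in enumerate(tokens)]
targets = {int(i): tokens[i] for i in masked_positions}

print(inputs)    # e.g. ['[MASK]', 'went', '[MASK]', 'the', 'store']
print(targets)   # e.g. {0: 'we', 2: 'to'}
```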
Pretraining: Next Sentence Prediction
• To learn inter-sentential relationships, determine whether sentence B follows sentence A; negative pairs are created by randomly sampling sentence B
[Examples: sentence A "I went to the store at 7pm." followed by sentence B "The store had lots of fruit!" (positive); sentence A "Selena Gomez is an American singer." with randomly sampled sentence B "Variational autoencoders are cool." (negative)]
BERT: Inputs
Devlin et al. (2019)
BERT: Pretraining
[Diagram: "[CLS] Segment A [SEP] Segment B [SEP]" fed into BERT; the NSP prediction is made from [CLS], and masked tokens are predicted at each [MASK] position]
Sentence representations are stored in [CLS]
Bidirectional representations are used to predict [MASK]
BERT: Fine-Tuning
[Diagram: "[CLS] Premise [SEP] Hypothesis [SEP]" fed into BERT; the features pass through an MLP to produce class probabilities, e.g. Entailment 0.8, Contradiction 0.05, Neutral 0.15]
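A hedged sketch of such a fine-tuning head: the [CLS] features pass through a small softmax classifier (the random weights and 768-dimensional vector below are illustrative stand-ins for BERT-base's learned parameters):

```python
import numpy as np

def classify_from_cls(cls_vector, W, b):
    """Map the [CLS] representation to probabilities over entailment / contradiction / neutral."""
    logits = cls_vector @ W + b
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
cls_vector = rng.normal(size=768)             # stand-in for BERT-base's [CLS] output
W, b = rng.normal(size=(768, 3)), np.zeros(3)
print(classify_from_cls(cls_vector, W, b).round(2))   # three probabilities summing to 1
```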
Results: GLUE
Devlin et al. (2019)
Results: SQuAD
Devlin et al. (2019)
Ablation: Pretraining Tasks
• No NSP: BERT trained without next sentence prediction
• LTR & No NSP: regular left-to-right LM without next sentence prediction
Ablation: Model Size
• Increasing model capacity consistently improves performance; also consistent with later work (e.g., GPT-2, RoBERTa, etc.)
2019: The Year of Pretraining
GPT-2 XLM XLNet RoBERTa
ELECTRA ALBERT T5 BART