
Attention Is All You Need (Vaswani et al. 2017)

• Popularized self-attention

• Created the general-purpose Transformer architecture for sequence modeling

• Demonstrated computational savings over recurrent and convolutional sequence models

Transformers: High-Level

• Sequence-to-sequence model with encoder and decoder

[Figure: encoder stack feeding into decoder stack]

Attention as Representations

• Attention generally used to score existing encoder representations

• Why not use them as representations?

[Figure: example sentence "This movie rocks !"]

Self-Attention

• Every element sees itself in its context

• Attention weight corresponds to an “important” signal

[Figure: self-attention connections between every pair of tokens in "This movie rocks !"]

Self-Attention: Formalized

• Score the energy between a query Q and key K → scalar

• Use softmax-ed energy to take a weighted average of value V → scalar * vector
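
Put together, these two steps are the paper's scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal PyTorch sketch (batch layout, shapes, and names are illustrative assumptions, not the authors' code):

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (batch, seq_len, d_k) tensors. Returns attended values and weights."""
    d_k = Q.size(-1)
    # Energy: compatibility of every query with every key -> (batch, seq, seq)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax over the keys turns energies into attention weights
    weights = torch.softmax(scores, dim=-1)
    # Weighted average of the values -> (batch, seq, d_k)
    return weights @ V, weights
```

For the example on the next slide, the weights row for "she" would put 0.7 on "Susan" and 0.3 on "the".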

Self-Attention: Example

• Score “she” (Q) against “Susan” and “the” (both K and V) in “Susan dropped the plate. She is clumsy”

[Figure: attention weights for the query "she": 0.7 on "Susan" and 0.3 on "the"]

Masked Self-Attention

• Modeling temporality requires enforcing causal relationships

• Mask out illegal connections in the self-attention map (see the sketch after the figure)

[Figure: masked self-attention over "she went to the store": each token attends only to itself and earlier tokens]
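
A sketch of the causal masking step, reusing the attention shape from above; the paper does this by setting illegal scores to -inf before the softmax, so those positions get exactly zero weight:

```python
import torch

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i may only attend to positions <= i."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    t = scores.size(-1)
    # True above the diagonal marks "future" (illegal) connections
    future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=scores.device),
                        diagonal=1)
    # -inf scores become 0 after the softmax
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V
```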

Multi-Head Self-Attention

• Problem: Self-attention is just a weighted average; how do we model complex relationships?

[Figure: a single self-attention head (Q, K, V) applied to "We went to the store at 7pm"]

Multi-Head Self-Attention

• Solution: Use multiple self-attention heads! (sketch after the figure)

[Figure: three self-attention heads, each with its own Q, K, V projections, run in parallel]
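
A compact sketch of multi-head self-attention, assuming the paper's base setting of d_model = 512 and 8 heads; the single fused Q/K/V projection is an implementation convenience, equivalent to separate per-head projections:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Each head attends in its own subspace; the heads are concatenated
    and mixed by a final linear projection."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)      # mixes the concatenated heads

    def forward(self, x):                           # x: (batch, seq, d_model)
        b, t, _ = x.shape
        qkv = self.qkv(x).view(b, t, 3, self.n_heads, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (batch, heads, seq, d_head)
        weights = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = (weights @ v).transpose(1, 2).reshape(b, t, -1)  # concatenate the heads
        return self.out(ctx)
```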

Position-wise FFNN

• A feed-forward network mixes the concatenated multi-head outputs, operating on each position independently (see the sketch below)

[Figure: the per-position outputs of heads 1-3 for "we" and "went" are concatenated and passed through the same FFNN at each position, giving a hidden output of shape sequence length × hidden dim]
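
A sketch of the position-wise feed-forward network; d_model = 512, d_ff = 2048, and the ReLU activation are the paper's base settings:

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):          # x: (batch, seq_len, d_model)
        # The same weights are applied at each of the seq_len positions
        return self.net(x)
```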

Positional Embeddings

• No convolutions or recurrence; use sinusoids to inject positional information into the model

• Embedding is a function of position and dimension
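
The paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A small sketch that builds the table (assumes an even d_model):

```python
import torch

def sinusoidal_positions(seq_len, d_model):
    """Returns a (seq_len, d_model) table of sinusoidal position embeddings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 0, 2, 4, ...
    angle = pos / (10000 ** (two_i / d_model))                      # (seq, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions
    return pe
```

The table is simply added to the token embeddings before the first layer.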

Transformer: Full Model

Vaswani et al. (2017)

Results: MT

Vaswani et al. (2017)

Results: Constituency Parsing

Vaswani et al. (2017)

Why Transformers?

• Self-attention is flexible

• Highly modular and extensible

• Demonstrated empirical performance

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)

• Deep bidirectional Transformer architecture for (masked) language modeling

• Advances SOTA on 11 NLP tasks including GLUE, MNLI, and SQuAD

Background: ELMo

• Revitalized research in pretraining: creating unsupervised tasks from large unlabeled corpora (e.g., word2vec)

[Figure: ELMo on "this movie rocks !": character CNNs produce word embeddings, which feed a forward LSTM and a backward LSTM; their states are combined into contextual embeddings]

BERT

• Deeply bidirectional, as opposed to ELMo (only a shallow concatenation of LMs)

• Introduces two pretraining tasks:

• Masked Language Modeling

• Next Sentence Prediction

Pretraining: Masked Language Modeling

• Problem: bidirectional language modeling not possible as each token “sees” itself in context

• Solution: introduce a cloze-style task where the model tries to predict the missing word ([MASK]); see the sketch after the figure

[Figure: input "[MASK] went [MASK] the store"; the model must output the original sentence "we went to the store"]
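
A toy sketch of the corruption step, using a 15% masking rate as in BERT; BERT additionally keeps the original token or substitutes a random one part of the time, which is omitted here:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace tokens with [MASK]; the originals become the prediction targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)      # the model must recover this word
        else:
            inputs.append(tok)
            targets.append(None)     # position is not scored
    return inputs, targets

# One possible outcome for "we went to the store":
#   inputs  = ['[MASK]', 'went', '[MASK]', 'the', 'store']
#   targets = ['we', None, 'to', None, None]
```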

Pretraining: Next Sentence Prediction

• To learn inter-sentential relationships, determine if sentence B follows sentence A; negative pairs are built by randomly sampling sentence B (see the sketch after the examples)

[Figure: Sentence A "I went to the store at 7pm." with Sentence B "The store had lots of fruit!" (true next sentence); Sentence A "Selena Gomez is an American singer." with Sentence B "Variational autoencoders are cool." (randomly sampled)]
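
A sketch of how such training pairs can be built, with hypothetical argument names; half the pairs use the true next sentence, half a randomly sampled one:

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Returns (sentence_a, sentence_b, is_next) for next sentence prediction."""
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return sentence_a, doc_sentences[i + 1], 1          # true next sentence
    return sentence_a, random.choice(corpus_sentences), 0   # randomly sampled sentence
```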

BERT: Inputs

Devlin et al. (2019)

BERT: Pretraining

[Figure: input "[CLS] Segment A [SEP] Segment B [SEP]" fed to BERT; the [CLS] output feeds the NSP prediction and the [MASK] positions feed masked-word prediction]

• Sentence representations are stored in [CLS]

• Bidirectional representations are used to predict [MASK]

BERT: Fine-Tuning

[Figure: input "[CLS] Premise [SEP] Hypothesis [SEP]" fed to BERT; the [CLS] features pass through an MLP producing probabilities of 0.8 for Entailment, 0.05 for Contradiction, and 0.15 for Neutral]
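
A sketch of the fine-tuning head in the figure: a classifier on top of the [CLS] representation. `bert_encoder` is a placeholder for any pretrained BERT module assumed to return per-token hidden states of size `hidden`:

```python
import torch
import torch.nn as nn

class NLIHead(nn.Module):
    """Maps the [CLS] features to entailment / contradiction / neutral probabilities."""
    def __init__(self, bert_encoder, hidden=768, n_classes=3):
        super().__init__()
        self.bert = bert_encoder        # assumed: (batch, seq) -> (batch, seq, hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, input_ids):
        hidden_states = self.bert(input_ids)   # contextual features
        cls = hidden_states[:, 0]              # [CLS] is the first token
        return torch.softmax(self.classifier(cls), dim=-1)
```

During fine-tuning the whole stack is updated, not just the classifier layer.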

Results: GLUE

Devlin et al. (2019)

Results: SQuAD

Devlin et al. (2019)

Ablation: Pretraining Tasks

• No NSP: BERT trained without next sentence prediction

• LTR & No NSP: a regular left-to-right LM without next sentence prediction

Ablation: Model Size

• Increasing model capacity consistently improves performance; also consistent with later work (e.g., GPT-2, RoBERTa, etc.)

2019: The Year of Pretraining

GPT-2 XLM XLNet RoBERTa

ELECTRA ALBERT T5 BART
