
Machine Translation 03: Language Models

Rico Sennrich

University of Edinburgh


Overview

last lecture:
simple neural networks
real-valued vectors as input and output

today's lecture:
how do we represent language in neural networks?
how do we treat language probabilistically (with neural networks)?


A probabilistic model of translation

Suppose that we have:
a source sentence S of length m: (x_1, ..., x_m)
a target sentence T of length n: (y_1, ..., y_n)

We can express translation as a probabilistic model:

T* = argmax_T P(T|S)

Expanding using the chain rule gives

P(T|S) = P(y_1, ..., y_n | x_1, ..., x_m)
       = ∏_{i=1}^{n} P(y_i | y_1, ..., y_{i−1}, x_1, ..., x_m)


A probabilistic model of language

simpler: instead of P(T|S), what is P(T)?
why do we care?

language model is an integral part of statistical machine translation:

T* = argmax_T P(S|T) P(T)    (Bayes' theorem)

T* ≈ argmax_T Σ_{m=1}^{M} λ_m h_m(S, T)    [Och, 2003]

in neural machine translation, a separate language model is untypical, but the architectures are similar

language models have many other applications in NLP:
language identification
predictive typing
as a component in various NLP tasks


N-gram language model

chain rule and Markov assumption
a sentence T of length n is a sequence x_1, ..., x_n

P(T) = P(x_1, ..., x_n)
     = ∏_{i=1}^{n} P(x_i | x_1, ..., x_{i−1})        (chain rule)
     ≈ ∏_{i=1}^{n} P(x_i | x_{i−k}, ..., x_{i−1})    (Markov assumption: n-gram model)
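To make the factorization concrete, here is a minimal sketch (not from the slides) of scoring a sentence under a bigram model (k = 1); bigram_prob is a hypothetical table mapping (previous word, word) to an estimated probability, and "<s>" is an assumed sentence-start symbol:

    import math

    def sentence_log_prob(tokens, bigram_prob):
        # bigram_prob: assumed dict mapping (previous word, word) -> P(word | previous word)
        log_p = 0.0
        prev = "<s>"                                   # assumed sentence-start symbol
        for w in tokens:
            log_p += math.log(bigram_prob[(prev, w)])  # one factor per word, as in the product above
            prev = w
        return log_p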


Discrete n-gram models

count-based models: estimate probability of n-grams via counting

main challenge: how to estimate probability of unseen n-grams?

n-grams:
zerogram: uniform distribution
unigram: probability estimated from word frequency
bigram: x_i depends only on x_{i−1}
trigram: x_i depends only on x_{i−2}, x_{i−1}


Example: bigrams

German police have recovered more than 100 items stolen from John Lennon's estate, including three diaries. The diaries were put on display at Berlin police headquarters with other items including a tape recording of a Beatles concert, two pairs of glasses, sheet music and a cigarette case. Police said a 58-year-old man had been arrested on suspicion of handling stolen goods. The items were stolen in New York in 2006 from Lennon's widow, Yoko Ono. Detectives said much of the haul was confiscated from an auction house in Berlin in July, sparking an investigation to find the rest of the stolen items. Ono identified the objects from photos she was shown at the German consulate in New York, German media reported.


P(of) = 5/121 ≈ 0.041


P(the|of) = P(of the) / P(of) ≈ 0.0165 / 0.041 ≈ 0.4
          = C(of the) / C(of) = 2/5 = 0.4
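The counts above can be reproduced with a few lines of Python; a minimal sketch, assuming naive whitespace tokenisation of the news paragraph (stored in a variable called text):

    from collections import Counter

    def bigram_mle(text):
        tokens = text.split()                            # naive whitespace tokenisation
        unigram_counts = Counter(tokens)
        bigram_counts = Counter(zip(tokens, tokens[1:]))
        # maximum-likelihood estimate: P(w2 | w1) = C(w1 w2) / C(w1)
        return lambda w1, w2: bigram_counts[(w1, w2)] / unigram_counts[w1]

    # p = bigram_mle(text); p("of", "the") -> C(of the) / C(of) = 2/5 = 0.4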


Sparse data

what is the probability of "of his"?
→ unseen in our training data, so we estimate P(of his) ≈ 0

can we simply use more training data?
→ for higher n, most n-grams will be unseen

Google n-grams

length   number            ratio (observed / possible n-grams)
1        13,588,391        1.00
2        314,843,401       1.70e-06
3        977,069,902       3.89e-13
4        1,313,818,354     3.85e-20
5        1,176,470,663     2.54e-27

Tokens: 1,024,908,267,229


Smoothing

core idea: reserve part of the probability mass for unseen events.
most popular: back-off smoothing: if an n-gram is unseen, make an estimate based on smaller n-grams

example: Jelinek-Mercer smoothing

p_smooth(x|h) = α(x|h) + γ(h) β(x|h)    if C(x,h) > 0
              = γ(h) β(x|h)             if C(x,h) = 0
              = α(x|h) + γ(h) β(x|h)    (the two cases collapse, since α(x|h) = 0 whenever C(x,h) = 0)

with

α(x|h) = λ(h) p_ML(x|h) = λ(h) C(x,h) / C(h)
γ(h) = 1 − λ(h)
β(x_i | x_{i−n+1}, ..., x_{i−1}) = p_smooth(x_i | x_{i−n+2}, ..., x_{i−1})
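A minimal sketch of this recursion, assuming a single fixed λ instead of a tuned λ(h), and counts[k] as a hypothetical dict mapping k-gram tuples to their counts (counts[0][()] being the total number of tokens); the slides' zerogram/uniform base case is omitted for brevity:

    def p_jm(word, history, counts, lam=0.5):
        # base case: unsmoothed unigram estimate
        if len(history) == 0:
            return counts[1].get((word,), 0) / counts[0][()]
        h_count = counts[len(history)].get(history, 0)
        c_xh = counts[len(history) + 1].get(history + (word,), 0)
        p_ml = c_xh / h_count if h_count > 0 else 0.0        # p_ML(x|h) = C(x,h) / C(h)
        # interpolate with the next-shorter history: lambda * p_ML + (1 - lambda) * p_smooth(shorter h)
        return lam * p_ml + (1 - lam) * p_jm(word, history[1:], counts, lam)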


Continuous n-gram language models

core idea: rather than backing off, rely on similarity in the internal representation for estimating unseen events:

P(barks | the Rottweiler) ≈ P(barks | the Terrier)


Continuous n-gram language models

[figure from Vaswani et al., 2013]

n-gram NNLM [Bengio et al., 2003]
input: context of n−1 previous words

output: probability distribution for next word

linear embedding layer with shared weights

one or several hidden layers


Representing words as vectors

One-hot encoding
example vocabulary: 'man', 'runs', 'the', '.'

input/output for p(runs | the man):

x_0 = (0, 0, 1, 0)ᵀ    x_1 = (1, 0, 0, 0)ᵀ    y_true = (0, 1, 0, 0)ᵀ

size of input/output vector: vocabulary size
embedding layer is lower-dimensional and dense

smaller weight matrices
network learns to group similar words to similar points in vector space
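A small numpy sketch of the same idea (vocabulary taken from the toy example; the 2-dimensional embedding size is an arbitrary choice): multiplying the embedding matrix by a one-hot vector just selects one column, so the embedding layer is a lookup.

    import numpy as np

    vocab = ["man", "runs", "the", "."]            # toy vocabulary from the slide
    V, d = len(vocab), 2                           # vocabulary size, embedding dimension (illustrative)

    def one_hot(word):
        v = np.zeros(V)
        v[vocab.index(word)] = 1.0
        return v

    E = np.random.randn(d, V) * 0.1                # dense embedding matrix, shared across positions
    x0, x1 = one_hot("the"), one_hot("man")        # context for p(runs | the man)
    e0, e1 = E @ x0, E @ x1                        # each product picks out one column of E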


Multi-class logistic regression

[figure: multi-class logistic regression drawn as a single-layer network — inputs 1, x_1, ..., x_n (features / input layer) connect with weights θ^(j) to one summation unit per class in layer 1, giving P(c = j | Θ, X); output activation g(z) = softmax(z)]


Softmax activation function

softmax function

p(y = j | x) = exp(x_j) / Σ_k exp(x_k)

softmax function normalizes the output vector to a probability distribution
→ computational cost is linear in the vocabulary size (!)
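A numerically stable sketch in numpy; the last comment is the point of the slide: the normalisation (and hence the cost of the output layer) touches every vocabulary entry.

    import numpy as np

    def softmax(z):
        z = z - z.max()          # subtract the max for numerical stability (does not change the result)
        e = np.exp(z)
        return e / e.sum()       # sum runs over the full vocabulary -> cost linear in vocabulary size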


Cross-entropy Loss

goal: predict probability 1 for correct word; 0 for rest:

L(Θ) = − Σ_{k=1}^{c} δ(y, k) log p(k | x, Θ)

δ(x, y) = 1 if x = y, 0 otherwise

simplified: L(Θ) = − log p(y | x, Θ)
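Because the target distribution is one-hot, only the term for the correct word survives, which gives the simplified form above; a minimal sketch:

    import numpy as np

    def cross_entropy(probs, target_index):
        # probs: softmax output over the vocabulary; target_index: position of the true next word
        return -np.log(probs[target_index])        # equals -log p(y | x, Theta)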


Feedforward neural language model: math

[figure from Vaswani et al., 2013]

h_1 = φ_{W_1}(E x_1, E x_2)
y = softmax(W_2 h_1)
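A sketch of these two lines in numpy, with tanh standing in for the unspecified activation φ, random initialisation, and illustrative sizes (none of these choices come from the slides):

    import numpy as np

    d, h, V = 64, 128, 10000                       # embedding, hidden, vocabulary sizes (illustrative)
    E  = np.random.randn(d, V) * 0.01              # shared embedding matrix
    W1 = np.random.randn(h, 2 * d) * 0.01          # hidden layer over the concatenated context
    W2 = np.random.randn(V, h) * 0.01              # output projection

    def ffnn_lm(x1, x2):                           # x1, x2: one-hot vectors for the two context words
        context = np.concatenate([E @ x1, E @ x2])
        h1 = np.tanh(W1 @ context)                 # h1 = phi_W1(E x1, E x2)
        z = W2 @ h1
        e = np.exp(z - z.max())
        return e / e.sum()                         # y = softmax(W2 h1)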


Recurrent neural network language model (RNNLM)

RNNLM [Mikolov et al., 2010]
motivation: condition on arbitrarily long context
→ no Markov assumption

we read in one word at a time, and update the hidden state incrementally

hidden state is initialized as a zero vector at time step 0

parameters:
embedding matrix E
feed-forward matrices W_1, W_2
recurrent matrix U

h_i = 0                               if i = 0
    = tanh(W_1 E x_i + U h_{i−1})     if i > 0

y_i = softmax(W_2 h_{i−1})
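A sketch of one time step in numpy (random initialisation and illustrative sizes, as before): the distribution over word i is computed from h_{i−1}, then x_i is folded into the new hidden state.

    import numpy as np

    d, h, V = 64, 128, 10000                       # embedding, hidden, vocabulary sizes (illustrative)
    E  = np.random.randn(d, V) * 0.01
    W1 = np.random.randn(h, d) * 0.01
    W2 = np.random.randn(V, h) * 0.01
    U  = np.random.randn(h, h) * 0.01

    def rnn_lm_step(x_i, h_prev):
        z = W2 @ h_prev
        e = np.exp(z - z.max())
        y_i = e / e.sum()                          # y_i = softmax(W2 h_{i-1})
        h_i = np.tanh(W1 @ (E @ x_i) + U @ h_prev) # h_i = tanh(W1 E x_i + U h_{i-1})
        return y_i, h_i

    h0 = np.zeros(h)                               # zero hidden state at time step 0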


Cross-entropy Loss in RNNs

unrolling the RNN produces an acyclic network (with shared weights)
backpropagation works as for a feed-forward network
each time step contributes to the (shared) weight update

loss is applied to every word:

L(Θ) = − Σ_{i=1}^{n} log p(x_i | x_1, ..., x_{i−1}, Θ)

teacher forcing: for each word, condition the prediction on the true history
→ efficient, but a mismatch to test time (where the history is unreliable)
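A sketch of the per-word loss with teacher forcing, reusing rnn_lm_step and h0 from the previous sketch; one_hot_tokens is assumed to be the sentence as a list of one-hot vectors:

    import numpy as np

    def teacher_forced_loss(one_hot_tokens, rnn_lm_step, h0):
        loss, h_prev = 0.0, h0
        for x_i in one_hot_tokens:
            y_i, h_next = rnn_lm_step(x_i, h_prev)  # distribution over word i, then fold in word i
            loss += -np.log(y_i @ x_i)              # -log p(x_i | x_1 .. x_{i-1}); x_i is one-hot
            h_prev = h_next                         # the true history drives the state (teacher forcing)
        return loss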

[figure from Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]


RNNs and long distance dependencies

[figures from Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]


Long Short-Term Memory (LSTM)

[figures from Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]


Long Short-Term Memory (LSTM) – Step-by-step

[figures from Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]


Gated Recurrent Units (GRUs)

[figure from Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]


RNN variants

gated units
sigmoid layers σ act as "gates" that control the flow of information

reduces vanishing gradient problem

strong empirical results

[figures from Christopher Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]


Further Reading

Required Reading
Koehn, 13.4

Optional Reading
Basic probability theory (Sharon Goldwater): http://homepages.inf.ed.ac.uk/sgwater/math_tutorials.html

introduction to LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/


Bibliography I

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167, Sapporo, Japan. Association for Computational Linguistics.

Vaswani, A., Zhao, Y., Fossum, V., and Chiang, D. (2013). Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 1387–1392, Seattle, Washington, USA.
