Edinburgh MT Lecture 11: Neural Language Models
TRANSCRIPT
“Neural” language models
Probabilistic language models
• Language models play the role of ...
• a judge of grammaticality
• a judge of semantic plausibility
• an enforcer of stylistic consistency
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
What is the probability of a sentence?
• Requirements
• Assign a probability to every sentence (i.e., string of words)
• Questions
• How many sentences are there in English?
• Too many :)
Σ_{e ∈ Σ*} p_LM(e) = 1        p_LM(e) ≥ 0   ∀ e ∈ Σ*
By the chain rule:

p_LM(e) = p(e_1, e_2, e_3, ..., e_ℓ)
        = p(e_1) × p(e_2 | e_1) × p(e_3 | e_1, e_2) × ... × p(e_ℓ | e_1, ..., e_{ℓ-1})

n-gram LMs

A sentence is a vector-valued random variable. An n-gram LM approximates the chain rule with a Markov assumption; for a bigram model:

p_LM(e) ≈ p(e_1 | START) × ∏_{i=2}^{ℓ} p(e_i | e_{i-1}) × p(STOP | e_ℓ)
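To make the bigram decomposition concrete, here is a minimal Python sketch of scoring a sentence under a bigram table; the probabilities in the table are made up for illustration, not taken from the lecture.

```python
import math

# Hypothetical bigram probabilities p(word | previous word); values are illustrative only.
bigram_p = {
    ("START", "my"): 0.08,
    ("my", "friends"): 0.24,
    ("friends", "call"): 0.10,
    ("call", "me"): 0.83,
    ("me", "Alex"): 0.03,
    ("Alex", "STOP"): 0.26,
}

def bigram_log2prob(words, table):
    """log2 p(sentence) under a bigram model with START/STOP padding."""
    padded = ["START"] + words + ["STOP"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = table.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")   # unseen bigram: the MLE assigns probability zero
        total += math.log2(p)
    return total

print(bigram_log2prob("my friends call me Alex".split(), bigram_p))
```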
Outcome       p(· | in)      p(· | the)
the           0.6            0.01
and           0.04           0.01
said          0.009          0.003
says          0.00001        0.009
of            0.1            0.002
why           0.1            0.003
Why           0.00008        0.0006
restaurant    0.0000008      0.2
destitute     0.00000064     0.1
As we’ve seen, probability tables like this are the workhorses of language and translation modeling.
Whence parameters?
Estimation.
p(x | y) = p(x, y) / p(y)

p_MLE(x) = count(x) / N

p_MLE(x, y) = count(x, y) / N

p_MLE(x | y) = (count(x, y) / N) × (N / count(y)) = count(x, y) / count(y)
p_MLE(call | friends) = count(friends call) / count(friends)
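A minimal sketch of this relative-frequency estimate, counting bigrams from a toy two-sentence corpus (the corpus is invented for illustration):

```python
from collections import Counter

# Toy corpus; each sentence is padded with START/STOP.
corpus = [
    "my friends call me Alex",
    "my friends call me late",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    words = ["START"] + sent.split() + ["STOP"]
    unigram_counts.update(words[:-1])            # counts of conditioning contexts
    bigram_counts.update(zip(words, words[1:]))

def p_mle(word, prev):
    """p_MLE(word | prev) = count(prev word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("call", "friends"))   # count(friends call) / count(friends) = 2/2 = 1.0
```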
LM Evaluation

• Extrinsic evaluation: build a new language model, use it for some task (MT, ASR, etc.)
• Intrinsic evaluation: measure how good we are at modeling language
We will use perplexity to evaluate models.

Given: w, p_LM

PPL = 2^( -(1/|w|) log_2 p_LM(w) )

0 ≤ PPL ≤ ∞
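A short sketch of the perplexity computation from per-token log2 probabilities (the scores below are illustrative):

```python
def perplexity(log2_probs):
    """PPL = 2 ** (-(1/|w|) * sum of log2 p(w_i | history))."""
    n = len(log2_probs)
    return 2 ** (-sum(log2_probs) / n)

# Illustrative per-token log2 probabilities for a 6-token sentence.
scores = [-3.65172, -2.07101, -3.32231, -0.271271, -4.961, -1.96773]
print(perplexity(scores))   # about 6.5 "word choices per position"
```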
Perplexity

• Generally fairly good correlations with BLEU for n-gram models
• Perplexity is a generalization of the notion of branching factor
• How many choices do I have at each position?
• State-of-the-art English LMs have PPL of ~100 word choices per position
• A uniform LM has a perplexity of |Σ|
• Humans do much better
• ... and bad models can do even worse than uniform!
MLE & Perplexity
• What is the lowest (best) perplexity possible for your model class?
• Compute the MLE!
• Well, that’s easy...
Example: scoring two sentences with the MLE bigram model.

p(my friends call me Alex) = p(my | START) × p(friends | my) × p(call | friends) × p(me | call) × p(Alex | me) × p(STOP | Alex)

p(my friends dub me Alex) = p(my | START) × p(friends | my) × p(dub | friends) × p(me | dub) × p(Alex | me) × p(STOP | Alex)

Per-term log_2 probabilities under the MLE:

Term                     "... call me ..."    "... dub me ..."
p(my | START)            -3.65172             -3.65172
p(friends | my)          -2.07101             -2.07101
p(call/dub | friends)    -3.32231             -∞
p(me | call/dub)         -0.271271            -2.54562
p(Alex | me)             -4.961               -4.961
p(STOP | Alex)           -1.96773             -1.96773
MLE assigns probability zero to unseen events
Zeros

• Two kinds of zero probs:
• Sampling zeros: zeros in the MLE due to impoverished observations
• Structural zeros: zeros that should be there. Do these really exist?
• Just because you haven’t seen something, doesn’t mean it doesn’t exist.
• In practice, we don’t like probability zero, even if there is an argument that it is a structural zero.
Smoothing
Smoothing refers to a family of estimation techniques that seek to model important general patterns in data while avoiding modeling noise or sampling artifacts. In particular, for language modeling, we seek

p(e) > 0   ∀ e ∈ Σ*

We will assume that Σ is known and finite.
Add-α Smoothing

p ~ Dirichlet(α)
x_i ~ Categorical(p)   ∀ 1 ≤ i ≤ |x|

Assuming this model, what is the most probable value of p, having observed training data x?
(bunch of calculus - read about it on Wikipedia)

p*_x = (count(x) + α_x − 1) / (N + Σ_{x'} (α_{x'} − 1))   ∀ α_x > 1

• Simplest possible smoother
• Surprisingly effective in many models
• Does not work well for language models
• There are procedures for dealing with 0 < α < 1
• When might these be useful?
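As a rough illustration, here is the common additive (Lidstone) form of add-α smoothing for a unigram model, p(x) = (count(x) + α) / (N + α|Σ|); note this is the symmetric-α variant rather than the Dirichlet-MAP formula above, and the vocabulary and counts are toy values:

```python
from collections import Counter

def add_alpha_probs(tokens, vocab, alpha=1.0):
    """Smoothed unigram estimates: p(x) = (count(x) + alpha) / (N + alpha * |vocab|)."""
    counts = Counter(tokens)
    n = len(tokens)
    denom = n + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

vocab = ["the", "and", "restaurant", "destitute"]   # toy vocabulary
tokens = ["the", "the", "and", "restaurant"]        # toy observations
probs = add_alpha_probs(tokens, vocab, alpha=0.5)
print(probs)                   # every word gets probability > 0
print(sum(probs.values()))     # sums to 1.0
```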
Interpolation
• “Mixture of MLEs”
p(dub | my friends) = λ_3 p_MLE(dub | my friends) + λ_2 p_MLE(dub | friends) + λ_1 p_MLE(dub) + λ_0 (1 / |Σ|)
Where do the lambdas come from?
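A sketch of the interpolated estimate with fixed λ weights (in practice the λs are tuned on held-out data, e.g. with EM; the tables and weights here are illustrative stand-ins):

```python
def p_interp(w, context, p3, p2, p1, vocab_size,
             lambdas=(0.5, 0.3, 0.15, 0.05)):
    """lambda3*p_MLE(w|u,v) + lambda2*p_MLE(w|v) + lambda1*p_MLE(w) + lambda0*(1/|V|)."""
    l3, l2, l1, l0 = lambdas
    u, v = context                       # the two preceding words
    return (l3 * p3.get((u, v, w), 0.0)
            + l2 * p2.get((v, w), 0.0)
            + l1 * p1.get(w, 0.0)
            + l0 / vocab_size)

# Illustrative MLE tables; (my, friends, dub) is unseen at the trigram level.
p3 = {}                                  # trigram MLEs
p2 = {("friends", "dub"): 0.0}           # bigram MLEs
p1 = {"dub": 0.0001}                     # unigram MLEs
print(p_interp("dub", ("my", "friends"), p3, p2, p1, vocab_size=10000))
```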
Discounting

Discounting adjusts the frequencies of observed events downward to reserve probability for the things that have not been observed.

Note that f(w_3 | w_1, w_2) > 0 only when count(w_1, w_2, w_3) > 0.

We introduce a discounted frequency:

0 ≤ f*(w_3 | w_1, w_2) ≤ f(w_3 | w_1, w_2)

The total discount is the zero-frequency probability:

λ(w_1, w_2) = 1 − Σ_{w'} f*(w' | w_1, w_2)
Back-off

Recursive formulation of probability:

p_BO(w_3 | w_1, w_2) = f*(w_3 | w_1, w_2)                               if f*(w_3 | w_1, w_2) > 0
                       α_{w_1,w_2} × λ(w_1, w_2) × p_BO(w_3 | w_2)      otherwise

where α_{w_1,w_2} is the "back-off weight".

Question 1: how do we discount?
Kneser-Ney Discounting

• State-of-the-art in language modeling for 15 years
• Two major intuitions
• Some contexts have lots of new words
• Some words appear in lots of contexts
• Procedure
• Only register a lower-order count the first time it is seen in a backoff context
• Example: bigram model
• “San Francisco” is a common bigram
• But, we only count the unigram “Francisco” the first time we see the bigram “San Francisco” - we change its unigram probability
Back-off (revisited)

p_BO(w_3 | w_1, w_2) = f*(w_3 | w_1, w_2)                               if f*(w_3 | w_1, w_2) > 0
                       α_{w_1,w_2} × λ(w_1, w_2) × p_BO(w_3 | w_2)      otherwise

Question 1: how do we discount?
Question 2: how many parameters?
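A rough sketch of a bigram back-off model with absolute discounting, following the f*, λ, α structure above (the discount value, the toy corpus, and the recursion base at the unigram level are simplifying assumptions for illustration; this is not Kneser-Ney):

```python
from collections import Counter

def make_backoff(tokens, discount=0.75):
    """Bigram back-off with absolute discounting (simplified sketch)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def p_uni(w):
        return unigrams[w] / n

    def p_bo(w, prev):
        if bigrams[(prev, w)] > 0:
            # discounted frequency f*(w | prev)
            return (bigrams[(prev, w)] - discount) / unigrams[prev]
        # zero-frequency probability lambda(prev): mass removed by discounting
        seen = sum(1 for u in unigrams if bigrams[(prev, u)] > 0)
        reserved = discount * seen / unigrams[prev]
        # back-off weight alpha renormalises the unigram model over unseen continuations
        unseen_mass = sum(p_uni(u) for u in unigrams if bigrams[(prev, u)] == 0)
        alpha = reserved / unseen_mass if unseen_mass > 0 else 0.0
        return alpha * p_uni(w)

    return p_bo

tokens = "my friends call me Alex my friends call me late".split()
p_bo = make_backoff(tokens)
print(p_bo("call", "friends"), p_bo("Alex", "friends"))   # seen vs. backed-off bigram
```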
“Neurons”

[Figure: a single unit with five inputs x_1, ..., x_5, collected into a vector x]
“While the brain metaphor is intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch to using more concise mathematical notation.” (Goldberg, 2015)
“Neurons”

[Figure: the unit with inputs x_1, ..., x_5 (vector x), weights w_1, ..., w_5 (vector w), and a single output y]

y = x · w
y = g(x · w)
“Neural” Networks

[Figure: inputs x_1, ..., x_5 (vector x) connected by a weight matrix W (with entries such as w_{4,1}) to outputs y_1, y_2, y_3 (vector y)]

y = xᵀW
y = g(xᵀW)

What about probabilities? Let each y_i be an outcome.

g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})      (“soft max”)
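A tiny numpy sketch of the softmax (with the usual max-subtraction for numerical stability, which is mathematically equivalent to the formula above):

```python
import numpy as np

def softmax(u):
    """g(u)_i = exp(u_i) / sum_i' exp(u_i'); shifting by max(u) avoids overflow."""
    u = np.asarray(u, dtype=float)
    e = np.exp(u - u.max())
    return e / e.sum()

y = softmax([2.0, 1.0, -1.0])
print(y, y.sum())   # a proper distribution over the outcomes y_i
```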
“Deep”

[Figure: inputs x (x_1, ..., x_5), a hidden layer y (y_1, y_2, y_3) via matrix W, and an output layer z (z_1, ..., z_4) via matrix V]

z = g(yᵀV)
  = g(h(xᵀW)ᵀV)
  = g(V h(W x))

Note: if g(x) = h(x) = x, then z = (VW) x = U x, a single linear map.
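A small numpy sketch of the two-layer computation z = g(V h(W x)), plus a check that without the non-linearities the layers collapse to the single matrix U = VW (the dimensions and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # first layer: 5 inputs -> 3 hidden units
V = rng.normal(size=(4, 3))    # second layer: 3 hidden units -> 4 outputs
x = rng.normal(size=5)

h = np.tanh                    # hidden non-linearity

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

z = softmax(V @ h(W @ x))      # z = g(V h(W x))
print(z)

# Without non-linearities the two layers are just one linear map U = V W.
U = V @ W
print(np.allclose(V @ (W @ x), U @ x))   # True
```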
Design Decisions
• How to represent inputs and outputs?
• Neural architecture?
• How many layers? (Requires non-linearities to improve capacity!)
• How many neurons?
• Recurrent or not?
• What kind of non-linearities?
Representing Language
• “One-hot” vectors
• Each position in a vector corresponds to a word type
• Sequence of words, sequence of vectors
• Bag of words: multiple vectors
• Distributed representations
• Vectors encode “features” of input words (character n-grams, morphological features, etc.)
Training Neural Networks
• Neural networks are supervised models - you need a set of inputs paired with outputs
• Algorithm (see the sketch after this list)
• Run for a while
• Give input to the network, see what it predicts
• Compute loss(y,y*) and (sub)gradient with respect to parameters. Use the chain rule, aka “back propagation”
• Update parameters (SGD, AdaGrad, LBFGS, etc.)
• Algorithm is automated, just need to specify model structure
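A minimal sketch of that training loop for a single-layer softmax classifier with cross-entropy loss and plain SGD (the data is synthetic and the gradient is written out by hand; a real system would use an autodiff toolkit for back-propagation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 10, 3
true_W = rng.normal(size=(n_features, n_classes))
X = rng.normal(size=(200, n_features))          # inputs
y = (X @ true_W).argmax(axis=1)                 # gold outputs y*

W = np.zeros((n_features, n_classes))           # the parameters we learn
lr = 0.1

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

for epoch in range(20):                         # "run for a while"
    total_loss = 0.0
    for x_i, y_i in zip(X, y):
        p = softmax(x_i @ W)                    # give input to the network, see what it predicts
        total_loss += -np.log(p[y_i])           # loss(y, y*): cross-entropy
        grad_logits = p.copy()
        grad_logits[y_i] -= 1.0                 # d loss / d logits, via the chain rule
        W -= lr * np.outer(x_i, grad_logits)    # SGD parameter update
    if epoch % 5 == 0:
        print(epoch, total_loss / len(X))       # average loss should go down
```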
Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i−n+1}, ..., e_{i−1})

[Figure: p(e_i | e_{i−n+1}, ..., e_{i−1}) is computed by a feed-forward network: each context word e_{i−3}, e_{i−2}, e_{i−1} is looked up in a shared embedding matrix C, the embeddings are combined through a weight matrix W and a tanh layer, then a matrix V and a softmax produce a distribution over e_i]
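A numpy sketch of one forward pass of this kind of architecture: a shared embedding matrix C for the context words, a tanh hidden layer, and a softmax over the vocabulary (sizes and random weights are illustrative; biases and the direct connections of the original model are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim, n = 1000, 30, 60, 3   # illustrative sizes; n = context length

C = rng.normal(scale=0.1, size=(vocab_size, emb_dim))          # shared embedding matrix
W = rng.normal(scale=0.1, size=(hidden_dim, n * emb_dim))      # context -> hidden
V = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))       # hidden -> vocabulary logits

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """p(e_i | e_{i-n}, ..., e_{i-1}) for a context of n word ids."""
    x = np.concatenate([C[i] for i in context_ids])   # look up and concatenate embeddings
    h = np.tanh(W @ x)                                 # hidden layer
    return softmax(V @ h)                              # distribution over the next word

p = next_word_distribution([17, 42, 7])                # three arbitrary word ids
print(p.shape, p.sum())                                # (1000,) ~1.0
```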
Devlin et al. (2014)
• Turn Bengio et al. (2003) into a translation model
• Conditional model; generate the next English word conditioned on
• The previous n English words you generated
• The aligned source word, and its m neighbors
p(e | f, a) = ∏_{i=1}^{|e|} p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1})

[Figure: the same feed-forward architecture, where p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1}) is computed from the previous English words (embedded via C) and the aligned source word f_{a_i} with its neighbours f_{a_i−1}, f_{a_i+1} (embedded via a second matrix D), followed by W, tanh, V, and a softmax]

Devlin et al. (2014)
Summary of neural LMs
• Two problems in standard statistical models
• We don’t condition on enough stuff
• We don’t know what features to use when we condition on lots of structure
• Punchline: Neural networks let us condition on a lot of stuff without an exponential growth in parameters
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
        = ∏_{i=1}^{|e|} p(e_i | e_{i−1}, ..., e_1) × ∏_{j=1}^{|f|} p(f_j | f_{j−1}, ..., f_1, e)

The second factor, p_TM(f | e), is itself a conditional language model.
WHAT IF I TOLD YOU
YOU CAN ENCODE A SENTENCE IN A FIXED-DIMENSIONAL VECTOR