Edinburgh MT Lecture 11: Neural Language Models
TRANSCRIPT
“Neural” language models
Probabilistic language models
• Language models play the role of ...
• a judge of grammaticality
• a judge of semantic plausibility
• an enforcer of stylistic consistency
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
What is the probability of a sentence?
• Requirements
• Assign a probability to every sentence (i.e., string of words)
• Questions
• How many sentences are there in English?
• Too many :)
Σ_{e ∈ Σ*} p_LM(e) = 1        p_LM(e) ≥ 0   ∀ e ∈ Σ*
By the chain rule:

p_LM(e) = p(e_1, e_2, e_3, ..., e_ℓ)
        = p(e_1) × p(e_2 | e_1) × p(e_3 | e_1, e_2) × ... × p(e_ℓ | e_1, ..., e_{ℓ-1})

n-gram LMs

A sentence is a vector-valued random variable. An n-gram LM approximates the chain rule with a Markov assumption; for a bigram model:

p_LM(e) ≈ p(e_1 | START) × ∏_{i=2}^{ℓ} p(e_i | e_{i-1}) × p(STOP | e_ℓ)
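To make the bigram decomposition concrete, here is a minimal Python sketch of scoring a sentence under a bigram table; the probabilities in the table are made up for illustration, not taken from the lecture.

```python
import math

# Hypothetical bigram probabilities p(word | previous word); values are illustrative only.
bigram_p = {
    ("START", "my"): 0.08,
    ("my", "friends"): 0.24,
    ("friends", "call"): 0.10,
    ("call", "me"): 0.83,
    ("me", "Alex"): 0.03,
    ("Alex", "STOP"): 0.26,
}

def bigram_log2prob(words, table):
    """log2 p(sentence) under a bigram model with START/STOP padding."""
    padded = ["START"] + words + ["STOP"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = table.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")   # unseen bigram: the MLE assigns probability zero
        total += math.log2(p)
    return total

print(bigram_log2prob("my friends call me Alex".split(), bigram_p))
```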
Outcome       p(· | in)      p(· | the)
the           0.6            0.01
and           0.04           0.01
said          0.009          0.003
says          0.00001        0.009
of            0.1            0.002
why           0.1            0.003
Why           0.00008        0.0006
restaurant    0.0000008      0.2
destitute     0.00000064     0.1
As we’ve seen, probability tables like this are the workhorses of language and translation modeling.
Whence parameters?
Estimation.
p(x | y) = p(x, y) / p(y)

p_MLE(x) = count(x) / N

p_MLE(x, y) = count(x, y) / N

p_MLE(x | y) = (count(x, y) / N) × (N / count(y)) = count(x, y) / count(y)
p_MLE(call | friends) = count(friends call) / count(friends)
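A minimal sketch of this relative-frequency estimate, counting bigrams from a toy two-sentence corpus (the corpus is invented for illustration):

```python
from collections import Counter

# Toy corpus; each sentence is padded with START/STOP.
corpus = [
    "my friends call me Alex",
    "my friends call me late",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    words = ["START"] + sent.split() + ["STOP"]
    unigram_counts.update(words[:-1])            # counts of conditioning contexts
    bigram_counts.update(zip(words, words[1:]))

def p_mle(word, prev):
    """p_MLE(word | prev) = count(prev word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("call", "friends"))   # count(friends call) / count(friends) = 2/2 = 1.0
```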
LM Evaluation

• Extrinsic evaluation: build a new language model, use it for some task (MT, ASR, etc.)
• Intrinsic evaluation: measure how good we are at modeling language
We will use perplexity to evaluate models.

Given: w, p_LM

PPL = 2^( -(1/|w|) log_2 p_LM(w) )

0 ≤ PPL ≤ ∞
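A short sketch of the perplexity computation from per-token log2 probabilities (the scores below are illustrative):

```python
def perplexity(log2_probs):
    """PPL = 2 ** (-(1/|w|) * sum of log2 p(w_i | history))."""
    n = len(log2_probs)
    return 2 ** (-sum(log2_probs) / n)

# Illustrative per-token log2 probabilities for a 6-token sentence.
scores = [-3.65172, -2.07101, -3.32231, -0.271271, -4.961, -1.96773]
print(perplexity(scores))   # about 6.5 "word choices per position"
```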
Perplexity

• Generally fairly good correlations with BLEU for n-gram models
• Perplexity is a generalization of the notion of branching factor
• How many choices do I have at each position?
• State-of-the-art English LMs have PPL of ~100 word choices per position
• A uniform LM has a perplexity of |Σ|
• Humans do much better
• ... and bad models can do even worse than uniform!
MLE & Perplexity
• What is the lowest (best) perplexity possible for your model class?
• Compute the MLE!
• Well, that’s easy...
Example: scoring two sentences with the MLE bigram model.

p(my friends call me Alex) = p(my | START) × p(friends | my) × p(call | friends) × p(me | call) × p(Alex | me) × p(STOP | Alex)

p(my friends dub me Alex) = p(my | START) × p(friends | my) × p(dub | friends) × p(me | dub) × p(Alex | me) × p(STOP | Alex)

Per-term log_2 probabilities under the MLE:

Term                     "... call me ..."    "... dub me ..."
p(my | START)            -3.65172             -3.65172
p(friends | my)          -2.07101             -2.07101
p(call/dub | friends)    -3.32231             -∞
p(me | call/dub)         -0.271271            -2.54562
p(Alex | me)             -4.961               -4.961
p(STOP | Alex)           -1.96773             -1.96773
MLE assigns probability zero to unseen events
Zeros

• Two kinds of zero probs:
• Sampling zeros: zeros in the MLE due to impoverished observations
• Structural zeros: zeros that should be there. Do these really exist?
• Just because you haven’t seen something, doesn’t mean it doesn’t exist.
• In practice, we don’t like probability zero, even if there is an argument that it is a structural zero.
Smoothing
Smoothing refers to a family of estimation techniques that seek to model important general patterns in data while avoiding modeling noise or sampling artifacts. In particular, for language modeling, we seek

p(e) > 0   ∀ e ∈ Σ*

We will assume that Σ is known and finite.
Add-α Smoothing

p ~ Dirichlet(α)
x_i ~ Categorical(p)   ∀ 1 ≤ i ≤ |x|

Assuming this model, what is the most probable value of p, having observed training data x?
(bunch of calculus - read about it on Wikipedia)

p*_x = (count(x) + α_x − 1) / (N + Σ_{x'} (α_{x'} − 1))   ∀ α_x > 1

• Simplest possible smoother
• Surprisingly effective in many models
• Does not work well for language models
• There are procedures for dealing with 0 < α < 1
• When might these be useful?
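As a rough illustration, here is the common additive (Lidstone) form of add-α smoothing for a unigram model, p(x) = (count(x) + α) / (N + α|Σ|); note this is the symmetric-α variant rather than the Dirichlet-MAP formula above, and the vocabulary and counts are toy values:

```python
from collections import Counter

def add_alpha_probs(tokens, vocab, alpha=1.0):
    """Smoothed unigram estimates: p(x) = (count(x) + alpha) / (N + alpha * |vocab|)."""
    counts = Counter(tokens)
    n = len(tokens)
    denom = n + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

vocab = ["the", "and", "restaurant", "destitute"]   # toy vocabulary
tokens = ["the", "the", "and", "restaurant"]        # toy observations
probs = add_alpha_probs(tokens, vocab, alpha=0.5)
print(probs)                   # every word gets probability > 0
print(sum(probs.values()))     # sums to 1.0
```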
Interpolation
• “Mixture of MLEs”
p(dub | my friends) = λ_3 p_MLE(dub | my friends) + λ_2 p_MLE(dub | friends) + λ_1 p_MLE(dub) + λ_0 (1 / |Σ|)
Where do the lambdas come from?
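A sketch of the interpolated estimate with fixed λ weights (in practice the λs are tuned on held-out data, e.g. with EM; the tables and weights here are illustrative stand-ins):

```python
def p_interp(w, context, p3, p2, p1, vocab_size,
             lambdas=(0.5, 0.3, 0.15, 0.05)):
    """lambda3*p_MLE(w|u,v) + lambda2*p_MLE(w|v) + lambda1*p_MLE(w) + lambda0*(1/|V|)."""
    l3, l2, l1, l0 = lambdas
    u, v = context                       # the two preceding words
    return (l3 * p3.get((u, v, w), 0.0)
            + l2 * p2.get((v, w), 0.0)
            + l1 * p1.get(w, 0.0)
            + l0 / vocab_size)

# Illustrative MLE tables; (my, friends, dub) is unseen at the trigram level.
p3 = {}                                  # trigram MLEs
p2 = {("friends", "dub"): 0.0}           # bigram MLEs
p1 = {"dub": 0.0001}                     # unigram MLEs
print(p_interp("dub", ("my", "friends"), p3, p2, p1, vocab_size=10000))
```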
Discounting

Discounting adjusts the frequencies of observed events downward to reserve probability for the things that have not been observed.

Note that f(w_3 | w_1, w_2) > 0 only when count(w_1, w_2, w_3) > 0.

We introduce a discounted frequency:

0 ≤ f*(w_3 | w_1, w_2) ≤ f(w_3 | w_1, w_2)

The total discount is the zero-frequency probability:

λ(w_1, w_2) = 1 − Σ_{w'} f*(w' | w_1, w_2)
Back-off

Recursive formulation of probability:

p_BO(w_3 | w_1, w_2) = f*(w_3 | w_1, w_2)                               if f*(w_3 | w_1, w_2) > 0
                       α_{w_1,w_2} × λ(w_1, w_2) × p_BO(w_3 | w_2)      otherwise

where α_{w_1,w_2} is the "back-off weight".

Question 1: how do we discount?
Kneser-Ney Discounting

• State-of-the-art in language modeling for 15 years
• Two major intuitions
• Some contexts have lots of new words
• Some words appear in lots of contexts
• Procedure
• Only register a lower-order count the first time it is seen in a backoff context
• Example: bigram model
• “San Francisco” is a common bigram
• But, we only count the unigram “Francisco” the first time we see the bigram “San Francisco” - we change its unigram probability
Back-off (revisited)

p_BO(w_3 | w_1, w_2) = f*(w_3 | w_1, w_2)                               if f*(w_3 | w_1, w_2) > 0
                       α_{w_1,w_2} × λ(w_1, w_2) × p_BO(w_3 | w_2)      otherwise

Question 1: how do we discount?
Question 2: how many parameters?
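A rough sketch of a bigram back-off model with absolute discounting, following the f*, λ, α structure above (the discount value, the toy corpus, and the recursion base at the unigram level are simplifying assumptions for illustration; this is not Kneser-Ney):

```python
from collections import Counter

def make_backoff(tokens, discount=0.75):
    """Bigram back-off with absolute discounting (simplified sketch)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def p_uni(w):
        return unigrams[w] / n

    def p_bo(w, prev):
        if bigrams[(prev, w)] > 0:
            # discounted frequency f*(w | prev)
            return (bigrams[(prev, w)] - discount) / unigrams[prev]
        # zero-frequency probability lambda(prev): mass removed by discounting
        seen = sum(1 for u in unigrams if bigrams[(prev, u)] > 0)
        reserved = discount * seen / unigrams[prev]
        # back-off weight alpha renormalises the unigram model over unseen continuations
        unseen_mass = sum(p_uni(u) for u in unigrams if bigrams[(prev, u)] == 0)
        alpha = reserved / unseen_mass if unseen_mass > 0 else 0.0
        return alpha * p_uni(w)

    return p_bo

tokens = "my friends call me Alex my friends call me late".split()
p_bo = make_backoff(tokens)
print(p_bo("call", "friends"), p_bo("Alex", "friends"))   # seen vs. backed-off bigram
```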
“Neurons”

[Figure: a single unit with five inputs x_1, ..., x_5, collected into a vector x]
“While the brain metaphor is intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch to using more concise mathematical notation.” (Goldberg, 2015)
“Neurons”

[Figure: the unit with inputs x_1, ..., x_5 (vector x), weights w_1, ..., w_5 (vector w), and a single output y]

y = x · w
y = g(x · w)
“Neural” Networks

[Figure: inputs x_1, ..., x_5 (vector x) connected by a weight matrix W (with entries such as w_{4,1}) to outputs y_1, y_2, y_3 (vector y)]

y = xᵀW
y = g(xᵀW)

What about probabilities? Let each y_i be an outcome.

g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})      (“soft max”)
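A tiny numpy sketch of the softmax (with the usual max-subtraction for numerical stability, which is mathematically equivalent to the formula above):

```python
import numpy as np

def softmax(u):
    """g(u)_i = exp(u_i) / sum_i' exp(u_i'); shifting by max(u) avoids overflow."""
    u = np.asarray(u, dtype=float)
    e = np.exp(u - u.max())
    return e / e.sum()

y = softmax([2.0, 1.0, -1.0])
print(y, y.sum())   # a proper distribution over the outcomes y_i
```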
“Deep”

[Figure: inputs x (x_1, ..., x_5), a hidden layer y (y_1, y_2, y_3) via matrix W, and an output layer z (z_1, ..., z_4) via matrix V]

z = g(yᵀV)
  = g(h(xᵀW)ᵀV)
  = g(V h(W x))

Note: if g(x) = h(x) = x, then z = (VW) x = U x, a single linear map.
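A small numpy sketch of the two-layer computation z = g(V h(W x)), plus a check that without the non-linearities the layers collapse to the single matrix U = VW (the dimensions and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # first layer: 5 inputs -> 3 hidden units
V = rng.normal(size=(4, 3))    # second layer: 3 hidden units -> 4 outputs
x = rng.normal(size=5)

h = np.tanh                    # hidden non-linearity

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

z = softmax(V @ h(W @ x))      # z = g(V h(W x))
print(z)

# Without non-linearities the two layers are just one linear map U = V W.
U = V @ W
print(np.allclose(V @ (W @ x), U @ x))   # True
```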
Design Decisions
• How to represent inputs and outputs?
• Neural architecture?
• How many layers? (Requires non-linearities to improve capacity!)
• How many neurons?
• Recurrent or not?
• What kind of non-linearities?
Representing Language
• “One-hot” vectors
• Each position in a vector corresponds to a word type
• Sequence of words, sequence of vectors
• Bag of words: multiple vectors
• Distributed representations
• Vectors encode “features” of input words (character n-grams, morphological features, etc.)
Training Neural Networks
• Neural networks are supervised models - you need a set of inputs paired with outputs
• Algorithm (see the sketch after this list)
• Run for a while
• Give input to the network, see what it predicts
• Compute loss(y,y*) and (sub)gradient with respect to parameters. Use the chain rule, aka “back propagation”
• Update parameters (SGD, AdaGrad, LBFGS, etc.)
• Algorithm is automated, just need to specify model structure
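A minimal sketch of that training loop for a single-layer softmax classifier with cross-entropy loss and plain SGD (the data is synthetic and the gradient is written out by hand; a real system would use an autodiff toolkit for back-propagation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes = 10, 3
true_W = rng.normal(size=(n_features, n_classes))
X = rng.normal(size=(200, n_features))          # inputs
y = (X @ true_W).argmax(axis=1)                 # gold outputs y*

W = np.zeros((n_features, n_classes))           # the parameters we learn
lr = 0.1

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

for epoch in range(20):                         # "run for a while"
    total_loss = 0.0
    for x_i, y_i in zip(X, y):
        p = softmax(x_i @ W)                    # give input to the network, see what it predicts
        total_loss += -np.log(p[y_i])           # loss(y, y*): cross-entropy
        grad_logits = p.copy()
        grad_logits[y_i] -= 1.0                 # d loss / d logits, via the chain rule
        W -= lr * np.outer(x_i, grad_logits)    # SGD parameter update
    if epoch % 5 == 0:
        print(epoch, total_loss / len(X))       # average loss should go down
```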
Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i−n+1}, ..., e_{i−1})

[Figure: p(e_i | e_{i−n+1}, ..., e_{i−1}) is computed by a feed-forward network: each context word e_{i−3}, e_{i−2}, e_{i−1} is looked up in a shared embedding matrix C, the embeddings are combined through a weight matrix W and a tanh layer, then a matrix V and a softmax produce a distribution over e_i]
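A numpy sketch of one forward pass of this kind of architecture: a shared embedding matrix C for the context words, a tanh hidden layer, and a softmax over the vocabulary (sizes and random weights are illustrative; biases and the direct connections of the original model are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden_dim, n = 1000, 30, 60, 3   # illustrative sizes; n = context length

C = rng.normal(scale=0.1, size=(vocab_size, emb_dim))          # shared embedding matrix
W = rng.normal(scale=0.1, size=(hidden_dim, n * emb_dim))      # context -> hidden
V = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))       # hidden -> vocabulary logits

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def next_word_distribution(context_ids):
    """p(e_i | e_{i-n}, ..., e_{i-1}) for a context of n word ids."""
    x = np.concatenate([C[i] for i in context_ids])   # look up and concatenate embeddings
    h = np.tanh(W @ x)                                 # hidden layer
    return softmax(V @ h)                              # distribution over the next word

p = next_word_distribution([17, 42, 7])                # three arbitrary word ids
print(p.shape, p.sum())                                # (1000,) ~1.0
```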
Devlin et al. (2014)
• Turn Bengio et al. (2003) into a translation model
• Conditional model; generate the next English word conditioned on
• The previous n English words you generated
• The aligned source word, and its m neighbors
p(e | f, a) = ∏_{i=1}^{|e|} p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1})

[Figure: the same feed-forward architecture, where p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1}) is computed from the previous English words (embedded via C) and the aligned source word f_{a_i} with its neighbours f_{a_i−1}, f_{a_i+1} (embedded via a second matrix D), followed by W, tanh, V, and a softmax]

Devlin et al. (2014)
Summary of neural LMs
• Two problems in standard statistical models
• We don’t condition on enough stuff
• We don’t know what features to use when we condition on lots of structure
• Punchline: Neural networks let us condition on a lot of stuff without an exponential growth in parameters
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
        = ∏_{i=1}^{|e|} p(e_i | e_{i−1}, ..., e_1) × ∏_{j=1}^{|f|} p(f_j | f_{j−1}, ..., f_1, e)

The second factor, p_TM(f | e), is itself a conditional language model.
WHAT IF I TOLD YOU
YOU CAN ENCODE A SENTENCE IN A FIXED-DIMENSIONAL VECTOR