Edinburgh MT lecture 11: Neural language models


Page 1: Edinburgh MT lecture 11: Neural language models

“Neural” language models

Page 2: Edinburgh MT lecture 11: Neural language models

Probabilistic language models

• Language models play the role of ...

• a judge of grammaticality

• a judge of semantic plausibility

• an enforcer of stylistic consistency

Page 4: Edinburgh MT lecture 11: Neural language models

LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)

Pages 5-6: Edinburgh MT lecture 11: Neural language models

What is the probability of a sentence?

• Requirements

• Assign a probability to every sentence (i.e., string of words)

• Questions

• How many sentences are there in English?

• Too many :)

∑_{e ∈ Σ*} p_LM(e) = 1

p_LM(e) ≥ 0   ∀ e ∈ Σ*

Pages 7-9: Edinburgh MT lecture 11: Neural language models

n-gram LMs

e is a vector-valued random variable.

p_LM(e) = p(e_1, e_2, e_3, ..., e_ℓ)

        = p(e_1) × p(e_2 | e_1) × p(e_3 | e_1, e_2) × p(e_4 | e_1, e_2, e_3) × ··· × p(e_ℓ | e_1, e_2, ..., e_{ℓ-1})

        = p(e_1 | START) × ∏_{i=2}^{ℓ} p(e_i | e_{i-1}) × p(STOP | e_ℓ)   (bigram model)
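As a concrete illustration of this factorization (not from the slides), here is a minimal Python sketch that scores a sentence under a bigram model with START/STOP padding; the table bigram_p and its probability values are invented for the example:

import math

# Hypothetical bigram table p(w_i | w_{i-1}); a real table would be estimated from data.
bigram_p = {
    ("START", "my"): 0.08,
    ("my", "friends"): 0.24,
    ("friends", "call"): 0.10,
    ("call", "me"): 0.83,
    ("me", "Alex"): 0.03,
    ("Alex", "STOP"): 0.26,
}

def bigram_logprob(words, table):
    # log2 p_LM(e) under the bigram factorization, with START/STOP padding
    padded = ["START"] + words + ["STOP"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = table.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes out the whole sentence
        total += math.log2(p)
    return total

print(bigram_logprob("my friends call me Alex".split(), bigram_p))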

Page 10: Edinburgh MT lecture 11: Neural language models

Outcome       p(· | in)      p(· | the)
the           0.6            0.01
and           0.04           0.01
said          0.009          0.003
says          0.00001        0.009
of            0.1            0.002
why           0.1            0.003
Why           0.00008        0.0006
restaurant    0.0000008      0.2
destitute     0.00000064     0.1

As we’ve seen, probability tables like this are the workhorses of language and translation modeling.

Page 11: Edinburgh MT lecture 11: Neural language models

Whence parameters?

Page 12: Edinburgh MT lecture 11: Neural language models

Whence parameters?

Estimation.

Pages 13-15: Edinburgh MT lecture 11: Neural language models

p(x | y) = p(x, y) / p(y)

p_MLE(x) = count(x) / N

p_MLE(x, y) = count(x, y) / N

p_MLE(x | y) = (count(x, y) / N) × (N / count(y)) = count(x, y) / count(y)

p_MLE(call | friends) = count(friends call) / count(friends)
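A minimal sketch of relative-frequency (MLE) estimation from counts; the two-sentence corpus is invented purely to make the snippet runnable:

from collections import Counter

corpus = [
    "my friends call me Alex".split(),
    "my friends call me often".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_mle(w, prev):
    # p_MLE(w | prev) = count(prev, w) / count(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("call", "friends"))  # count(friends call) / count(friends) = 2/2 = 1.0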

Page 16: Edinburgh MT lecture 11: Neural language models

LM Evaluation

• Extrinsic evaluation: build a new language model, use it for some task (MT, ASR, etc.)

• Intrinsic: measure how good we are at modeling language

We will use perplexity to evaluate models.

Given: w, p_LM

PPL = 2^(−(1/|w|) · log2 p_LM(w))

0 ≤ PPL ≤ ∞
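A small sketch of the perplexity computation, assuming we already have the model's total log2 probability of a held-out string w of |w| tokens:

import math

def perplexity(log2_prob, num_tokens):
    # PPL = 2^(-(1/|w|) * log2 p_LM(w))
    return 2 ** (-log2_prob / num_tokens)

# A uniform model over a 10,000-word vocabulary gives each token probability 1/10000,
# so its perplexity is |Sigma| = 10000 (up to floating-point rounding).
V, n = 10_000, 20
print(perplexity(n * math.log2(1 / V), n))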

Page 17: Edinburgh MT lecture 11: Neural language models

Perplexity

• Generally fairly good correlations with BLEU for n-gram models

• Perplexity is a generalization of the notion of branching factor

• How many choices do I have at each position?

• State-of-the-art English LMs have PPL of ~100 word choices per position

• A uniform LM has a perplexity of |Σ|

• Humans do much better

• ... and bad models can do even worse than uniform!

Page 18: Edinburgh MT lecture 11: Neural language models

MLE & Perplexity

• What is the lowest (best) perplexity possible for your model class?

• Compute the MLE!

• Well, that’s easy...

Pages 19-27: Edinburgh MT lecture 11: Neural language models

Two sentences scored factor by factor with MLE bigram estimates (log probabilities shown in the table):

START my friends call me Alex STOP
p(my | START) × p(friends | my) × p(call | friends) × p(me | call) × p(Alex | me) × p(STOP | Alex)

START my friends dub me Alex STOP
p(my | START) × p(friends | my) × p(dub | friends) × p(me | dub) × p(Alex | me) × p(STOP | Alex)

Factor                                  "...call me..."    "...dub me..."
p(my | START)                           -3.65172           -3.65172
p(friends | my)                         -2.07101           -2.07101
p(call | friends) / p(dub | friends)    -3.32231           -∞
p(me | call) / p(me | dub)              -0.271271          -2.54562
p(Alex | me)                            -4.961             -4.961
p(STOP | Alex)                          -1.96773           -1.96773

MLE assigns probability zero to unseen events.

Page 28: Edinburgh MT lecture 11: Neural language models

Zeros

• Two kinds of zero probs:

• Sampling zeros: zeros in the MLE due to impoverished observations

• Structural zeros: zeros that should be there. Do these really exist?

• Just because you haven’t seen something, doesn’t mean it doesn’t exist.

• In practice, we don’t like probability zero, even if there is an argument that it is a structural zero.

Page 29: Edinburgh MT lecture 11: Neural language models

Smoothing

Smoothing refers to a family of estimation techniques that seek to model important general patterns in data while avoiding modeling noise or sampling artifacts. In particular, for language modeling, we seek

p(e) > 0   ∀ e ∈ Σ*

We will assume that Σ is known and finite.

Page 30: Edinburgh MT lecture 11: Neural language models

Add-α Smoothing

p ∼ Dirichlet(α)

x_i ∼ Categorical(p)   ∀ 1 ≤ i ≤ |x|

Assuming this model, what is the most probable value of p, having observed training data x?

(bunch of calculus - read about it on Wikipedia)

p*_x = (count(x) + α_x − 1) / (N + ∑_{x'} (α_{x'} − 1))   ∀ α_x > 1
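A minimal sketch of this MAP estimate with a symmetric Dirichlet prior; setting alpha = 2 recovers add-one (Laplace) smoothing. The tokens and vocabulary below are invented:

from collections import Counter

def add_alpha_probs(tokens, vocab, alpha):
    # p*_x = (count(x) + alpha - 1) / (N + |vocab| * (alpha - 1)), for alpha > 1
    counts = Counter(tokens)
    denom = len(tokens) + len(vocab) * (alpha - 1)
    return {x: (counts[x] + alpha - 1) / denom for x in vocab}

probs = add_alpha_probs(["the", "the", "of"], ["the", "of", "restaurant"], alpha=2.0)
print(probs)                 # the unseen word "restaurant" now gets non-zero probability
print(sum(probs.values()))   # sums to 1 (up to rounding)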

Page 31: Edinburgh MT lecture 11: Neural language models

Add-α Smoothing

• Simplest possible smoother

• Surprisingly effective in many models

• Does not work well for language models

• There are procedures for dealing with 0 < α < 1

• When might these be useful?

Page 32: Edinburgh MT lecture 11: Neural language models

Interpolation

• “Mixture of MLEs”

p(dub | my friends) = λ3 p_MLE(dub | my friends) + λ2 p_MLE(dub | friends) + λ1 p_MLE(dub) + λ0 · (1/|Σ|)

Where do the lambdas come from?
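A sketch of the mixture with invented component estimates and lambda values. (In practice the lambdas are not guessed by hand: they are typically tuned on held-out data, e.g. with EM.)

def interpolated_p(p3, p2, p1, vocab_size, lambdas):
    # lambda3*p_MLE(w | u, v) + lambda2*p_MLE(w | v) + lambda1*p_MLE(w) + lambda0/|Sigma|
    l3, l2, l1, l0 = lambdas
    return l3 * p3 + l2 * p2 + l1 * p1 + l0 / vocab_size

# Hypothetical component estimates for p(dub | my friends); the lambdas sum to 1.
print(interpolated_p(0.0, 0.001, 0.00002, 50_000, (0.5, 0.3, 0.15, 0.05)))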

Page 33: Edinburgh MT lecture 11: Neural language models

Discounting

Discounting adjusts the frequencies of observed events downward to reserve probability for the things that have not been observed.

We introduce a discounted frequency f*:

0 ≤ f*(w3 | w1, w2) ≤ f(w3 | w1, w2)

Note: f*(w3 | w1, w2) > 0 only when count(w1, w2, w3) > 0.

The total discount is the zero-frequency probability:

λ(w1, w2) = 1 − ∑_{w'} f*(w' | w1, w2)

Page 34: Edinburgh MT lecture 11: Neural language models

Back-off

Recursive formulation of probability:

p_BO(w3 | w1, w2) = f*(w3 | w1, w2)                            if f*(w3 | w1, w2) > 0
                  = α_{w1,w2} × λ(w1, w2) × p_BO(w3 | w2)      otherwise

α_{w1,w2} is the "back-off weight".

Question 1: how do we discount?

Page 35: Edinburgh MT lecture 11: Neural language models

Kneser-Ney Discounting

• State-of-the-art in language modeling for 15 years

• Two major intuitions

• Some contexts have lots of new words

• Some words appear in lots of contexts

• Procedure

• Only register a lower-order count the first time it is seen in a backoff context

• Example: bigram model

• “San Francisco” is a common bigram

• But we only count the unigram "Francisco" the first time we see the bigram "San Francisco"; this changes its unigram probability
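A rough sketch of the continuation-count idea behind the lower-order Kneser-Ney estimate: a word's unigram score is proportional to how many distinct contexts it follows, not to its raw frequency. The counts are invented, and the absolute-discounting step of full Kneser-Ney is omitted:

from collections import defaultdict

def continuation_unigram_probs(bigram_counts):
    # p_continuation(w) is proportional to the number of distinct words that precede w
    contexts = defaultdict(set)
    for (prev, w), c in bigram_counts.items():
        if c > 0:
            contexts[w].add(prev)
    total = sum(len(s) for s in contexts.values())
    return {w: len(s) / total for w, s in contexts.items()}

counts = {("San", "Francisco"): 100, ("the", "restaurant"): 3,
          ("a", "restaurant"): 2, ("that", "restaurant"): 1}
print(continuation_unigram_probs(counts))  # "Francisco" is frequent but follows only "San"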

Page 36: Edinburgh MT lecture 11: Neural language models

Back-off

Recursive formulation of probability:

p_BO(w3 | w1, w2) = f*(w3 | w1, w2)                            if f*(w3 | w1, w2) > 0
                  = α_{w1,w2} × λ(w1, w2) × p_BO(w3 | w2)      otherwise

α_{w1,w2} is the "back-off weight".

Question 1: how do we discount?

Question 2: how many parameters?
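A sketch of one step of this recursion, assuming the discounted frequencies f*, back-off weights alpha, and zero-frequency masses lambda have already been computed; the example tables and the stand-in lower-order model are invented:

def p_bo(w3, w1, w2, f_star, alpha, lam, p_lower):
    # use the discounted trigram frequency if it is non-zero, otherwise back off
    d = f_star.get((w1, w2, w3), 0.0)
    if d > 0:
        return d
    return alpha[(w1, w2)] * lam[(w1, w2)] * p_lower(w3, w2)

f_star = {("my", "friends", "call"): 0.4}
alpha = {("my", "friends"): 1.25}
lam = {("my", "friends"): 0.2}
p_lower = lambda w3, w2: {"dub": 0.01}.get(w3, 0.0)  # stand-in lower-order model

print(p_bo("call", "my", "friends", f_star, alpha, lam, p_lower))  # seen: 0.4
print(p_bo("dub", "my", "friends", f_star, alpha, lam, p_lower))   # backed off: 1.25*0.2*0.01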

Pages 37-43: Edinburgh MT lecture 11: Neural language models

"Neurons"

[Figure: a single "neuron" with inputs x = (x1, ..., x5), weights w = (w1, ..., w5), and output y.]

y = x · w

y = g(x · w)

"While the brain metaphor is intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch to using more concise mathematical notation." (Goldberg, 2015)
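The same picture as a few lines of NumPy, with arbitrary example values and tanh standing in for the non-linearity g:

import numpy as np

def neuron(x, w, g=np.tanh):
    # y = g(x . w): weighted sum of the inputs passed through a non-linearity
    return g(np.dot(x, w))

x = np.array([1.0, 0.0, 2.0, -1.0, 0.5])   # inputs x1..x5
w = np.array([0.2, -0.4, 0.1, 0.3, 0.0])   # weights w1..w5
print(neuron(x, w))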

Pages 44-49: Edinburgh MT lecture 11: Neural language models

"Neural" Networks

[Figure: a layer of neurons. Inputs x = (x1, ..., x5) connect to outputs y = (y1, y2, y3) through a weight matrix W with entries such as w_{4,1}.]

y = x^T W

y = g(x^T W)

What about probabilities? Let each y_i be an outcome.

g(u)_i = exp(u_i) / ∑_{i'} exp(u_{i'})   ("soft max")
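A short sketch of one fully connected layer whose activation g is the softmax, so the outputs can be read as a probability distribution over outcomes; the weights here are random and purely illustrative:

import numpy as np

def softmax(u):
    # g(u)_i = exp(u_i) / sum_i' exp(u_i'), computed stably
    e = np.exp(u - np.max(u))
    return e / e.sum()

def layer(x, W):
    # y = g(x^T W)
    return softmax(x @ W)

rng = np.random.default_rng(0)
x = rng.normal(size=5)        # inputs x1..x5
W = rng.normal(size=(5, 3))   # three outcomes y1..y3
y = layer(x, W)
print(y, y.sum())             # probabilities summing to 1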

Pages 50-56: Edinburgh MT lecture 11: Neural language models

"Deep"

[Figure: two stacked layers. Inputs x = (x1, ..., x5) feed a hidden layer y = (y1, y2, y3) through W; the hidden layer feeds outputs z = (z1, z2, z3, z4) through V.]

z = g(y^T V)

z = g(h(x^T W)^T V)

z = g(V h(W x))

Note: if g(x) = h(x) = x, then z = (VW) x = U x, i.e. the two layers collapse to a single linear map.
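The two-layer ("deep") version as a short NumPy sketch, with tanh as h and softmax as g; the weights are random. With h and g both the identity, the composition would reduce to the single matrix U = VW, which is why the non-linearities matter:

import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def mlp(x, W, V, h=np.tanh, g=softmax):
    # z = g(V h(W x))
    return g(V @ h(W @ x))

rng = np.random.default_rng(0)
x = rng.normal(size=5)        # inputs x1..x5
W = rng.normal(size=(3, 5))   # input -> hidden (y1..y3)
V = rng.normal(size=(4, 3))   # hidden -> outputs (z1..z4)
print(mlp(x, W, V))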

Page 57: Edinburgh MT lecture 11: Neural language models

Design Decisions

• How to represent inputs and outputs?

• Neural architecture?

• How many layers? (Requires non-linearities to improve capacity!)

• How many neurons?

• Recurrent or not?

• What kind of non-linearities?

Page 58: Edinburgh MT lecture 11: Neural language models

Representing Language

• “One-hot” vectors

• Each position in a vector corresponds to a word type

• Sequence of words, sequence of vectors

• Bag of words: multiple vectors

• Distributed representations

• Vectors encode “features” of input words (character n-grams, morphological features, etc.)
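A minimal sketch of one-hot and bag-of-words representations over a tiny invented vocabulary:

import numpy as np

vocab = {"my": 0, "friends": 1, "call": 2, "me": 3, "Alex": 4}

def one_hot(word, vocab):
    # one position per word type
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

sentence = [one_hot(w, vocab) for w in "my friends call me Alex".split()]
bag_of_words = np.sum(sentence, axis=0)   # order-free alternative to the sequence
print(sentence[0], bag_of_words)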

Page 59: Edinburgh MT lecture 11: Neural language models

Training Neural Networks

• Neural networks are supervised models - you need a set of inputs paired with outputs

• Algorithm

• Run for a while

• Give input to the network, see what it predicts

• Compute loss(y, y*) and its (sub)gradient with respect to the parameters, using the chain rule (a.k.a. "backpropagation")

• Update parameters (SGD, AdaGrad, LBFGS, etc.)

• The algorithm is automated; you just need to specify the model structure
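A toy version of this loop: a single softmax layer trained by full-batch gradient descent on invented data, with the chain rule worked out by hand for this one layer (a real toolkit automates that step as backpropagation and would usually use minibatches):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # invented inputs
Y = (X[:, 0] + X[:, 1] > 0).astype(int)     # invented labels (2 classes)
W = np.zeros((5, 2))

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for step in range(200):                                   # "run for a while"
    probs = softmax(X @ W)                                # see what the network predicts
    loss = -np.mean(np.log(probs[np.arange(len(Y)), Y]))  # cross-entropy loss(y, y*)
    grad = X.T @ (probs - np.eye(2)[Y]) / len(Y)          # chain rule for this single layer
    W -= 0.5 * grad                                       # gradient update
print(loss)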

Page 60: Edinburgh MT lecture 11: Neural language models

Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i−n+1}, ..., e_{i−1})

p(e_i | e_{i−n+1}, ..., e_{i−1}) =

[Figure: each previous word e_{i−1}, e_{i−2}, e_{i−3} is looked up in a shared embedding matrix C; the embeddings are concatenated into x, passed through W with a tanh non-linearity, then through V with a softmax over the vocabulary to give the distribution over e_i.]
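A rough NumPy sketch of the forward pass suggested by the figure (embedding lookup, concatenation, tanh, softmax). This is my reading of the diagram, not the authors' code: bias terms and the optional direct input-to-output connection of the original paper are omitted, and all parameters are random:

import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

def next_word_probs(context_ids, C, W, V):
    # p(e_i | e_{i-n+1}, ..., e_{i-1}) = softmax(V tanh(W x)), x = concatenated embeddings
    x = np.concatenate([C[j] for j in context_ids])
    return softmax(V @ np.tanh(W @ x))

rng = np.random.default_rng(0)
vocab_size, emb_dim, hid_dim, context = 1000, 32, 64, 3
C = rng.normal(size=(vocab_size, emb_dim))          # shared word embeddings
W = rng.normal(size=(hid_dim, emb_dim * context))   # concatenated context -> hidden
V = rng.normal(size=(vocab_size, hid_dim))          # hidden -> vocabulary
p = next_word_probs([5, 42, 7], C, W, V)
print(p.shape, p.sum())                             # (1000,) 1.0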

Page 61: Edinburgh MT lecture 11: Neural language models

Bengio et al. (2003)

Page 62: Edinburgh MT lecture 11: Neural language models

Devlin et al. (2014)

• Turn Bengio et al. (2003) into a translation model

• Conditional model; generate the next English word conditioned on

• The previous n English words you generated

• The aligned source word, and its m neighbors

Page 63: Edinburgh MT lecture 11: Neural language models

p(e | f, a) = ∏_{i=1}^{|e|} p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1})

p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1}) =

[Figure: the network of Bengio et al. (2003), extended with source-side context. The previous target words are embedded with C; the aligned source word f_{a_i} and its neighbors f_{a_i−1}, f_{a_i+1} are embedded with D; everything is concatenated into x and passed through W, tanh, V, and a softmax to predict e_i.]

Page 64: Edinburgh MT lecture 11: Neural language models

Devlin et al. (2014)

Page 65: Edinburgh MT lecture 11: Neural language models

Summary of neural LMs

• Two problems in standard statistical models

• We don’t condition on enough stuff

• We don’t know what features to use when we condition on lots of structure

• Punchline: Neural networks let us condition on a lot of stuff without an exponential growth in parameters

Pages 66-69: Edinburgh MT lecture 11: Neural language models

LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)

        = ∏_{i=1}^{|e|} p(e_i | e_{i−1}, ..., e_1) × ∏_{j=1}^{|f|} p(f_j | f_{j−1}, ..., f_1, e)

The second factor, p_TM(f | e), is a conditional language model.

WHAT IF I TOLD YOU
YOU CAN ENCODE A SENTENCE IN A FIXED-DIMENSIONAL VECTOR