
Page 1

New Developments in Neural Language Modeling

Adji Bousso Dieng

Joint work with Chong Wang, Jianfeng Gao, and John Paisley

Page 2

Language modeling applications

Page 3

Language modeling

• Denote by w_1, \ldots, w_n a sequence of n words.

• Example: (it, is, sunny, today)

• A language model computes the joint probability P(w_1, \ldots, w_n).

• The chain rule of probability tells us that

P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{1:i-1})

Goal → compute these conditional probabilities.
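To make the factorization concrete, here is a minimal Python sketch (not from the talk; cond_prob is a hypothetical stand-in for a trained model) that scores a sentence with the chain rule:

```python
# Toy chain-rule scoring: P(w_1..w_n) = P(w_1) * prod_i P(w_i | w_1..w_{i-1}).
def cond_prob(word, history):
    # A real language model would map (history, word) to a probability;
    # here we just return a flat distribution over a tiny 4-word vocabulary.
    vocab = ["it", "is", "sunny", "today"]
    return 1.0 / len(vocab)

def sentence_prob(words):
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, words[:i])  # P(w_i | w_1..w_{i-1})
    return p

print(sentence_prob(["it", "is", "sunny", "today"]))  # 0.25^4 = 0.00390625
```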

Page 4

N-grams

Unigram: Independence assumption (bag-of-words model)

P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)

N-gram: Markov assumption of order N

P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-N+1:i-1})

Learn model with maximum likelihood

Problem → poor generalization, curse of dimensionality.
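For instance, maximum likelihood for a bigram model (N = 2) reduces to counting and normalizing. A minimal sketch on a made-up toy corpus, which also exposes the generalization failure: any unseen bigram gets probability zero.

```python
from collections import Counter

# Maximum-likelihood bigram model: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
corpus = "it is sunny today it is cold today".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    # Returns 0.0 for unseen bigrams -- the poor-generalization problem above.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("it", "is"))       # 1.0 -- "it" is always followed by "is" here
print(bigram_prob("is", "sunny"))    # 0.5
print(bigram_prob("sunny", "cold"))  # 0.0 -- unseen bigram kills the whole sentence
```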

Page 5

Feedforward neural networks

Source: Bengio et al. 2003

Problem → assumes a fixed-size context; every prediction uses the same window length.
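A rough numpy sketch of the Bengio et al. (2003) idea, with made-up dimensions: the fixed number of context words are embedded, concatenated, and pushed through one hidden layer to a softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_ctx, h = 10, 8, 3, 16   # vocab size, embedding dim, context length, hidden size

C = rng.normal(size=(V, d))          # word embedding table
H = rng.normal(size=(h, n_ctx * d))  # concatenated context -> hidden
U = rng.normal(size=(V, h))          # hidden -> vocabulary logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_ids):
    # Exactly n_ctx previous word ids -- the fixed-context limitation noted above.
    x = C[context_ids].reshape(-1)   # concatenate the n_ctx embeddings
    hidden = np.tanh(H @ x)
    return softmax(U @ hidden)       # P(w_t | w_{t-3}, w_{t-2}, w_{t-1})

probs = next_word_probs([1, 4, 7])
print(probs.shape, probs.sum())      # (10,) ~1.0
```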

Page 6

Recurrent neural networks

s_t = f(U x_t + W s_{t-1})

s_t = g(s_0, x_t, x_{t-1}, \ldots, x_1) \quad \text{with} \quad g = f(f(f(\cdots)))

o_t = \mathrm{softmax}(V s_t)

Problem → vanishing gradients, hidden state has limited capacity.
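The recurrence above fits in a few lines of numpy; all dimensions and weights here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10, 8, 16                # vocab size, embedding dim, state size
U = rng.normal(size=(h, d))        # input -> state
W = rng.normal(size=(h, h))        # state -> state
Vout = rng.normal(size=(V, h))     # state -> output logits
E = rng.normal(size=(V, d))        # embedding table

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(h)                    # s_0
for word_id in [1, 4, 7, 2]:
    x = E[word_id]
    s = np.tanh(U @ x + W @ s)     # s_t = f(U x_t + W s_{t-1})
    o = softmax(Vout @ s)          # o_t = softmax(V s_t)
print(o.shape)                     # (10,)
```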

Page 7

Current challenges and Motivation

“The U.S. presidential race isn’t only drawing attention and controversy in the United States – it’s being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next President.”

Intuition → syntax is local, semantics is global.

Page 8

Interlude

Probabilistic Topic Models

Page 9

Probabilistic Topic Models

Source: David Blei

Page 10

Probabilistic Topic Models

Source: David Blei

Page 11

Probabilistic Topic Models

Source: David Blei

Page 12

Challenge

How can we combine ideas from probabilistic topic models and recurrent neural network-based language models to capture both local and global dependencies?

Page 13

Existing Approach

Page 14

Contextual RNN

Source: Mikolov 2012

s_t = g_1(U x_t + W s_{t-1} + F f(t))

y_t = \mathrm{softmax}(V s_t + G f(t))
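Compared with the plain RNN step, the only change is the extra context feature f(t). A sketch assuming f(t) is a per-document LDA-style topic vector (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, k = 10, 8, 16, 5          # vocab, embedding, state, context-feature dims
U, W = rng.normal(size=(h, d)), rng.normal(size=(h, h))
Vout = rng.normal(size=(V, h))
F, G = rng.normal(size=(h, k)), rng.normal(size=(V, k))  # context projections
E = rng.normal(size=(V, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

f_t = rng.dirichlet(np.ones(k))    # stand-in for an LDA topic vector of the document
s = np.zeros(h)
for word_id in [1, 4, 7]:
    s = np.tanh(U @ E[word_id] + W @ s + F @ f_t)  # s_t = g1(U x_t + W s_{t-1} + F f(t))
    y = softmax(Vout @ s + G @ f_t)                # y_t = softmax(V s_t + G f(t))
print(y.argmax())
```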

Page 15

New Approach: TopicRNN

Page 16

TopicRNN: A Generative Model

[Figure: the TopicRNN model, unrolled architecture, with weight matrices U, W, V, and B repeated across time steps]

h_t = g_1(U x_t + W h_{t-1})

y_t = \mathrm{softmax}(V h_t + (1 - l_t) B\theta) \quad \text{and} \quad l_t \sim \mathrm{Bernoulli}(\sigma(\Gamma h_t))
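A sketch of one generative step, assuming σ denotes the sigmoid that maps Γh_t to a valid Bernoulli parameter: the stop-word indicator l_t gates the topic contribution Bθ in and out of the logits. Names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, K = 10, 8, 16, 4          # vocab, embedding, state, number of topics
U, W = rng.normal(size=(h, d)), rng.normal(size=(h, h))
Vout = rng.normal(size=(V, h))
B = rng.normal(size=(V, K))        # topic-word matrix
Gamma = rng.normal(size=h)         # stop-word indicator weights
E = rng.normal(size=(V, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta = rng.normal(size=K)         # document topic vector, drawn once per document
ht = np.zeros(h)
for word_id in [1, 4, 7]:
    ht = np.tanh(U @ E[word_id] + W @ ht)             # h_t = g1(U x_t + W h_{t-1})
    lt = rng.binomial(1, sigmoid(Gamma @ ht))         # l_t ~ Bernoulli(sigma(Gamma h_t))
    yt = softmax(Vout @ ht + (1 - lt) * (B @ theta))  # topic term gated off for stop words
print(yt.shape)                    # (10,)
```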

Page 17

TopicRNN: A Generative Model

[Figure: end-to-end architecture. The recognition network takes X_c (bag-of-words, stop words excluded); the RNN, with weights U, W, V, and B, reads X (full document, stop words included) and predicts Y (target document).]

Inference ... end-to-end architecture

q(θ|X_c) is the recognition network (an MLP) that embeds the bag-of-words representation of the document.
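A sketch of such a recognition network, assuming q(θ|X_c) is Gaussian with mean and log-variance produced by a one-hidden-layer MLP over the stop-word-free bag-of-words counts (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Vc, h, K = 20, 32, 4               # content-vocab size, MLP hidden size, topics

W1, b1 = rng.normal(size=(h, Vc)), np.zeros(h)
W_mu, W_logvar = rng.normal(size=(K, h)), rng.normal(size=(K, h))

def recognition_network(bow_counts):
    # q(theta | X_c) = N(mu, diag(exp(logvar))), amortized over documents.
    e = np.maximum(0.0, W1 @ bow_counts + b1)   # ReLU hidden layer
    return W_mu @ e, W_logvar @ e               # mu, log sigma^2

bow = rng.integers(0, 3, size=Vc).astype(float)          # toy stop-word-free counts
mu, logvar = recognition_network(bow)
theta = mu + np.exp(0.5 * logvar) * rng.normal(size=K)   # reparameterized sample
print(theta)
```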

Page 18

Maximum Likelihood

Ideally, maximize the log marginal likelihood of the observed sequences y_{1:T} and l_{1:T}:

\log p(y_{1:T}, l_{1:T} \mid h_t) = \log \int p(\theta) \prod_{t=1}^{T} p(y_t \mid h_t, l_t, \theta)\, p(l_t \mid h_t)\, d\theta.

Problem → intractable ...

Page 19

Variational Objectives: ELBO

\mathrm{ELBO}(\Theta) = \mathbb{E}_{q(\theta \mid X_c)}\left[\sum_{t=1}^{T} \log p(y_t, l_t \mid h_t, \theta)\right] - \mathrm{KL}\big(q(\theta \mid X_c)\,\|\,p(\theta)\big)

\mathrm{ELBO}(\Theta) \leq \log p(y_{1:T}, l_{1:T} \mid h_t, \Theta)

Maximize a lower bound to the log marginal likelihood

Learning: end-to-end via backpropagation using ELBO and Adam
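A sketch of a single-sample Monte Carlo ELBO under these assumptions, with a Gaussian q, a standard-normal prior p(θ), and the RNN log-likelihood term left as a stub:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss_std_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), closed form.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def log_lik(theta):
    # Stub for sum_t log p(y_t, l_t | h_t, theta); a real model runs the RNN here.
    return -np.sum(theta**2)

def elbo(mu, logvar):
    eps = rng.normal(size=mu.shape)
    theta = mu + np.exp(0.5 * logvar) * eps   # one reparameterized sample of theta
    return log_lik(theta) - kl_gauss_std_normal(mu, logvar)

mu, logvar = np.zeros(4), np.zeros(4)
print(elbo(mu, logvar))  # maximized end-to-end with Adam via backprop in practice
```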

Page 20

Empirical evidence

Page 21

Promising Results on Word Prediction

10 Neurons          Valid   Test
RNN (no features)   239.2   225.0
RNN (LDA features)  197.3   187.4
TopicRNN            184.5   172.2
TopicLSTM           188.0   175.0
TopicGRU            178.3   166.7

100 Neurons         Valid   Test
RNN (no features)   150.1   142.1
RNN (LDA features)  132.3   126.4
TopicRNN            128.5   122.3
TopicLSTM           126.0   118.1
TopicGRU            118.3   112.4

300 Neurons         Valid   Test
RNN (no features)   –       124.7
RNN (LDA features)  –       113.7
TopicRNN            118.3   112.2
TopicLSTM           104.1   99.5
TopicGRU            99.6    97.3

Perplexity scores on PTB for different network sizes and models.

Page 22

Inferred Topics

Law        Company        Parties      Trading     Cars
law        spending       democratic   stock       gm
lawyers    sales          republicans  sp          auto
judge      advertising    gop          price       ford
rights     employees      republican   investor    jaguar
attorney   state          senate       standard    car
court      taxes          oakland      chairman    cars
general    fiscal         highway      investors   headquarters
common     appropriation  democrats    retirement  british
mr         budget         bill         holders     executives
insurance  ad             district     merrill     model

Table: Five topics from the TopicRNN model

Page 23

Inferred Document Distributions

[Figure: three bar charts titled “Inferred Topic Distribution from TopicGRU”, one per document; x-axis: topic index (0–50), y-axis: topic proportion]

Figure: Inferred topic distributions using TopicGRU on three different documents. The content of these documents is given in the appendix. Different topics are picked up depending on the input document.

Page 24

Unsupervised Feature Extraction

Clustered learned features from IMDB 100K movie reviews.

Page 25

Promising Results on Sentiment Classification

Model                                                       Reported classification error rate
BoW (bnc) (Maas et al., 2011)                               12.20%
BoW (b∆tc) (Maas et al., 2011)                              11.77%
LDA (Maas et al., 2011)                                     32.58%
Full + BoW (Maas et al., 2011)                              11.67%
Full + Unlabelled + BoW (Maas et al., 2011)                 11.11%
WRRBM (Dahl et al., 2012)                                   12.58%
WRRBM + BoW (bnc) (Dahl et al., 2012)                       10.77%
MNB-uni (Wang & Manning, 2012)                              16.45%
MNB-bi (Wang & Manning, 2012)                               13.41%
SVM-uni (Wang & Manning, 2012)                              13.05%
SVM-bi (Wang & Manning, 2012)                               10.84%
NBSVM-uni (Wang & Manning, 2012)                            11.71%
seq2-bown-CNN (Johnson & Zhang, 2014)                       14.70%
NBSVM-bi (Wang & Manning, 2012)                             8.78%
Paragraph Vector (Le & Mikolov, 2014)                       7.42%
SA-LSTM with joint training (Dai & Le, 2015)                14.70%
LSTM with tuning and dropout (Dai & Le, 2015)               13.50%
LSTM initialized with word2vec embeddings (Dai & Le, 2015)  10.00%
SA-LSTM with linear gain (Dai & Le, 2015)                   9.17%
LM-TM (Dai & Le, 2015)                                      7.64%
SA-LSTM (Dai & Le, 2015)                                    7.24%
TopicRNN                                                    6.28%

Page 26

Remaining challenges

1. Learning to encode and decode rare words well.

2. Better capture of long-term dependencies.