New Developments in Neural Language Modeling
Adji Bousso Dieng
Joint work with Chong Wang, Jianfeng Gao, and John Paisley
Language modeling applications
Language modeling
• Denote by w_1, ..., w_n a sequence of n words.
• Example: (it, is, sunny, today)
• A language model computes → P(w_1, ..., w_n)
• The chain rule of probability tells us that

P(w_1, ..., w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{1:i-1})

Goal → compute these conditional probabilities.
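As a toy illustration of the chain rule, scoring a sequence reduces to summing log conditional probabilities. This minimal sketch is not from the talk; `cond_prob` is a hypothetical stand-in for any model of P(w_i | w_{1:i-1}).

```python
import numpy as np

# Minimal sketch: score a sequence with any model that returns
# P(w_i | w_1:i-1). `cond_prob` is a hypothetical stand-in.
def sequence_log_prob(words, cond_prob):
    """Sum log P(w_i | w_1:i-1) over the sequence (chain rule)."""
    return sum(np.log(cond_prob(words[:i], words[i])) for i in range(len(words)))

# Example with a toy conditional distribution (uniform over a 4-word vocab):
vocab = ["it", "is", "sunny", "today"]
uniform = lambda history, w: 1.0 / len(vocab)
print(sequence_log_prob(["it", "is", "sunny", "today"], uniform))  # 4 * log(1/4)
```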
N-grams
Unigram: Independence assumption (bag-of-words model)
P(w_1, ..., w_n) = ∏_{i=1}^{n} P(w_i)
N-gram: Markov assumption of order N
P(w_1, ..., w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{i-N+1:i-1})
Learn model with maximum likelihood
Problem → poor generalization, curse of dimensionality.
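For concreteness, here is a minimal sketch of maximum-likelihood estimation for a bigram model (N = 2); the toy corpus is invented. It also makes the generalization problem visible: any word pair unseen in training gets probability zero.

```python
from collections import Counter

# Minimal sketch of maximum-likelihood bigram estimation (N = 2):
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
corpus = ["it", "is", "sunny", "today", "it", "is", "rainy", "today"]
unigrams = Counter(corpus[:-1])              # counts of conditioning words
bigrams  = Counter(zip(corpus, corpus[1:]))  # counts of adjacent pairs

def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("it", "is"))     # 1.0
print(bigram_prob("is", "sunny"))  # 0.5 -- unseen pairs get 0: poor generalization
```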
Feedforward neural networks
Source: Bengio et al. 2003
Problem → assumes a fixed-size context window; every prediction uses the same, limited history length.
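A rough sketch of the forward pass of a Bengio-style feedforward language model; all dimensions are illustrative, not from the talk. Embeddings of a fixed-size context window are concatenated and passed through one hidden layer.

```python
import numpy as np

# Sketch of a Bengio-style feedforward language model forward pass.
V, d, n_ctx, h = 1000, 32, 3, 64           # vocab size, embedding dim, context size, hidden dim
C = np.random.randn(V, d) * 0.01           # word embedding table
H = np.random.randn(h, n_ctx * d) * 0.01   # hidden layer weights
U = np.random.randn(V, h) * 0.01           # output weights

def forward(context_ids):
    """P(w_t | w_{t-3}, w_{t-2}, w_{t-1}) from a fixed-size context window."""
    x = C[context_ids].reshape(-1)          # concatenate the n_ctx embeddings
    a = np.tanh(H @ x)                      # hidden representation
    logits = U @ a
    p = np.exp(logits - logits.max())
    return p / p.sum()                      # softmax over the vocabulary

print(forward([1, 42, 7]).shape)            # (1000,) -- one probability per word
```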
Recurrent neural networks
s_t = f(U x_t + W s_{t-1})

s_t = g(s_0, x_t, x_{t-1}, ..., x_1), where g is the repeated composition f(f(f(···)))

o_t = softmax(V s_t)
Problem → vanishing gradients, hidden state has limited capacity.
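A minimal numpy sketch of this recurrence, with illustrative dimensions; f is taken to be tanh, as is common.

```python
import numpy as np

# Minimal sketch of s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t).
V_size, d, h = 1000, 32, 64
U = np.random.randn(h, d) * 0.01
W = np.random.randn(h, h) * 0.01
V = np.random.randn(V_size, h) * 0.01

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)     # f = tanh
    logits = V @ s_t
    p = np.exp(logits - logits.max())
    return s_t, p / p.sum()                 # o_t = softmax(V s_t)

s = np.zeros(h)                             # s_0
for x_t in np.random.randn(5, d):           # a toy 5-step input sequence
    s, o_t = rnn_step(x_t, s)
# Backprop multiplies by W^T f'(.) at every step, so gradients shrink
# (or blow up) over long spans -- the vanishing-gradient problem.
```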
Current challenges and motivation
“The U.S. presidential race isn’t only drawing attention and controversy in the United States – it’s being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next President.”
Intuition → syntax is local, semantics are global.
Interlude
Probabilistic Topic Models
Source: David Blei
Challenge
How can we combine ideas from probabilistic topic models and recurrent neural network-based language models to capture both local and global dependencies?
Existing Approach
Contextual RNN
Source: Mikolov 2012
s_t = g_1(U x_t + W s_{t-1} + F f(t))

y_t = softmax(V s_t + G f(t))
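A sketch of one contextual-RNN step under these equations: the context feature f(t) (e.g., topic proportions of recent text) enters both the state and the output. Dimensions and the choice g_1 = tanh are illustrative assumptions.

```python
import numpy as np

# Sketch of a contextual RNN step with a context feature f_t.
d, h, K, V_size = 32, 64, 10, 1000
U, W = np.random.randn(h, d) * 0.01, np.random.randn(h, h) * 0.01
F = np.random.randn(h, K) * 0.01            # context -> state
V, G = np.random.randn(V_size, h) * 0.01, np.random.randn(V_size, K) * 0.01

def contextual_rnn_step(x_t, s_prev, f_t):
    s_t = np.tanh(U @ x_t + W @ s_prev + F @ f_t)   # s_t = g1(Ux_t + Ws_{t-1} + Ff(t))
    logits = V @ s_t + G @ f_t                      # y_t = softmax(Vs_t + Gf(t))
    p = np.exp(logits - logits.max())
    return s_t, p / p.sum()
```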
New Approach: TopicRNN
TopicRNN: A Generative Model
[Figure: the unrolled TopicRNN architecture; the weight matrices U, W, V, and B are shared across all time steps.]

Model → unrolled architecture
h_t = g_1(U x_t + W h_{t-1})

y_t = softmax(V h_t + (1 − l_t) B θ), with l_t ∼ Bernoulli(σ(Γ h_t))
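A sketch of one TopicRNN generative step under these equations, assuming a sigmoid squashes Γh_t into a valid Bernoulli probability and g_1 = tanh; all dimensions are illustrative. The topic vector θ shifts the output logits only when the word is not a stop word.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of one TopicRNN generative step.
d, h, K, V_size = 32, 64, 10, 1000
U, W = rng.normal(0, 0.01, (h, d)), rng.normal(0, 0.01, (h, h))
V = rng.normal(0, 0.01, (V_size, h))
B = rng.normal(0, 0.01, (V_size, K))        # topic-to-word matrix
Gamma = rng.normal(0, 0.01, h)              # stop-word gate parameters

def topic_rnn_step(x_t, h_prev, theta):
    h_t = np.tanh(U @ x_t + W @ h_prev)                    # h_t = g1(Ux_t + Wh_{t-1})
    l_t = rng.binomial(1, 1 / (1 + np.exp(-Gamma @ h_t)))  # stop-word indicator
    logits = V @ h_t + (1 - l_t) * (B @ theta)             # theta skipped for stop words
    p = np.exp(logits - logits.max())
    return h_t, l_t, p / p.sum()

theta = rng.normal(size=K)  # toy draw; TopicRNN places a Gaussian prior on theta
h_t, l_t, p_t = topic_rnn_step(rng.normal(size=d), np.zeros(h), theta)
```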
TopicRNN: A Generative Model

[Figure: end-to-end inference architecture. X_c is the bag-of-words input (stop words excluded), X is the full document (stop words included), and Y is the target document; the RNN carries the weights U, V, B, W.]

Inference → end-to-end architecture
q(θ|X_c) is the recognition network, an MLP that embeds the bag-of-words representation of the document.
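A sketch of such a recognition network, assuming q(θ|X_c) is Gaussian with mean and log-variance produced by a one-hidden-layer MLP and sampled via the reparameterization trick; layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the recognition network q(theta | Xc): an MLP mapping the
# stop-word-free bag-of-words Xc to the mean and log-variance of a
# Gaussian over theta, sampled with the reparameterization trick.
V_size, hid, K = 1000, 128, 10
W1 = rng.normal(0, 0.01, (hid, V_size))
W_mu, W_logvar = rng.normal(0, 0.01, (K, hid)), rng.normal(0, 0.01, (K, hid))

def recognition_network(x_c):
    e = np.tanh(W1 @ x_c)                    # embed the bag-of-words vector
    mu, logvar = W_mu @ e, W_logvar @ e
    eps = rng.normal(size=K)
    theta = mu + np.exp(0.5 * logvar) * eps  # reparameterized sample of theta
    return theta, mu, logvar

x_c = rng.poisson(0.05, V_size).astype(float)  # toy word-count vector
theta, mu, logvar = recognition_network(x_c)
```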
Maximum Likelihood
Ideally, maximize the log marginal likelihood of the observed sequences y_{1:T}, l_{1:T}:
log p(y_{1:T}, l_{1:T} | h_t) = log ∫ p(θ) ∏_{t=1}^{T} p(y_t | h_t, l_t, θ) p(l_t | h_t) dθ.
Problem → intractable ...
Variational Objectives: ELBO
ELBO(Θ) = E_{q(θ|X_c)} [ ∑_{t=1}^{T} log p(y_t, l_t | h_t, θ) ] − KL(q(θ|X_c) || p(θ))

ELBO(Θ) ≤ log p(y_{1:T}, l_{1:T} | h_t, Θ)
Maximize a lower bound to the log marginal likelihood
Learning: end-to-end via backpropagation using ELBO and Adam
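A sketch of a one-sample Monte Carlo ELBO estimate, assuming a standard Gaussian prior on θ and a diagonal Gaussian q(θ|X_c), so the KL term has a closed form; `log_lik` stands in for the summed log-likelihood term ∑_t log p(y_t, l_t | h_t, θ) computed by the RNN forward pass.

```python
import numpy as np

# One-sample Monte Carlo ELBO estimate, assuming p(theta) = N(0, I) and
# q(theta | Xc) = N(mu, diag(exp(logvar))), so KL(q || p) is closed-form.
def elbo_estimate(log_lik, mu, logvar):
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)  # KL(q || N(0, I))
    return log_lik - kl

# Training loop idea: draw theta ~ q via the reparameterization trick, run
# the RNN to get log_lik, and ascend the ELBO with Adam.
```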
Empirical evidence
Promising Results on Word Prediction
10 Neurons          Valid   Test
RNN (no features)   239.2   225.0
RNN (LDA features)  197.3   187.4
TopicRNN            184.5   172.2
TopicLSTM           188.0   175.0
TopicGRU            178.3   166.7

100 Neurons         Valid   Test
RNN (no features)   150.1   142.1
RNN (LDA features)  132.3   126.4
TopicRNN            128.5   122.3
TopicLSTM           126.0   118.1
TopicGRU            118.3   112.4

300 Neurons         Valid   Test
RNN (no features)   –       124.7
RNN (LDA features)  –       113.7
TopicRNN            118.3   112.2
TopicLSTM           104.1   99.5
TopicGRU            99.6    97.3
Perplexity scores on PTB for different network sizes and models.
Inferred Topics
Law        Company        Parties      Trading     Cars
law        spending       democratic   stock       gm
lawyers    sales          republicans  sp          auto
judge      advertising    gop          price       ford
rights     employees      republican   investor    jaguar
attorney   state          senate       standard    car
court      taxes          oakland      chairman    cars
general    fiscal         highway      investors   headquarters
common     appropriation  democrats    retirement  british
mr         budget         bill         holders     executives
insurance  ad             district     merrill     model
Table: Five Topics from the TopicRNN Model
Inferred Document Distributions
[Figure: three panels, each titled "Inferred Topic Distribution from TopicGRU"; x-axis: topic index (0–50), y-axis: topic proportion.]
Figure: Inferred topic distributions using TopicGRU on three different documents. The content of these documents is given in the appendix. This shows that different topics are picked up depending on the input document.
Unsupervised Feature Extraction
Clustered learned features from IMDB 100K movie reviews.
Promising Results on Sentiment Classification
Model                                                        Reported error rate
BoW (bnc) (Maas et al., 2011)                                12.20%
BoW (b∆t′c) (Maas et al., 2011)                              11.77%
LDA (Maas et al., 2011)                                      32.58%
Full + BoW (Maas et al., 2011)                               11.67%
Full + Unlabelled + BoW (Maas et al., 2011)                  11.11%
WRRBM (Dahl et al., 2012)                                    12.58%
WRRBM + BoW (bnc) (Dahl et al., 2012)                        10.77%
MNB-uni (Wang & Manning, 2012)                               16.45%
MNB-bi (Wang & Manning, 2012)                                13.41%
SVM-uni (Wang & Manning, 2012)                               13.05%
SVM-bi (Wang & Manning, 2012)                                10.84%
NBSVM-uni (Wang & Manning, 2012)                             11.71%
seq2-bown-CNN (Johnson & Zhang, 2014)                        14.70%
NBSVM-bi (Wang & Manning, 2012)                              8.78%
Paragraph Vector (Le & Mikolov, 2014)                        7.42%
SA-LSTM with joint training (Dai & Le, 2015)                 14.70%
LSTM with tuning and dropout (Dai & Le, 2015)                13.50%
LSTM initialized with word2vec embeddings (Dai & Le, 2015)   10.00%
SA-LSTM with linear gain (Dai & Le, 2015)                    9.17%
LM-LSTM (Dai & Le, 2015)                                     7.64%
SA-LSTM (Dai & Le, 2015)                                     7.24%
TopicRNN                                                     6.28%
Remaining challenges
1. Learning to encode and decode rare words well.
2. Better capturing of long-term dependencies.