New Developments in Neural Language Modeling
Adji Bousso Dieng
Joint work with Chong Wang, Jianfeng Gao, and John Paisley
Language modeling applications
Language modeling
• Denote by w_1, ..., w_n a sequence of n words.
• Example: (it, is, sunny, today)
• A language model computes → P(w_1, ..., w_n)
• The chain rule of probability tells us that

P(w_1, ..., w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{1:i-1})

Goal → compute these conditional probabilities.
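As a toy illustration of the chain rule, scoring a sequence reduces to summing log conditional probabilities. This minimal sketch is not from the talk; `cond_prob` is a hypothetical stand-in for any model of P(w_i | w_{1:i-1}).

```python
import numpy as np

# Minimal sketch: score a sequence with any model that returns
# P(w_i | w_1:i-1). `cond_prob` is a hypothetical stand-in.
def sequence_log_prob(words, cond_prob):
    """Sum log P(w_i | w_1:i-1) over the sequence (chain rule)."""
    return sum(np.log(cond_prob(words[:i], words[i])) for i in range(len(words)))

# Example with a toy conditional distribution (uniform over a 4-word vocab):
vocab = ["it", "is", "sunny", "today"]
uniform = lambda history, w: 1.0 / len(vocab)
print(sequence_log_prob(["it", "is", "sunny", "today"], uniform))  # 4 * log(1/4)
```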
N-grams
Unigram: Independence assumption (bag-of-words model)
P(w_1, ..., w_n) = ∏_{i=1}^{n} P(w_i)
N-gram: Markov assumption of order N
P(w_1, ..., w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{i-N+1:i-1})
Learn model with maximum likelihood
Problem → poor generalization, curse of dimensionality.
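For concreteness, here is a minimal sketch of maximum-likelihood estimation for a bigram model (N = 2); the toy corpus is invented. It also makes the generalization problem visible: any word pair unseen in training gets probability zero.

```python
from collections import Counter

# Minimal sketch of maximum-likelihood bigram estimation (N = 2):
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
corpus = ["it", "is", "sunny", "today", "it", "is", "rainy", "today"]
unigrams = Counter(corpus[:-1])              # counts of conditioning words
bigrams  = Counter(zip(corpus, corpus[1:]))  # counts of adjacent pairs

def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("it", "is"))     # 1.0
print(bigram_prob("is", "sunny"))  # 0.5 -- unseen pairs get 0: poor generalization
```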
Feedforward neural networks
Source: Bengio et al. 2003
Problem → assumes a fixed-size context window; every prediction uses the same, limited history length.
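A rough sketch of the forward pass of a Bengio-style feedforward language model; all dimensions are illustrative, not from the talk. Embeddings of a fixed-size context window are concatenated and passed through one hidden layer.

```python
import numpy as np

# Sketch of a Bengio-style feedforward language model forward pass.
V, d, n_ctx, h = 1000, 32, 3, 64           # vocab size, embedding dim, context size, hidden dim
C = np.random.randn(V, d) * 0.01           # word embedding table
H = np.random.randn(h, n_ctx * d) * 0.01   # hidden layer weights
U = np.random.randn(V, h) * 0.01           # output weights

def forward(context_ids):
    """P(w_t | w_{t-3}, w_{t-2}, w_{t-1}) from a fixed-size context window."""
    x = C[context_ids].reshape(-1)          # concatenate the n_ctx embeddings
    a = np.tanh(H @ x)                      # hidden representation
    logits = U @ a
    p = np.exp(logits - logits.max())
    return p / p.sum()                      # softmax over the vocabulary

print(forward([1, 42, 7]).shape)            # (1000,) -- one probability per word
```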
Recurrent neural networks
s_t = f(U x_t + W s_{t-1})

s_t = g(s_0, x_t, x_{t-1}, ..., x_1), where g is the repeated composition f(f(f(···)))

o_t = softmax(V s_t)
Problem → vanishing gradients, hidden state has limited capacity.
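A minimal numpy sketch of this recurrence, with illustrative dimensions; f is taken to be tanh, as is common.

```python
import numpy as np

# Minimal sketch of s_t = f(U x_t + W s_{t-1}), o_t = softmax(V s_t).
V_size, d, h = 1000, 32, 64
U = np.random.randn(h, d) * 0.01
W = np.random.randn(h, h) * 0.01
V = np.random.randn(V_size, h) * 0.01

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)     # f = tanh
    logits = V @ s_t
    p = np.exp(logits - logits.max())
    return s_t, p / p.sum()                 # o_t = softmax(V s_t)

s = np.zeros(h)                             # s_0
for x_t in np.random.randn(5, d):           # a toy 5-step input sequence
    s, o_t = rnn_step(x_t, s)
# Backprop multiplies by W^T f'(.) at every step, so gradients shrink
# (or blow up) over long spans -- the vanishing-gradient problem.
```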
Current challenges and motivation
“The U.S. presidential race isn’t only drawing attention and controversy in the United States – it’s being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next President.”
Intuition → syntax is local, semantics are global.
Interlude
Probabilistic Topic Models
Source: David Blei
Challenge
How can we combine ideas from probabilistic topic models and recurrent neural network-based language models to capture both local and global dependencies?
Existing Approach
Contextual RNN
Source: Mikolov 2012
s_t = g_1(U x_t + W s_{t-1} + F f(t))

y_t = softmax(V s_t + G f(t))
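A sketch of one contextual-RNN step under these equations: the context feature f(t) (e.g., topic proportions of recent text) enters both the state and the output. Dimensions and the choice g_1 = tanh are illustrative assumptions.

```python
import numpy as np

# Sketch of a contextual RNN step with a context feature f_t.
d, h, K, V_size = 32, 64, 10, 1000
U, W = np.random.randn(h, d) * 0.01, np.random.randn(h, h) * 0.01
F = np.random.randn(h, K) * 0.01            # context -> state
V, G = np.random.randn(V_size, h) * 0.01, np.random.randn(V_size, K) * 0.01

def contextual_rnn_step(x_t, s_prev, f_t):
    s_t = np.tanh(U @ x_t + W @ s_prev + F @ f_t)   # s_t = g1(Ux_t + Ws_{t-1} + Ff(t))
    logits = V @ s_t + G @ f_t                      # y_t = softmax(Vs_t + Gf(t))
    p = np.exp(logits - logits.max())
    return s_t, p / p.sum()
```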
New Approach: TopicRNN
TopicRNN: A Generative Model
[Figure: the unrolled TopicRNN architecture; the weight matrices U, W, V, and B are shared across all time steps.]

Model → unrolled architecture
h_t = g_1(U x_t + W h_{t-1})

y_t = softmax(V h_t + (1 − l_t) B θ), with l_t ∼ Bernoulli(σ(Γ h_t))
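A sketch of one TopicRNN generative step under these equations, assuming a sigmoid squashes Γh_t into a valid Bernoulli probability and g_1 = tanh; all dimensions are illustrative. The topic vector θ shifts the output logits only when the word is not a stop word.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of one TopicRNN generative step.
d, h, K, V_size = 32, 64, 10, 1000
U, W = rng.normal(0, 0.01, (h, d)), rng.normal(0, 0.01, (h, h))
V = rng.normal(0, 0.01, (V_size, h))
B = rng.normal(0, 0.01, (V_size, K))        # topic-to-word matrix
Gamma = rng.normal(0, 0.01, h)              # stop-word gate parameters

def topic_rnn_step(x_t, h_prev, theta):
    h_t = np.tanh(U @ x_t + W @ h_prev)                    # h_t = g1(Ux_t + Wh_{t-1})
    l_t = rng.binomial(1, 1 / (1 + np.exp(-Gamma @ h_t)))  # stop-word indicator
    logits = V @ h_t + (1 - l_t) * (B @ theta)             # theta skipped for stop words
    p = np.exp(logits - logits.max())
    return h_t, l_t, p / p.sum()

theta = rng.normal(size=K)  # toy draw; TopicRNN places a Gaussian prior on theta
h_t, l_t, p_t = topic_rnn_step(rng.normal(size=d), np.zeros(h), theta)
```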
TopicRNN: A Generative Model

[Figure: end-to-end inference architecture. X_c is the bag-of-words input (stop words excluded), X is the full document (stop words included), and Y is the target document; the RNN carries the weights U, V, B, W.]

Inference → end-to-end architecture
q(θ|X_c) is the recognition network, an MLP that embeds the bag-of-words representation of the document.
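A sketch of such a recognition network, assuming q(θ|X_c) is Gaussian with mean and log-variance produced by a one-hidden-layer MLP and sampled via the reparameterization trick; layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the recognition network q(theta | Xc): an MLP mapping the
# stop-word-free bag-of-words Xc to the mean and log-variance of a
# Gaussian over theta, sampled with the reparameterization trick.
V_size, hid, K = 1000, 128, 10
W1 = rng.normal(0, 0.01, (hid, V_size))
W_mu, W_logvar = rng.normal(0, 0.01, (K, hid)), rng.normal(0, 0.01, (K, hid))

def recognition_network(x_c):
    e = np.tanh(W1 @ x_c)                    # embed the bag-of-words vector
    mu, logvar = W_mu @ e, W_logvar @ e
    eps = rng.normal(size=K)
    theta = mu + np.exp(0.5 * logvar) * eps  # reparameterized sample of theta
    return theta, mu, logvar

x_c = rng.poisson(0.05, V_size).astype(float)  # toy word-count vector
theta, mu, logvar = recognition_network(x_c)
```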
Maximum Likelihood
Ideally, maximize the log marginal likelihood of the observed sequences y_{1:T}, l_{1:T}:
log p(y_{1:T}, l_{1:T} | h_t) = log ∫ p(θ) ∏_{t=1}^{T} p(y_t | h_t, l_t, θ) p(l_t | h_t) dθ.
Problem → intractable ...
Variational Objectives: ELBO
ELBO(Θ) = E_{q(θ|X_c)} [ ∑_{t=1}^{T} log p(y_t, l_t | h_t, θ) ] − KL(q(θ|X_c) || p(θ))

ELBO(Θ) ≤ log p(y_{1:T}, l_{1:T} | h_t, Θ)
Maximize a lower bound to the log marginal likelihood
Learning: end-to-end via backpropagation using ELBO and Adam
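A sketch of a one-sample Monte Carlo ELBO estimate, assuming a standard Gaussian prior on θ and a diagonal Gaussian q(θ|X_c), so the KL term has a closed form; `log_lik` stands in for the summed log-likelihood term ∑_t log p(y_t, l_t | h_t, θ) computed by the RNN forward pass.

```python
import numpy as np

# One-sample Monte Carlo ELBO estimate, assuming p(theta) = N(0, I) and
# q(theta | Xc) = N(mu, diag(exp(logvar))), so KL(q || p) is closed-form.
def elbo_estimate(log_lik, mu, logvar):
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)  # KL(q || N(0, I))
    return log_lik - kl

# Training loop idea: draw theta ~ q via the reparameterization trick, run
# the RNN to get log_lik, and ascend the ELBO with Adam.
```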
Empirical evidence
Promising Results on Word Prediction
10 Neurons          Valid   Test
RNN (no features)   239.2   225.0
RNN (LDA features)  197.3   187.4
TopicRNN            184.5   172.2
TopicLSTM           188.0   175.0
TopicGRU            178.3   166.7

100 Neurons         Valid   Test
RNN (no features)   150.1   142.1
RNN (LDA features)  132.3   126.4
TopicRNN            128.5   122.3
TopicLSTM           126.0   118.1
TopicGRU            118.3   112.4

300 Neurons         Valid   Test
RNN (no features)   –       124.7
RNN (LDA features)  –       113.7
TopicRNN            118.3   112.2
TopicLSTM           104.1   99.5
TopicGRU            99.6    97.3
Perplexity scores on PTB for different network sizes and models.
Inferred Topics
Law        Company        Parties      Trading     Cars
law        spending       democratic   stock       gm
lawyers    sales          republicans  sp          auto
judge      advertising    gop          price       ford
rights     employees      republican   investor    jaguar
attorney   state          senate       standard    car
court      taxes          oakland      chairman    cars
general    fiscal         highway      investors   headquarters
common     appropriation  democrats    retirement  british
mr         budget         bill         holders     executives
insurance  ad             district     merrill     model
Table: Five Topics from the TopicRNN Model
Inferred Document Distributions
[Figure: three panels, each titled "Inferred Topic Distribution from TopicGRU"; x-axis: topic index (0–50), y-axis: topic proportion.]
Figure: Inferred topic distributions using TopicGRU on three different documents. The content of these documents is given in the appendix. This shows that different topics are picked up depending on the input document.
Unsupervised Feature Extraction
Clustered learned features from IMDB 100K movie reviews.
Promising Results on Sentiment Classification
Model                                                        Reported error rate
BoW (bnc) (Maas et al., 2011)                                12.20%
BoW (b∆t′c) (Maas et al., 2011)                              11.77%
LDA (Maas et al., 2011)                                      32.58%
Full + BoW (Maas et al., 2011)                               11.67%
Full + Unlabelled + BoW (Maas et al., 2011)                  11.11%
WRRBM (Dahl et al., 2012)                                    12.58%
WRRBM + BoW (bnc) (Dahl et al., 2012)                        10.77%
MNB-uni (Wang & Manning, 2012)                               16.45%
MNB-bi (Wang & Manning, 2012)                                13.41%
SVM-uni (Wang & Manning, 2012)                               13.05%
SVM-bi (Wang & Manning, 2012)                                10.84%
NBSVM-uni (Wang & Manning, 2012)                             11.71%
seq2-bown-CNN (Johnson & Zhang, 2014)                        14.70%
NBSVM-bi (Wang & Manning, 2012)                              8.78%
Paragraph Vector (Le & Mikolov, 2014)                        7.42%
SA-LSTM with joint training (Dai & Le, 2015)                 14.70%
LSTM with tuning and dropout (Dai & Le, 2015)                13.50%
LSTM initialized with word2vec embeddings (Dai & Le, 2015)   10.00%
SA-LSTM with linear gain (Dai & Le, 2015)                    9.17%
LM-LSTM (Dai & Le, 2015)                                     7.64%
SA-LSTM (Dai & Le, 2015)                                     7.24%
TopicRNN                                                     6.28%
Remaining challenges
1. Learning to encode and decode rare words well.
2. Better capturing of long-term dependencies.