
Generating Sentences from a Continuous Space

Samuel R. Bowman∗

NLP Group and Dept. of Linguistics, Stanford University

[email protected]

Luke Vilnis∗

CICS, University of Massachusetts Amherst

[email protected]

Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz & Samy Bengio
Google Brain

{vinyals, adai, rafalj, bengio}@google.com

Abstract

The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnn-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model’s latent sentence space, and present negative results on the use of the model in language modeling.

1 Introduction

Recurrent neural network language models (rnnlms, Mikolov et al., 2011) represent the state of the art in unsupervised generative modeling for natural language sentences. In supervised settings, rnnlm decoders conditioned on task-specific features are the state of the art in tasks like machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015; Mao et al., 2015; Donahue et al., 2015). The rnnlm generates sentences word-by-word based on an evolving distributed state representation, which makes it a probabilistic model with no significant independence assumptions, and makes it capable of modeling complex distributions over sequences, including those with long-term dependencies. However, by breaking the model structure down into a series of next-step predictions, the rnnlm does not expose an interpretable representation of global features like topic or of high-level syntactic properties.

∗ First two authors contributed equally. Work was done when all authors were at Google, Inc.

i went to the store to buy some groceries .
i store to buy some groceries .
i were to buy any groceries .
horses are to buy any groceries .
horses are to buy any animal .
horses the favorite any animal .
horses the favorite favorite animal .
horses are my favorite animal .

Table 1: Sentences produced by greedily decoding from points between two sentence encodings with a conventional autoencoder. The intermediate sentences are not plausible English.

We propose an extension of the rnnlm that is designed to explicitly capture such global features in a continuous latent variable. Naively, maximum likelihood learning in such a model presents an intractable inference problem. Drawing inspiration from recent successes in modeling images (Gregor et al., 2015), handwriting, and natural speech (Chung et al., 2015), our model circumvents these difficulties using the architecture of a variational autoencoder and takes advantage of recent advances in variational inference (Kingma and Welling, 2015; Rezende et al., 2014) that introduce a practical training technique for powerful neural network generative models with latent variables.

Our contributions are as follows: We propose a variational autoencoder architecture for text and discuss some of the obstacles to training it as well as our proposed solutions. We find that on a standard language modeling evaluation where a global variable is not explicitly needed, this model yields similar performance to existing rnnlms. We also evaluate our model using a larger corpus on the task of imputing missing words. For this task, we introduce a novel evaluation strategy using an adversarial classifier, sidestepping the issue of intractable likelihood computations by drawing inspiration from work on non-parametric two-sample tests and adversarial training. In this setting, our model’s global latent variable allows it to do well where simpler models fail. We finally introduce several qualitative techniques for analyzing the ability of our model to learn high level features of sentences. We find that they can produce diverse, coherent sentences through purely deterministic decoding and that they can interpolate smoothly between sentences.

2 Background

2.1 Unsupervised sentence encoding

A standard rnn language model predicts each word of a sentence conditioned on the previous word and an evolving hidden state. While effective, it does not learn a vector representation of the full sentence. In order to incorporate a continuous latent sentence representation, we first need a method to map between sentences and distributed representations that can be trained in an unsupervised setting. While no strong generative model is available for this problem, three non-generative techniques have shown promise: sequence autoencoders, skip-thought, and paragraph vector.

Sequence autoencoders have seen some success in pre-training sequence models for supervised downstream tasks (Dai and Le, 2015) and in generating complete documents (Li et al., 2015a). An autoencoder consists of an encoder function ϕ_enc and a probabilistic decoder model p(x|~z = ϕ_enc(x)), and maximizes the likelihood of an example x conditioned on ~z, the learned code for x. In the case of a sequence autoencoder, both encoder and decoder are rnns and examples are token sequences.

Standard autoencoders are not effective at extracting global semantic features. In Table 1, we present the results of computing a path or homotopy between the encodings for two sentences and decoding each intermediate code. The intermediate sentences are generally ungrammatical and do not transition smoothly from one to the other. This suggests that these models do not generally learn a smooth, interpretable feature system for sentence encoding. In addition, since these models do not incorporate a prior over ~z, they cannot be used to assign probabilities to sentences or to sample novel sentences.

Two other models have shown promise in learning sentence encodings, but cannot be used in a generative setting: Skip-thought models (Kiros et al., 2015) are unsupervised learning models that take the same model structure as a sequence autoencoder, but generate text conditioned on a neighboring sentence from the target text, instead of on the target sentence itself. Finally, paragraph vector models (Le and Mikolov, 2014) are non-recurrent sentence representation models. In a paragraph vector model, the encoding of a sentence is obtained by performing gradient-based inference on a prospective encoding vector with the goal of using it to predict the words in the sentence.

2.2 The variational autoencoder

The variational autoencoder (vae, Kingma and Welling, 2015; Rezende et al., 2014) is a generative model that is based on a regularized version of the standard autoencoder. This model imposes a prior distribution on the hidden codes ~z which enforces a regular geometry over codes and makes it possible to draw proper samples from the model using ancestral sampling.

The vae modifies the autoencoder architecture by replacing the deterministic function ϕ_enc with a learned posterior recognition model, q(~z|x). This model parametrizes an approximate posterior distribution over ~z (usually a diagonal Gaussian) with a neural network conditioned on x. Intuitively, the vae learns codes not as single points, but as soft ellipsoidal regions in latent space, forcing the codes to fill the space rather than memorizing the training data as isolated codes.

If the vae were trained with a standard autoencoder’s reconstruction objective, it would learn to encode its inputs deterministically by making the variances in q(~z|x) vanishingly small (Raiko et al., 2015). Instead, the vae uses an objective which encourages the model to keep its posterior distributions close to a prior p(~z), generally a standard Gaussian (µ = ~0, σ = ~1). Additionally, this objective is a valid lower bound on the true log likelihood of the data, making the vae a generative model. This objective takes the following form:

L(θ; x) = −kl(q_θ(~z|x) || p(~z)) + E_{q_θ(~z|x)}[log p_θ(x|~z)] ≤ log p(x) .    (1)

This forces the model to be able to decode plausible sentences from every point in the latent space that has a reasonable probability under the prior.

In the experiments presented below using vae models, we use diagonal Gaussians for the prior and posterior distributions p(~z) and q(~z|x), using the Gaussian reparameterization trick of Kingma and Welling (2015). We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from q(~z|x), but compute the kl divergence term of the cost function in closed form, again following Kingma and Welling (2015).
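
To make this training setup concrete, here is a minimal PyTorch-style sketch (an illustration, not the authors’ code) of the single-sample estimate of the objective in Equation 1, assuming a recognition network that outputs a posterior mean mu and log-variance logvar; the kl_weight argument anticipates the annealing scheme of Section 3.1.

    import torch

    def reparameterize(mu, logvar):
        """Gaussian reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def vae_loss(logits, targets, mu, logvar, kl_weight=1.0):
        """Single-sample Monte Carlo estimate of the negative of Eq. 1."""
        # Reconstruction term: cross entropy of the decoder's word predictions.
        recon = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="sum")
        # kl(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, in closed form.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl_weight * kl, recon, kl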

Figure 1: The core structure of our variational autoencoder language model. Words are represented using a learned dictionary of embedding vectors. (The figure shows encoding lstm cells feeding linear layers that produce µ and σ for the code ~z, followed by decoding lstm cells conditioned on ~z.)

3 A VAE for sentences

We adapt the variational autoencoder to text by using single-layer lstm rnns (Hochreiter and Schmidhuber, 1997) for both the encoder and the decoder, essentially forming a sequence autoencoder with the Gaussian prior acting as a regularizer on the hidden code. The decoder serves as a special rnn language model that is conditioned on this hidden code, and in the degenerate setting where the hidden code incorporates no useful information, this model is effectively equivalent to an rnnlm. The model is depicted in Figure 1, and is used in all of the experiments discussed below.
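
The sketch below shows one plausible way to wire this architecture up in PyTorch; it is an illustration rather than the authors’ implementation. The default layer sizes follow the Penn Treebank vae values in Table 11, and conditioning the decoder by initializing its hidden state from ~z through a single linear layer is an assumption (the paper also mentions concatenating ~z to the decoder input as a variant).

    import torch
    import torch.nn as nn

    class SentenceVAE(nn.Module):
        """Single-layer LSTM encoder and decoder with a Gaussian code (cf. Figure 1)."""
        def __init__(self, vocab_size, emb_dim=353, hidden_dim=191, z_dim=13):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.to_mu = nn.Linear(hidden_dim, z_dim)
            self.to_logvar = nn.Linear(hidden_dim, z_dim)
            self.z_to_state = nn.Linear(z_dim, hidden_dim)  # conditions the decoder on z
            self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, src, dec_inputs):
            _, (h, _) = self.encoder(self.embed(src))         # final encoder state
            mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            h0 = torch.tanh(self.z_to_state(z)).unsqueeze(0)  # decoder state from z
            dec_out, _ = self.decoder(self.embed(dec_inputs), (h0, torch.zeros_like(h0)))
            return self.out(dec_out), mu, logvar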

We explored several variations on this architecture, including concatenating the sampled ~z to the decoder input at every time step, using a softplus parametrization for the variance, and using deep feedforward networks between the encoder and latent variable and the decoder and latent variable. We noticed little difference in the model’s performance when using any of these variations. However, when including feedforward networks between the encoder and decoder we found that it is necessary to use highway network layers (Srivastava et al., 2015) for the model to learn. We discuss hyperparameter tuning in the appendix.

We also experimented with more sophisticated recognition models q(~z|x), including a multistep sampling model styled after draw (Gregor et al., 2015), and a posterior approximation using normalizing flows (Rezende and Mohamed, 2015). However, we were unable to reap significant gains over our plain vae.

While the strongest results with vaes to date have been on continuous domains like images, there has been some work on discrete sequences: a technique for doing this using rnn encoders and decoders, which shares the same high-level architecture as our model, was proposed under the name Variational Recurrent Autoencoder (vrae) for the modeling of music in Fabius and van Amersfoort (2014). While there has been other work on including continuous latent variables in rnn-style models for modeling speech, handwriting, and music (Bayer and Osendorfer, 2015; Chung et al., 2015), these models include separate latent variables per timestep and are unsuitable for our goal of modeling global features.

In a recent paper with goals similar to ours, Miao et al. (2015) introduce an effective VAE-based document-level language model that models texts as bags of words, rather than as sequences. They mention briefly that they have to train the encoder and decoder portions of the network in alternation rather than simultaneously, possibly as a way of addressing some of the issues that we discuss in Section 3.1.

3.1 Optimization challenges

Our model aims to learn global latent representations of sentence content. We can quantify the degree to which our model learns global features by looking at the variational lower bound objective (1). The bound breaks into two terms: the data likelihood under the posterior (expressed as cross entropy), and the kl divergence of the posterior from the prior. A model that encodes useful information in the latent variable ~z will have a non-zero kl divergence term and a relatively small cross entropy term. Straightforward implementations of our vae fail to learn this behavior: except in vanishingly rare cases, most training runs with most hyperparameters yield models that consistently set q(~z|x) equal to the prior p(~z), bringing the kl divergence term of the cost function to zero.

When the model does this, it is essentially behaving as an rnnlm. Because of this, it can express arbitrary distributions over the output sentences (albeit with a potentially awkward left-to-right factorization) and can thereby achieve likelihoods that are close to optimal. Previous work on vaes for image modeling (Kingma and Welling, 2015) used a much weaker independent pixel decoder model p(x|~z), forcing the model to use the global latent variable to achieve good likelihoods. In a related result, recent approaches to image generation that use lstm decoders are able to do well without vae-style global latent variables (Theis and Bethge, 2015).

This problematic tendency in learning is compounded by the lstm decoder’s sensitivity to subtle variation in the hidden states, such as that introduced by the posterior sampling process. This causes the model to initially learn to ignore ~z and go after low hanging fruit, explaining the data with the more easily optimized decoder. Once this has happened, the decoder ignores the encoder and little to no gradient signal passes between the two, yielding an undesirable stable equilibrium with the kl cost term at zero. We propose two techniques to mitigate this issue.

KL cost annealing  In this simple approach to this problem, we add a variable weight to the kl term in the cost function at training time. At the start of training, we set that weight to zero, so that the model learns to encode as much information in ~z as it can. Then, as training progresses, we gradually increase this weight, forcing the model to smooth out its encodings and pack them into the prior. We increase this weight until it reaches 1, at which point the weighted cost function is equivalent to the true variational lower bound. In this setting, we do not optimize the proper lower bound on the training data likelihood during the early stages of training, but we nonetheless see improvements on the value of that bound at convergence. This can be thought of as annealing from a vanilla autoencoder to a vae. The rate of this increase is tuned as a hyperparameter.

Figure 2: The weight of the kl divergence term of the variational lower bound according to a typical sigmoid annealing schedule, plotted alongside the (unweighted) value of the kl divergence term for our vae on the Penn Treebank.

Figure 2 shows the behavior of the kl cost term during the first 50k steps of training on Penn Treebank (Marcus et al., 1993) language modeling with kl cost annealing in place. This example reflects a pattern that we observed often: kl spikes early in training while the model can encode information in ~z cheaply, then drops substantially once it begins paying the full kl divergence penalty, and finally slowly rises again before converging as the model learns to condense more information into ~z.
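
A minimal sketch of such a sigmoid annealing schedule is given below; the paper tunes the rate as a hyperparameter, so the slope k and midpoint x0 here are illustrative values rather than the ones used in the experiments.

    import math

    def kl_weight(step, k=0.0025, x0=2500):
        """Sigmoid KL-annealing schedule: rises smoothly from near 0 to 1."""
        return 1.0 / (1.0 + math.exp(-k * (step - x0)))

    # At each training step, weight only the kl term of the cost:
    #   loss = recon + kl_weight(step) * kl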

Word dropout and historyless decoding  In addition to weakening the penalty term on the encodings, we also experiment with weakening the decoder. As in rnnlms and sequence autoencoders, during learning our decoder predicts each word conditioned on the ground-truth previous word. A natural way to weaken the decoder is to remove some or all of this conditioning information during learning. We do this by randomly replacing some fraction of the conditioned-on word tokens with the generic unknown word token unk. This forces the model to rely on the latent variable ~z to make good predictions. This technique is a variant of word dropout (Iyyer et al., 2015; Kumar et al., 2015), applied not to a feature extractor but to a decoder. We also experimented with standard dropout (Srivastava et al., 2014) applied to the input word embeddings in the decoder, but this did not help the model learn to use the latent variable.

This technique is parameterized by a keep rate k ∈ [0, 1]. We tune this parameter both for our vae and for our baseline rnnlm. Taken to the extreme of k = 0, the decoder sees no input, and is thus able to condition only on the number of words produced so far, yielding a model that is extremely limited in the kinds of distributions it can model without using ~z.
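
In code, the corruption step can be as simple as the following sketch, applied to the decoder’s input tokens at each training step; the unk and padding token ids are assumed inputs, and exempting padding from corruption is a detail not specified in the paper.

    import torch

    def word_dropout(dec_inputs, keep_rate, unk_id, pad_id=None):
        """Replace a fraction (1 - keep_rate) of conditioned-on tokens with unk."""
        if keep_rate >= 1.0:
            return dec_inputs
        keep = torch.rand_like(dec_inputs, dtype=torch.float) < keep_rate
        if pad_id is not None:
            keep |= dec_inputs == pad_id  # never corrupt padding positions
        return torch.where(keep, dec_inputs, torch.full_like(dec_inputs, unk_id))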

4 Results: Language modeling

In this section, we report on language modeling experiments on the Penn Treebank in an effort to discover whether the inclusion of a global latent variable is helpful for this standard task. For this reason, we restrict our vae hyperparameter search to those models which encode a non-trivial amount in the latent variable, as measured by the kl divergence term of the variational lower bound.

Results  We used the standard train–test split for the corpus, and report test set results in Table 2. The results shown reflect the training and test set performance of each model at the training step at which the model performs best on the development set. Our reported figures for the vae reflect the variational lower bound on the test likelihood, while for the rnnlms, which can be evaluated exactly, we report the true test likelihood. This discrepancy puts the vae at a potential disadvantage.

In the standard setting, the vae performs slightly worse than the rnnlm baseline, though it does succeed in using the latent space to a limited extent: it has a reconstruction cost (99) better than that of the baseline rnnlm, but makes up for this with a kl divergence cost of 2. Training a vae in the standard setting without both word dropout and cost annealing reliably results in models with equivalent performance to the baseline rnnlm, and zero kl divergence.

To demonstrate the ability of the latent variable to encode the full content of sentences in addition to more abstract global features, we also provide numbers for an inputless decoder that does not condition on previous tokens, corresponding to a word dropout keep rate of 0. In this regime we can see that the variational lower bound contains a significantly larger kl term and shows a substantial improvement over the weakened rnnlm, which is essentially limited to using unigram statistics in this setting. While it is weaker than a standard decoder, the inputless decoder has the interesting property that its sentence generating process is fully differentiable. Advances in generative models of this kind could be promising as a means of generating text while using adversarial training methods, which require differentiable generators.

                 Standard                                               Inputless decoder
Model    Train nll (kl)  Train ppl  Test nll (kl)  Test ppl     Train nll (kl)  Train ppl  Test nll (kl)  Test ppl
RNNLM    100             95         100            116          135             600        135            > 600
VAE      98 (2)          100        101 (2)        119          120 (15)        300        125 (15)       380

Table 2: Penn Treebank language modeling results, reported as negative log likelihoods and as perplexities. Lower is better for both metrics. For the vae, the kl term of the likelihood is shown in parentheses alongside the total likelihood.

Even with the techniques described in the previous section, including the inputless decoder, we were unable to train models for which the kl divergence term of the cost function dominates the reconstruction term. This suggests that it is still substantially easier to learn to factor the data distribution using simple local statistics, as in the rnnlm, such that an encoder will only learn to encode information in ~z when that information cannot be effectively described by these local statistics.

5 Results: Imputing missing words

We claim that our vae’s global sentence features make it especially well suited to the task of imputing missing words in otherwise known sentences. In this section, we present a technique for imputation and a novel evaluation strategy inspired by adversarial training. Qualitatively, we find that the vae yields more diverse and plausible imputations for the same amount of computation (see the examples given in Table 3), but precise quantitative evaluation requires intractable likelihood computations. We sidestep this by introducing a novel evaluation strategy.

While the standard rnnlm is a powerful generative model, the sequential nature of likelihood computation and decoding makes it unsuitable for performing inference over unknown words given some known words (the task of imputation). Except in the special case where the unknown words all appear at the end of the decoding sequence, sampling from the posterior over the missing variables is intractable for all but the smallest vocabularies. For a vocabulary of size V, it requires O(V) runs of full rnn inference per step of Gibbs sampling or iterated conditional modes. Worse, because of the directional nature of the graphical model given by an rnnlm, many steps of sampling could be required to propagate information between unknown variables and the known downstream variables. The vae, while it suffers from the same intractability problems when sampling or computing map imputations, can more easily propagate information between all variables, by virtue of having a global latent variable and a tractable recognition model.

For this experiment and subsequent analysis, we train our models on the Books Corpus introduced in Kiros et al. (2015). This is a collection of text from 12k e-books, mostly fiction. The dataset, after pruning, contains approximately 80m sentences. We find that this much larger amount of data produces more subjectively interesting generative models than smaller standard language modeling datasets. We use a fixed word dropout rate of 75% when training this model and all subsequent models unless otherwise specified. Our models (the vae and rnnlm) are trained as language models, decoding right-to-left to shorten the dependencies during learning for the vae. We use 512 hidden units.

Inference method  To generate imputations from the two models, we use beam search with beam size 15 for the rnnlm and approximate iterated conditional modes (Besag, 1986) with 3 steps of a beam size 5 search for the vae. This allows us to compare the same amount of computation for both models. We find that breaking decoding for the vae into several sequential steps is necessary to propagate information among the variables. Iterated conditional modes is a technique for finding the maximum joint assignment of a set of variables by alternately maximizing conditional distributions, and is a generalization of “hard-em” algorithms like k-means (Kearns et al., 1998). For approximate iterated conditional modes, we first initialize the unknown words to the unk token. We then alternate assigning the latent variable to its mode from the recognition model, and performing constrained beam search to assign the unknown words. Both of our generative models are trained to decode sentences from right-to-left, which shortens the dependencies involved in learning for the vae, and we impute the final 20% of each sentence. This lets us demonstrate the advantages of the global latent variable in the regime where the rnnlm suffers the most from its inductive bias.
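
The alternation described above can be sketched as follows. This is a hedged outline only: recognition_mean and constrained_beam_search stand in for the recognition network’s mode and a (right-to-left) beam search restricted to the unknown positions, and are passed in as hypothetical helpers rather than drawn from any released code.

    def impute_icm(known_words, n_unknown, unk_id,
                   recognition_mean, constrained_beam_search,
                   steps=3, beam_size=5):
        """Approximate iterated conditional modes for imputing missing words."""
        # Initialize every unknown word to the unk token.
        guess = list(known_words) + [unk_id] * n_unknown
        for _ in range(steps):
            z = recognition_mean(guess)  # set the latent variable to its posterior mode
            # Re-decode only the unknown positions, conditioned on z.
            guess = constrained_beam_search(z, list(known_words), n_unknown, beam_size)
        return guess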

but now , as they parked out front and owen stepped out of the car , he could see
  True:  that the transition was complete .
  RNNLM: it , ” i said .
  VAE:   through the driver 's door .

you kill him and his
  True:  men .
  RNNLM: . ”
  VAE:   brother .

not surprising , the mothers dont exactly see eye to eye with me
  True:  on this matter .
  RNNLM: , i said .
  VAE:   , right now .

Table 3: Examples of using beam search to impute missing words within sentences. Since we decode from right to left, note the stereotypical completions given by the rnnlm, compared to the vae completions that often use topic data and more varied vocabulary.

Model            Adv. Err. (%)            NLL
                 Unigram      lstm        rnnlm
RNNLM (15 bm.)   28.32        38.92       46.01
VAE (3x5 bm.)    22.39        35.59       46.14

Table 4: Results for adversarial evaluation of imputations. Unigram and lstm numbers are the adversarial error (see text) and rnnlm numbers are the negative log-likelihood given to the entire generated sentence by the rnnlm, a measure of sentence typicality. Lower is better on both metrics. The vae is able to generate imputations that are significantly more difficult to distinguish from the true sentences.

Adversarial evaluation  Drawing inspiration from adversarial training methods for generative models as well as non-parametric two-sample tests (Goodfellow et al., 2014; Li et al., 2015b; Denton et al., 2015; Gretton et al., 2012), we evaluate the imputed sentence completions by examining their distinguishability from the true sentence endings. While the non-differentiability of the discrete rnn decoder prevents us from easily applying the adversarial criterion at train time, we can define a very flexible test time evaluation by training a discriminant function to separate the generated and true sentences, which defines an adversarial error.

We train two classifiers: a bag-of-unigrams logistic regression classifier and an lstm logistic regression classifier that reads the input sentence and produces a binary prediction after seeing the final eos token. We train these classifiers using early stopping on an 80/10/10 train/dev/test split of 320k sentences, constructing a dataset of 50% complete sentences from the corpus (positive examples) and 50% sentences with imputed completions (negative examples). We define the adversarial error as the gap between the ideal accuracy of the discriminator (50%, i.e. indistinguishable samples), and the actual accuracy attained.
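
As a simplified sketch of this evaluation (using scikit-learn for the bag-of-unigrams discriminator; the paper’s classifiers additionally use early stopping and a fixed 80/10/10 split, which are omitted here), the adversarial error is just the gap between the discriminator’s accuracy and 50%:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def adversarial_error(train_real, train_fake, test_real, test_fake):
        """Train a unigram logistic regression discriminator; return |acc - 0.5|."""
        clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(train_real + train_fake,
                [1] * len(train_real) + [0] * len(train_fake))
        acc = clf.score(test_real + test_fake,
                        [1] * len(test_real) + [0] * len(test_fake))
        return abs(acc - 0.5)  # 0 would mean indistinguishable samples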

Results  As a consequence of this experimental setup, the rnnlm cannot choose anything outside of the top 15 tokens given by the rnn’s initial unconditional distribution P(x_1|Null) when producing the final token of the sentence, since it has not yet generated anything to condition on, and has a beam size of 15. Table 4 shows that this weakness makes the rnnlm produce far less diverse samples than the vae and suffer accordingly versus the adversarial classifier. Additionally, we include the score given to the entire sentence with the imputed completion by a separate independently trained language model. The likelihood results are comparable, though the rnnlm’s favoring of generic high-probability endings such as “he said,” gives it a slightly lower negative log-likelihood. Measuring the rnnlm likelihood of sentences themselves produced by an rnnlm is not a good measure of the power of the model, but demonstrates that the rnnlm can produce what it sees as high-quality imputations by favoring typical local statistics, even though their repetitive nature produces easy failure modes for the adversarial classifier. Accordingly, under the adversarial evaluation our model substantially outperforms the baseline since it is able to efficiently propagate information bidirectionally through the latent variable.

6 Analyzing variational models

We now turn to more qualitative analysis of the model. Since our decoder model p(x|~z) is a sophisticated rnnlm, simply sampling from the directed graphical model (first p(~z) then p(x|~z)) would not tell us much about how much of the data is being explained by each of the latent space and the decoder. Instead, for this part of the evaluation, we sample from the Gaussian prior, but use a greedy deterministic decoder for p(x|~z), the rnnlm conditioned on ~z. This allows us to get a sense of how much of the variance in the data distribution is being captured by the distributed vector ~z as opposed to the decoder. Interestingly, these results qualitatively demonstrate that large amounts of variation in generated language can be achieved by following this procedure. In the appendix, we provide some results on small text classification tasks.

6.1 Analyzing the impact of word dropout

For this experiment, we train on the Books Corpus and test on a held out 10k sentence test set from that corpus. We find that train and test set performance are very similar. In Figure 3, we examine the impact of word dropout on the variational lower bound, broken down into kl divergence and cross entropy components. We drop out words with the specified keep rate at training time, but supply all words as inputs at test time except in the 0% setting.

100% word keep:
“ no , ” he said .
“ thank you , ” he said .

75% word keep:
“ love you , too . ”
she put her hand on his shoulder and followed him to the door .

50% word keep:
“ maybe two or two . ”
she laughed again , once again , once again , and thought about it for a moment in long silence .

0% word keep:
i i hear some of of of
i was noticed that she was holding the in in of the the in

Table 5: Samples from a model trained with varying amounts of word dropout. We sample a vector from the Gaussian prior and apply greedy decoding to the result. Note that diverse samples can be achieved using a purely deterministic decoding procedure. Once we reach a purely inputless decoder in the 0% setting, however, the samples cease to be plausible English sentences.

he had been unable to conceal the fact that there was a logical explanation for his inability to alter the fact that they were supposed to be on the other side of the house .

with a variety of pots strewn scattered across the vast expanse of the high ceiling , a vase of colorful flowers adorned the tops of the rose petals littered the floor and littered the floor .

atop the circular dais perched atop the gleaming marble columns began to emerge from atop the stone dais , perched atop the dais .

Table 6: Greedily decoded sentences from a model with 75% word keep probability, sampling from lower-likelihood areas of the latent space. Note the consistent topics and vocabulary usage.

Figure 3: The values of the two terms of the cost function as word dropout increases. (The plot shows cross entropy and kl divergence against keep rates of 100%, 90%, 75%, 50%, and 0%.)

We do not re-tune the hyperparameters for each run, which results in the model with no dropout encoding very little information in ~z (i.e., the kl component is small). We can see that as we lower the keep rate for word dropout, the amount of information stored in the latent variable increases, and the overall likelihood of the model degrades somewhat. Results from Section 4 indicate that a model with no latent variable would degrade in performance significantly more in the presence of heavy word dropout.

We also qualitatively evaluate samples, to demonstrate that the increased kl allows meaningful sentences to be generated purely from continuous sampling. Since our decoder model p(x|~z) is a sophisticated rnnlm, simply sampling from the directed graphical model (first p(~z) then p(x|~z)) would not tell us about how much of the data is being explained by the learned vector vs. the language model. Instead, for this part of the qualitative evaluation, we sample from the Gaussian prior, but use a greedy deterministic decoder for x, taking each token x_t = argmax_{x_t} p(x_t | x_0, ..., x_{t−1}, ~z). This allows us to get a sense of how much of the variance in the data distribution is being captured by the distributed vector ~z as opposed to by local language model dependencies.
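
A sketch of this sampling procedure is given below. It assumes a module shaped like the SentenceVAE sketch earlier in the paper and decodes left to right for simplicity (the Books Corpus models in the paper decode right to left); it is an illustration, not the authors’ decoder.

    import torch

    @torch.no_grad()
    def greedy_decode_from_prior(model, bos_id, eos_id, z_dim, max_len=40):
        """Sample z ~ N(0, I), then take x_t = argmax p(x_t | x_<t, z) at each step."""
        z = torch.randn(1, z_dim)
        h = torch.tanh(model.z_to_state(z)).unsqueeze(0)
        state = (h, torch.zeros_like(h))
        tokens = [bos_id]
        for _ in range(max_len):
            emb = model.embed(torch.tensor([[tokens[-1]]]))
            out, state = model.decoder(emb, state)
            next_id = int(model.out(out[:, -1]).argmax(dim=-1))
            if next_id == eos_id:
                break
            tokens.append(next_id)
        return tokens[1:]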

These results, shown in Table 5, qualitatively demonstrate that large amounts of variation in generated language can be achieved by following this procedure. At the low end, where very little of the variance is explained by ~z, we see that greedy decoding applied to a Gaussian sample does not produce diverse sentences. As we increase the amount of word dropout and force ~z to encode more information, we see the sentences become more varied, but past a certain point they begin to repeat words or show other signs of ungrammaticality. Even in the case of a fully dropped-out decoder, the model is able to capture higher-order statistics not present in the unigram distribution.

Additionally, in Table 6 we examine the effect of using lower-probability samples from the latent Gaussian space for a model with a 75% word keep rate. We find lower-probability samples by applying an approximately volume-preserving transformation to the Gaussian samples that stretches some eigenspaces by up to a factor of 4. This has the effect of creating samples that are not too improbable under the prior, but still reach into the tails of the distribution. We use a random linear transformation, with matrix elements drawn from a uniform distribution from [−c, c], with c chosen to give the desired properties (0.1 in our experiments). Here we see that the sentences are far less typical, but for the most part are grammatical and maintain a clear topic, indicating that the latent variable is capturing a rich variety of global features even for rare sentences.
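
The paper does not spell out exactly how this transformation is applied, so the following NumPy sketch is one reading of the description: draw a random square matrix with entries uniform on [−c, c] (c = 0.1) and apply it to the Gaussian prior samples before decoding.

    import numpy as np

    def stretch_sample(z, c=0.1, seed=0):
        """Map a prior sample through a random linear transformation with
        uniform [-c, c] entries, which stretches some eigenspaces (one reading
        of the procedure described above; not the authors' exact code)."""
        rng = np.random.default_rng(seed)
        d = z.shape[-1]
        A = rng.uniform(-c, c, size=(d, d))
        return z @ A.T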


input     we looked out at the setting sun .      i went to the kitchen .        how are you doing ?
mean      they were laughing at the same time .   i went to the kitchen .        what are you doing ?
samp. 1   ill see you in the early morning .      i went to my apartment .       “ are you sure ?
samp. 2   i looked up at the blue sky .           i looked around the room .     what are you doing ?
samp. 3   it was down on the dance floor .        i turned back to the table .   what are you doing ?

Table 7: Three sentences which were used as inputs to the vae, presented with greedy decodes from the mean of the posterior distribution, and from three samples from that distribution.

“ i want to talk to you . ”
“ i want to be with you . ”
“ i do n't want to be with you . ”
i do n't want to be with you .
she did n't want to be with him .

he was silent for a long moment .
he was silent for a moment .
it was quiet for a moment .
it was dark and cold .
there was a pause .
it was my turn .

Table 8: Paths between pairs of random points in vae space: Note that intermediate sentences are grammatical, and that topic and syntactic structure are usually locally consistent.

6.2 Sampling from the posterior

In addition to generating unconditional samples, we can also examine the sentences decoded from the posterior vectors p(z|x) for various sentences x. Because the model is regularized to produce distributions rather than deterministic codes, it does not exactly memorize and round-trip the input. Instead, we can see what the model considers to be similar sentences by examining the posterior samples in Table 7. The codes appear to capture information about the number of tokens and parts of speech for each token, as well as topic information. As the sentences get longer, the fidelity of the round-tripped sentences decreases.

6.3 Homotopies

The use of a variational autoencoder allows us to generate sentences using greedy decoding on continuous samples from the space of codes. Additionally, the volume-filling and smooth nature of the code space allows us to examine for the first time a concept of homotopy (linear interpolation) between sentences. In this context, a homotopy between two codes ~z1 and ~z2 is the set of points on the line between them, inclusive, ~z(t) = ~z1 ∗ (1 − t) + ~z2 ∗ t for t ∈ [0, 1]. Similarly, the homotopy between two sentences decoded (greedily) from codes ~z1 and ~z2 is the set of sentences decoded from the codes on the line. Examining these homotopies allows us to get a sense of what neighborhoods in code space look like – how the autoencoder organizes information and what it regards as a continuous deformation between two sentences.
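
Concretely, the interpolation itself is just the convex combination given above; a small illustrative sketch (each returned code can then be decoded greedily as in the earlier samples):

    import numpy as np

    def homotopy(z1, z2, steps=8):
        """Codes on the line z(t) = z1 * (1 - t) + z2 * t for t in [0, 1]."""
        return [z1 * (1.0 - t) + z2 * t for t in np.linspace(0.0, 1.0, steps)]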

While a standard non-variational rnnlm does not have a way to perform these homotopies, a vanilla sequence autoencoder can do so. As mentioned earlier in the paper, if we examine the homotopies created by the sequence autoencoder in Table 1, though, we can see that the transition between sentences is sharp, and results in ungrammatical intermediate sentences. This gives evidence for our intuition that the vae learns representations that are smooth and “fill up” the space.

In Table 8 (and in additional tables in the appendix) we can see that the codes mostly contain syntactic information, such as the number of words and the parts of speech of tokens, and that all intermediate sentences are grammatical. Some topic information also remains consistent in neighborhoods along the path. Additionally, sentences with similar syntax and topic but flipped sentiment valence, e.g. “the pain was unbearable” vs. “the thought made me smile”, can have similar embeddings, a phenomenon which has been observed with single-word embeddings (for example the vectors for “bad” and “good” are often very similar due to their similar distributional characteristics).

7 Conclusion

This paper introduces the use of a variational autoencoder for natural language sentences. We present novel techniques that allow us to train our model successfully, and find that it can effectively impute missing words. We analyze the latent space learned by our model, and find that it is able to generate coherent and diverse sentences through purely continuous sampling and provides interpretable homotopies that smoothly interpolate between sentences.

We hope in future work to investigate factorization of the latent variable into separate style and content components, to generate sentences conditioned on extrinsic features, to learn sentence embeddings in a semi-supervised fashion for language understanding tasks like textual entailment, and to go beyond adversarial evaluation to a fully adversarial training objective.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR.

Justin Bayer and Christian Osendorfer. 2015. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.

Julian Besag. 1986. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological) 48(3):259–302.

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Proc. NIPS.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proc. NIPS.

Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Proc. NIPS.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, page 350.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR.

Otto Fabius and Joost R. van Amersfoort. 2014. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proc. NIPS.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In Proc. ICML.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. JMLR 13(1):723–773.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8).

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proc. ACL.

Michael Kearns, Yishay Mansour, and Andrew Y. Ng. 1998. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models, Springer, pages 495–520.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. EMNLP.

Diederik P. Kingma and Max Welling. 2015. Auto-encoding variational Bayes. In Proc. ICLR.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. arXiv preprint arXiv:1506.06726.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proc. ICML.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015a. A hierarchical neural autoencoder for paragraphs and documents. In Proc. ACL.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, pages 1–7.

Yujia Li, Kevin Swersky, and Richard Zemel. 2015b. Generative moment matching networks. In Proc. ICML.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In Proc. ICLR.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.

Yishu Miao, Lei Yu, and Phil Blunsom. 2015. Neural variational inference for text processing. arXiv preprint arXiv:1511.06038.

Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Honza Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Proc. ICASSP.

Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. 2015. Techniques for learning binary stochastic feedforward neural networks. In Proc. ICLR.

Danilo J. Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proc. ICML.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proc. ICML.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proc. NIPS.

Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proc. NIPS.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proc. NIPS.

Lucas Theis and Matthias Bethge. 2015. Generative image modeling using spatial LSTMs. In Proc. NIPS.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proc. CVPR.

Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proc. IJCAI.

Text classification

In order to further examine the structure of the representations discovered by the vae, we conduct classification experiments on paraphrase detection and question type classification. We train a vae with a hidden state size of 1200 hidden units on the Books Corpus, and use the posterior mean of the model as the extracted sentence vector. We train classifiers on these means using the same experimental protocol as Kiros et al. (2015).

Method                  Accuracy  F1
Feats                   73.2      –
rae+dp                  72.6      –
rae+feats               74.2      –
rae+dp+feats            76.8      83.6
st                      73.0      81.9
Bi-st                   71.2      81.2
Combine-st              73.0      82.0
vae                     72.9      81.4
vae+feats               75.0      82.4
vae+combine-st          74.8      82.3
Feats+combine-st        75.8      83.0
vae+combine-st+feats    76.9      83.8

Table 9: Results for the msr Paraphrase Corpus.

Paraphrase detection  For the task of paraphrase detection, we use the Microsoft Research Paraphrase Corpus (Dolan et al., 2004). We compute features from the sentence vectors of sentence pairs in the same way as Kiros et al. (2015), concatenating the elementwise products and the absolute value of the elementwise differences of the two vectors. We train an ℓ2-regularized logistic regression classifier and tune the regularization strength using cross-validation.
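
A minimal sketch of this feature construction (illustrative NumPy, assuming u and v are the extracted posterior-mean vectors of the two sentences) is:

    import numpy as np

    def pair_features(u, v):
        """Concatenate the elementwise product with the absolute elementwise
        difference of the two sentence vectors, as described above."""
        return np.concatenate([u * v, np.abs(u - v)])

The resulting feature vector can then be fed to an ℓ2-regularized logistic regression classifier, as in the experiments described above.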

We present results in Table 9 and compare to several previous models for this task. Feats is the lexicalized baseline from Socher et al. (2011). rae uses the recursive autoencoder from that work, and dp adds their dynamic pooling step to calculate pairwise features. st uses features from the unidirectional skip-thought model, bi-st uses bidirectional skip-thought, and combine-st uses the concatenation of those features. We also experimented with concatenating lexical features and the two types of distributed features.

We found that our features performed slightly worse than skip-thought features by themselves and slightly better than recursive autoencoder features, and were complementary and yielded strong performance when simply concatenated with the skip-thought features.

Question classification  We also conduct experiments on the TREC Question Classification dataset of Li and Roth (2002). Following Kiros et al. (2015), we train an ℓ2-regularized softmax classifier with 10-fold cross-validation to set the regularization. Note that using a linear classifier like this one may disadvantage our representations here, since the Gaussian distribution over hidden codes in a vae is likely to discourage linear separability.

Method           Accuracy
st               91.4
Bi-st            89.4
Combine-st       92.2
AE               84.2
vae              87.0
cbow             87.3
vae, combine-st  92.0
rnn              90.2
cnn              93.6

Table 10: Results for TREC Question Classification.

We present results in Table 10. Here, ae is a plain sequence autoencoder. We compare with results from a bag of word vectors (cbow, Zhao et al., 2015) and skip-thought (st). We also compare with an rnn classifier (Zhao et al., 2015) and a cnn classifier (Kim, 2014), both of which, unlike our model, are optimized end-to-end. We were not able to make the vae codes perform better than cbow in this case, but they did outperform features from the sequence autoencoder. Skip-thought performed quite well, possibly because the skip-thought training objective of next sentence prediction is well aligned to this task: it essentially trains the model to generate sentences that address implicit open questions from the narrative of the book. Combining the two representations did not give any additional performance gain over the base skip-thought model.

Hyperparameter tuning

We extensively tune the hyperparameters of each model using an automatic Bayesian hyperparameter tuning algorithm (based on Snoek et al., 2012) over development set data. We run the model with each set of hyperparameters for 10 hours, operating 12 experiments in parallel, and choose the best set of hyperparameters after 200 runs. Results for our language modeling experiments are reported in Table 11 below.

Additional homotopies

Table 12, below, shows additional homotopies from our model. We observe that intermediate sentences are almost always grammatical, and often contain consistent topic, vocabulary and syntactic information in local neighborhoods as they interpolate between the endpoint sentences. Because the model is trained on fiction, including romance novels, the topics are often rather dramatic.


                          Standard            Inputless decoder
                          rnnlm     vae       rnnlm     vae
Embedding dim.            464       353       305       499
lstm state dim.           337       191       68        350
z dim.                    –         13        –         111
Word dropout keep rate    0.66      0.62      –         –

Table 11: Automatically selected hyperparameter values used for the models used in the Penn Treebank language modeling experiments.

amazing , is n’t it ?so , what is it ?it hurts , isnt it ?why would you do that ?“ you can do it .“ i can do it .i ca n’t do it .“ i can do it .“ do n’t do it .“ i can do it .i could n’t do it .

no .he said .“ no , ” he said .“ no , ” i said .“ i know , ” she said .“ thank you , ” she said .“ come with me , ” she said .“ talk to me , ” she said .“ do n’t worry about it , ” she said .

i dont like it , he said .i waited for what had happened .it was almost thirty years ago .it was over thirty years ago .that was six years ago .he had died two years ago .ten , thirty years ago .“ it ’s all right here .“ everything is all right here .“ it ’s all right here .it ’s all right here .we are all right here .come here in five minutes .

this was the only way .it was the only way .it was her turn to blink .it was hard to tell .it was time to move on .he had to do it again .they all looked at each other .they all turned to look back .they both turned to face him .they both turned and walked away .

there is no one else in the world .there is no one else in sight .they were the only ones who mattered .they were the only ones left .he had to be with me .she had to be with him .i had to do this .i wanted to kill him .i started to cry .i turned to him .

im fine .youre right .“ all right .you ’re right .okay , fine .“ okay , fine .yes , right here .no , not right now .“ no , not right now .“ talk to me right now .please talk to me right now .i ’ll talk to you right now .“ i ’ll talk to you right now .“ you need to talk to me now .“ but you need to talk to me now .

Table 12: Selected homotopies between pairs of random points in the latent vae space.