Joint Training of Variational Auto-Encoder and Latent Energy-Based Model

Tian Han1, Erik Nijkamp2, Linqi Zhou2, Bo Pang2, Song-Chun Zhu2, Ying Nian Wu2

1 Department of Computer Science, Stevens Institute of Technology   2 Department of Statistics, University of California, Los Angeles

[email protected], {enijkamp,bopang}@ucla.edu, [email protected]

{sczhu,ywu}@stat.ucla.edu

Abstract

This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of VAE and latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image, and the objective function is of an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.

1. Introduction

The variational auto-encoder (VAE) [23, 35] is a powerful method for generative modeling and unsupervised learning. It consists of a generator model that transforms a noise vector to a signal such as an image via a top-down convolutional network (also called a deconvolutional network due to its top-down nature). It also consists of an inference model that infers the latent vector from the image via a bottom-up network. The VAE has seen many applications in image and video synthesis [12, 3] and unsupervised and semi-supervised learning [37, 22].

Despite its success, the VAE suffers from relatively weak synthesis quality compared to methods such as generative adversarial nets (GANs) [11, 34] that are based on adversarial learning. While combining the VAE objective function with the GAN objective function can improve the synthesis quality, such a combination is rather ad hoc. In this paper, we pursue a more systematic integration of variational learning and adversarial learning. Specifically, instead of employing a discriminator as in GANs, we recruit a latent energy-based model (EBM) to mesh with the VAE seamlessly in a joint training scheme.

The generator model in VAE is a directed model, with a known prior distribution on the latent vector, such as Gaussian white noise distribution, and a conditional distribution of the image given the latent vector. The advantage of such a model is that it can generate synthesized examples by direct ancestral sampling. The generator model defines a joint probability density of the latent vector and the image in a top-down scheme. We may call this joint density the generator density.

VAE also has an inference model, which defines the conditional distribution of the latent vector given the image. Together with the data distribution that generates the observed images, they define a joint probability density of the latent vector and the image in a bottom-up scheme. We may call this joint density the joint data density, where the latent vector may be considered as the missing data in the language of the EM algorithm [6].

As we shall explain later, the VAE amounts to joint minimization of the Kullback-Leibler divergence from the data density to the generator density, where the joint minimization is over the parameters of both the generator model and the inference model. In this minimization, the generator density seeks to cover the modes of the data density, and as a result, the generator density can be overly dispersed. This may partially explain VAE's lack of synthesis quality.

Unlike the generator network, the latent EBM is an undirected model. It defines an unnormalized joint density on the latent vector and the image via a joint energy function. Such an undirected form enables the latent EBM to better approximate the data density than the generator network. However, the maximum likelihood learning of latent EBM requires (1) inference sampling: sampling from the conditional density of the latent vector given the observed example, and (2) synthesis sampling: sampling from the joint density of the latent vector and the image. Both inference sampling and synthesis sampling require time-consuming Markov chain Monte Carlo (MCMC).

In this paper, we propose to jointly train the VAE and the latent EBM so that these two models can borrow strength from each other. The objective function of the joint training method consists of the Kullback-Leibler divergences between three joint densities of the latent vector and the image, namely the data density, the generator density, and the latent EBM density. The three Kullback-Leibler divergences form an elegant symmetric and anti-symmetric form of divergence triangle that integrates the variational learning and the adversarial learning seamlessly.

The joint training is beneficial to both the VAE and the latent EBM. The latent EBM has a more flexible form and can approximate the data density better than the generator model. It serves as a critic of the generator model by judging it against the data density. To the generator model, the latent EBM serves as a surrogate of the data density and a target density for the generator model to approximate. The generator model and the associated inference model, in return, serve as approximate synthesis sampler and inference sampler of the latent EBM, thus relieving the latent EBM of the burden of MCMC sampling.

Our experiments show that the joint training method can learn the generator model with strong synthesis ability. It can also learn an energy function that is capable of anomaly detection.

2. Contributions and related work

The following are contributions of our work. (1) We propose a joint training method to learn both VAE and latent EBM. The objective function is of a symmetric and anti-symmetric form of divergence triangle. (2) The proposed method integrates variational and adversarial learning. (3) The proposed method integrates the research themes initiated by the Boltzmann machine and Helmholtz machine.

The following are the themes that are related to our work.

(1) Variational and adversarial learning. Over the past few years, there has been an explosion in research on variational learning and adversarial learning, inspired by VAE [23, 35, 37, 12] and GAN [11, 34, 2, 41] respectively. One aim of our work is to find a natural integration of variational and adversarial learning. We also compare with some prominent methods along this theme in our experiments. Notably, adversarially learned inference (ALI) [9, 7] combines the learning of the generator model and inference model in an adversarial framework. It can be improved by adding conditional entropy regularization as in the more recent methods ALICE [26] and SVAE [4]. Though these methods are trained using a joint discriminator on the image and latent vector, such a discriminator is not a probability density, thus it is not a latent EBM.

(2) Helmholtz machine and Boltzmann machine. Before VAE and GAN took over, the Boltzmann machine [1, 17, 36] and the Helmholtz machine [16] were two classes of models for generative modeling and unsupervised learning. The Boltzmann machine is the most prominent example of latent EBM. The learning consists of two phases. The positive phase samples from the conditional distribution of the latent variables given the observed example. The negative phase samples from the joint distribution of the latent variables and the image. The parameters are updated based on the differences between statistical properties of the positive and negative phases. The Helmholtz machine can be considered a precursor of VAE. It consists of a top-down generation model and a bottom-up recognition model. The learning also consists of two phases. The wake phase infers the latent variables based on the recognition model and updates the parameters of the generation model. The sleep phase generates synthesized data from the generation model and updates the parameters of the recognition model. Our work seeks to integrate the two classes of models.

(3) Divergence triangle for joint training of generator network and energy-based model. The generator and energy-based model can be trained separately using maximum likelihood criteria as in [13, 33], and they can also be trained jointly as recently explored by [20, 38, 14, 24]. In particular, [14] proposes a divergence triangle criterion for joint training. Our training criterion is also in the form of a divergence triangle. However, the EBM in these papers is only defined on the image and there is no latent vector in the EBM. In our work, we employ a latent EBM which defines a joint density on the latent vector and the image; this undirected joint density is a more natural match to the generator density and the data density, both of which are directed joint densities of the latent vector and the image.

3. Models and learning

3.1. Generator network

Let z be the d-dimensional latent vector. Let x be the D-dimensional signal, such as an image. VAE consists of a generator model, which defines a joint probability density

$$p_\theta(x, z) = p(z)\, p_\theta(x|z), \quad (1)$$

in a top-down direction, where p(z) is the known prior distribution of the latent vector z, such as a uniform distribution or Gaussian white noise, i.e., z ∼ N(0, I_d), where I_d is the d-dimensional identity matrix. pθ(x|z) is the conditional distribution of x given z. A typical form of pθ(x|z) is such that x = gθ(z) + ε, where gθ(z) is parametrized by a top-down convolutional network (also called the deconvolutional network due to the top-down direction), with θ collecting all the weight and bias terms of the network. ε is the residual noise, and it is usually assumed that ε ∼ N(0, σ²I_D).
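For concreteness, here is a minimal PyTorch-style sketch of such a top-down generator network, assuming 32×32 RGB images and a latent dimension d = 128; the layer widths and names are illustrative assumptions on our part, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Top-down network g_theta(z); p_theta(x|z) = N(g_theta(z), sigma^2 I)."""
    def __init__(self, z_dim=128, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # z: (B, z_dim, 1, 1) -> (B, 4*ngf, 4, 4)
            nn.ConvTranspose2d(z_dim, 4 * ngf, 4, 1, 0), nn.BatchNorm2d(4 * ngf), nn.ReLU(True),
            # -> (B, 2*ngf, 8, 8)
            nn.ConvTranspose2d(4 * ngf, 2 * ngf, 4, 2, 1), nn.BatchNorm2d(2 * ngf), nn.ReLU(True),
            # -> (B, ngf, 16, 16)
            nn.ConvTranspose2d(2 * ngf, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),
            # -> (B, 3, 32, 32), images scaled to [-1, 1]
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

# Ancestral sampling: z ~ N(0, I_d), then x = g_theta(z) (plus noise if sigma > 0).
g = Generator()
z = torch.randn(16, 128)
x = g(z)  # (16, 3, 32, 32)
```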


The generator network is a directed model. We call pθ(x, z) the generator density. x can be sampled directly by first sampling z and then sampling x given z. This is sometimes called ancestral sampling in the literature [30].

The marginal distribution pθ(x) = ∫ pθ(x, z)dz is not in closed form. Thus the generator network is sometimes called an implicit generative model in the literature.

The inference of z can be based on the posterior distribution of z given x, i.e., pθ(z|x) = pθ(x, z)/pθ(x), which is not in closed form either.

3.2. Inference model

VAE assumes an inference model qφ(z|x) with a separate set of parameters φ. One example of qφ(z|x) is N(µφ(x), Vφ(x)), where µφ(x) is the d-dimensional mean vector, and Vφ(x) is the d-dimensional diagonal variance-covariance matrix. Both µφ and Vφ can be parametrized by bottom-up convolutional networks, whose parameters are denoted by φ. The inference model qφ(z|x) is a closed-form approximation to the true posterior pθ(z|x).
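A matching bottom-up inference network can be sketched as follows, mirroring the illustrative generator above (again assuming 32×32 RGB inputs and d = 128; not the paper's exact architecture). It outputs µφ(x) and the log of the diagonal of Vφ(x).

```python
import torch
import torch.nn as nn

class InferenceNet(nn.Module):
    """Bottom-up network for q_phi(z|x) = N(mu_phi(x), diag(V_phi(x)))."""
    def __init__(self, z_dim=128, ndf=64):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),            # 32 -> 16
            nn.Conv2d(ndf, 2 * ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),      # 16 -> 8
            nn.Conv2d(2 * ndf, 4 * ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),  # 8 -> 4
        )
        self.mu = nn.Linear(4 * ndf * 4 * 4, z_dim)
        self.log_var = nn.Linear(4 * ndf * 4 * 4, z_dim)  # log of the diagonal of V_phi(x)

    def forward(self, x):
        h = self.feat(x).flatten(1)
        return self.mu(h), self.log_var(h)
```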

3.3. Data density

Let qdata(x) be the distribution that generates the observed images. In practice, expectation with respect to qdata(x) can be approximated by the average over the observed training examples.

The reason we use the notation q to denote the data distribution qdata(x) is that qdata(x) can be naturally combined with the inference model qφ(z|x), so that we have the joint density

$$q_\phi(x, z) = q_{\rm data}(x)\, q_\phi(z|x). \quad (2)$$

The above is also a directional density in that it can be factorized in a bottom-up direction. We may call the joint density qφ(x, z) the data density, where in the terminology of the EM algorithm, we may consider z as the missing data, and qφ(z|x) as the imputation model of the missing data.

3.4. VAE

The top-down generator density pθ(x, z) = p(z)pθ(x|z) and the bottom-up data density qφ(x, z) = qdata(x)qφ(z|x) form a natural pair. As noted by [14], VAE can be viewed as the following joint minimization

$$\min_\theta \min_\phi \ \mathrm{KL}(q_\phi(x, z) \,\|\, p_\theta(x, z)), \quad (3)$$

where for two densities q(x) and p(x) in general, KL(q(x)‖p(x)) = Eq[log(q(x)/p(x))] is the Kullback-Leibler divergence between q and p.

To connect the above joint minimization to the usual form of VAE,

$$\mathrm{KL}(q_\phi(x, z) \,\|\, p_\theta(x, z)) = \mathrm{KL}(q_{\rm data}(x) \,\|\, p_\theta(x)) \quad (4)$$
$$\qquad\qquad + \mathrm{E}_{q_{\rm data}(x)}\big[\mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))\big], \quad (5)$$

where for two joint densities q(x, y) and p(x, y), we define KL(q(y|x)‖p(y|x)) = E_{q(x,y)}[log(q(y|x)/p(y|x))] ≥ 0.

Since KL(qdata(x)‖pθ(x)) = Eqdata(x)[log qdata(x)] − Eqdata(x)[log pθ(x)], the joint minimization problem is equivalent to the joint maximization of

$$\mathrm{E}_{q_{\rm data}(x)}\big[\log p_\theta(x) - \mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))\big] = \mathrm{E}_{q_{\rm data}(x)}\big[\mathrm{E}_{q_\phi(z|x)}[\log p_\theta(x, z)] - \mathrm{E}_{q_\phi(z|x)}[\log q_\phi(z|x)]\big], \quad (6)$$

which is the lower bound of the log-likelihood used in VAE [23].
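For reference, the following sketch spells out the negative of the lower bound in Eqn. 6 under the Gaussian assumptions above, reusing the illustrative generator g and encoder enc; treating σ as a fixed hyper-parameter and dropping additive constants are our own simplifications, not choices stated by the paper.

```python
import torch

def negative_elbo(x, g, enc, sigma=0.3):
    """-ELBO = E_q[-log p_theta(x|z)] + KL(q_phi(z|x) || N(0, I)), averaged over the batch."""
    mu, log_var = enc(x)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)           # reparameterized z ~ q_phi(z|x)
    recon = ((x - g(z)) ** 2).flatten(1).sum(1) / (2 * sigma ** 2)  # -log p_theta(x|z), up to a constant
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(1)       # KL(N(mu, V) || N(0, I)) in closed form
    return (recon + kl).mean()
```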

It is worth noting that the wake-sleep algorithm [16] for training the Helmholtz machine consists of (1) the wake phase: min_θ KL(qφ(x, z)‖pθ(x, z)), and (2) the sleep phase: min_φ KL(pθ(x, z)‖qφ(x, z)). The sleep phase reverses the order of the KL-divergence.

3.5. Latent EBM

Unlike the directed joint densities pθ(x, z) = p(z)pθ(x|z) in the generator network and qφ(x, z) = qdata(x)qφ(z|x) in the data density, the latent EBM defines an undirected joint density, albeit an unnormalized one:

$$\pi_\alpha(x, z) = \frac{1}{Z(\alpha)} \exp[f_\alpha(x, z)], \quad (7)$$

where −fα(x, z) is the energy function (a term originating from statistical physics) defined on the image x and the latent vector z, and $Z(\alpha) = \int\!\int \exp[f_\alpha(x, z)]\, dx\, dz$ is the normalizing constant. It is usually intractable, and exp[fα(x, z)] is an unnormalized density. The most prominent example of latent EBM is the Boltzmann machine [1, 17], where fα(x, z) consists of pairwise potentials. In our work, we first encode x into a vector, concatenate this vector with the vector z, and then obtain fα(x, z) by a network defined on the concatenated vector.
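Below is a minimal sketch of one way to parametrize fα(x, z) in this fashion: a convolutional encoder maps x to a feature vector, which is concatenated with z and passed through a small fully connected head that outputs a scalar. The widths and the use of spectral normalization are illustrative assumptions loosely following Section 5, not an exact specification of the paper's network.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class LatentEBM(nn.Module):
    """f_alpha(x, z): scalar output per example; the energy is -f_alpha(x, z)."""
    def __init__(self, z_dim=128, nef=64):
        super().__init__()
        self.x_enc = nn.Sequential(
            spectral_norm(nn.Conv2d(3, nef, 4, 2, 1)), nn.LeakyReLU(0.2, True),            # 32 -> 16
            spectral_norm(nn.Conv2d(nef, 2 * nef, 4, 2, 1)), nn.LeakyReLU(0.2, True),      # 16 -> 8
            spectral_norm(nn.Conv2d(2 * nef, 4 * nef, 4, 2, 1)), nn.LeakyReLU(0.2, True),  # 8 -> 4
        )
        self.head = nn.Sequential(
            spectral_norm(nn.Linear(4 * nef * 4 * 4 + z_dim, 256)), nn.LeakyReLU(0.2, True),
            spectral_norm(nn.Linear(256, 1)),
        )

    def forward(self, x, z):
        hx = self.x_enc(x).flatten(1)                            # encode x into a feature vector
        return self.head(torch.cat([hx, z], dim=1)).squeeze(1)   # (B,) values of f_alpha(x, z)
```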

3.6. Inference and synthesis sampling

Let πα(x) = ∫ πα(x, z)dz be the marginal density of the latent EBM. The maximum likelihood learning of α is based on min_α KL(qdata(x)‖πα(x)), because minimizing KL(qdata(x)‖πα(x)) is equivalent to maximizing the log-likelihood Eqdata(x)[log πα(x)]. The learning gradient is

$$\frac{\partial}{\partial\alpha} \mathrm{E}_{q_{\rm data}(x)}[\log \pi_\alpha(x)] = \mathrm{E}_{q_{\rm data}(x)\pi_\alpha(z|x)}\left[\frac{\partial}{\partial\alpha} f_\alpha(x, z)\right] - \mathrm{E}_{\pi_\alpha(x, z)}\left[\frac{\partial}{\partial\alpha} f_\alpha(x, z)\right]. \quad (8)$$

This is a well-known result in latent EBM [1, 25]. On the right-hand side of the above equation, the two expectations can be approximated by Monte Carlo sampling. For each observed image, sampling from πα(z|x) is to infer z from x. We call it the inference sampling. In the literature, it is called the positive phase [1, 17]. It is also called clamped sampling, where x is an observed image and is fixed. Sampling from πα(x, z) is to generate synthesized examples from the model. We call it the synthesis sampling. In the literature, it is called the negative phase. It is also called unclamped sampling, where x is also generated from the model.
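To make the cost of these two sampling steps concrete, here is a hedged sketch of short-run Langevin dynamics, one standard MCMC choice for latent EBMs; it is not the procedure used in this paper, whose joint training replaces both chains with direct sampling. The step size and chain length are arbitrary illustrative values.

```python
import torch

def langevin_synthesis(ebm, x, z, n_steps=60, step=0.01):
    """Unclamped (negative-phase) chain: move (x, z) toward high f_alpha(x, z), i.e., low energy.
    The clamped (positive-phase) chain is analogous but keeps x fixed and updates only z."""
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        z = z.detach().requires_grad_(True)
        f = ebm(x, z).sum()
        gx, gz = torch.autograd.grad(f, [x, z])
        x = x + 0.5 * step ** 2 * gx + step * torch.randn_like(x)
        z = z + 0.5 * step ** 2 * gz + step * torch.randn_like(z)
    return x.detach(), z.detach()
```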

4. Joint training

4.1. Objective function of joint training

We have the following three joint densities: (1) the generator density pθ(x, z) = p(z)pθ(x|z); (2) the data density qφ(x, z) = qdata(x)qφ(z|x); (3) the latent EBM density πα(x, z). We propose to learn the generator model parametrized by θ, the inference model parametrized by φ, and the latent EBM parametrized by α by the following divergence triangle:

$$\min_\theta \min_\phi \max_\alpha \ L(\theta, \alpha, \phi), \quad (9)$$
$$L = \mathrm{KL}(q_\phi \,\|\, p_\theta) + \mathrm{KL}(p_\theta \,\|\, \pi_\alpha) - \mathrm{KL}(q_\phi \,\|\, \pi_\alpha), \quad (10)$$

where all the densities qφ, pθ, and πα are joint densities of (x, z).

The above objective function is in a symmetric and anti-symmetric form. The anti-symmetry is caused by the negative sign in front of KL(qφ‖πα) and the maximization over α.

4.2. Learning of latent EBM

For learning the latent EBM, max_α L is equivalent to minimizing

$$L_E(\alpha) = \mathrm{KL}(q_\phi \,\|\, \pi_\alpha) - \mathrm{KL}(p_\theta \,\|\, \pi_\alpha). \quad (11)$$

In the above minimization, πα seeks to get close to the data density qφ and get away from pθ. Thus πα serves as a critic of pθ by comparing pθ against qφ. Because of the undirected form of πα, it can be more flexible than the directional pθ in approximating qφ.

The gradient of the above minimization is

$$\frac{d}{d\alpha} L_E(\alpha) = -\mathrm{E}_{q_{\rm data}(x) q_\phi(z|x)}\left[\frac{\partial}{\partial\alpha} f_\alpha(x, z)\right] + \mathrm{E}_{p_\theta(x, z)}\left[\frac{\partial}{\partial\alpha} f_\alpha(x, z)\right]. \quad (12)$$

Comparing Eqn. 12 to Eqn. 8, we replace πα(z|x) by qφ(z|x) in the inference sampling in the positive phase, and we replace πα(x, z) by pθ(x, z) in the synthesis sampling in the negative phase. Both qφ(z|x) and pθ(x, z) can be sampled directly. Thus the joint training enables the latent EBM to avoid MCMC in both inference sampling and synthesis sampling. In other words, the inference model serves as an approximate inference sampler for the latent EBM, and the generator network serves as an approximate synthesis sampler for the latent EBM.
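A sample-based sketch of this update (Eqn. 12, in the Monte Carlo form later written out in Eqn. 19): the positive term uses inference samples (xi, zi) from qdata(x)qφ(z|x), the negative term uses synthesis samples from pθ(x, z), and the gradient in Eqn. 12 is obtained by minimizing the loss below. Detaching the samples so that only α receives gradients is an implementation choice we assume here.

```python
def ebm_loss(ebm, x_data, z_inf, x_syn, z_syn):
    """L_E(alpha) estimate: -mean f_alpha on inference samples + mean f_alpha on synthesis samples."""
    # Samples are detached so that this loss only trains the latent EBM parameters alpha.
    return -ebm(x_data, z_inf.detach()).mean() + ebm(x_syn.detach(), z_syn.detach()).mean()
```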

4.3. Learning of generator network

For learning the generator network, min_θ L is equivalent to minimizing

$$L_G(\theta) = \mathrm{KL}(q_\phi \,\|\, p_\theta) + \mathrm{KL}(p_\theta \,\|\, \pi_\alpha), \quad (13)$$

where the gradient can be computed as

$$\frac{d}{d\theta} L_G(\theta) = -\mathrm{E}_{q_{\rm data}(x) q_\phi(z|x)}\left[\frac{\partial}{\partial\theta} \log p_\theta(x|z)\right] - \frac{\partial}{\partial\theta}\mathrm{E}_{p_\theta(x, z)}\left[f_\alpha(x, z)\right]. \quad (14)$$

In KL(qφ‖pθ), pθ appears on the right-hand side of the KL-divergence. Minimizing this KL-divergence with respect to θ requires pθ to cover all the major modes of qφ. If pθ is not flexible enough, it will strain itself to cover all the major modes, and as a result, pθ will become over-dispersed relative to qφ. This may be the reason that VAE tends to suffer in synthesis quality.

However, in the second term, KL(pθ‖πα), pθ appears on the left-hand side of the KL-divergence, and πα, which seeks to get close to qφ and get away from pθ in its dynamics, serves as a surrogate for the data density qφ and a target for pθ. Because pθ appears on the left-hand side of the KL-divergence, it has mode-chasing behavior, i.e., it may chase some major modes of πα (the surrogate of qφ), while it does not need to cover all the modes. Also note that in KL(pθ‖πα), we do not need to know Z(α) because it is a constant as far as θ is concerned.

Combining the above two KL-divergences, we approximately minimize a symmetrized version of the KL-divergence, S(qφ‖pθ) = KL(qφ‖pθ) + KL(pθ‖qφ) (assuming πα is close to qφ). This will correct the over-dispersion of VAE and improve the synthesis quality of VAE.

We refer the reader to the textbooks [10, 31] for the difference between min_p KL(q‖p) and min_p KL(p‖q). In the literature, they are also referred to as inclusive and exclusive KL, or KL and reverse KL.

4.4. An adversarial chasing game

The dynamics of πα is that it seeks to get close to the data density qφ and get away from pθ. But the dynamics of pθ is that it seeks to get close to πα (and at the same time also get close to the data density qφ). This defines an adversarial chasing game, i.e., πα runs toward qφ and away from pθ, while pθ chases πα. As a result, πα leads pθ toward qφ. pθ and πα form an actor-critic pair.


4.5. Learning of inference model

The learning of the inference model qφ(z|x) can be based on min_φ L(θ, α, φ), which is equivalent to minimizing

$$L_I(\phi) = \mathrm{KL}(q_\phi \,\|\, p_\theta) - \mathrm{KL}(q_\phi \,\|\, \pi_\alpha). \quad (15)$$

qφ(z|x) seeks to be close to pθ(z|x) relative to πα(z|x). That is, qφ(z|x) seeks to be the inference model for pθ. Meanwhile, πα(z|x) seeks to be close to qφ(z|x). This is also a chasing game: qφ(z|x) leads πα(z|x) to be close to pθ(z|x).

The gradient of L_I(φ) in Eqn. 15 can be readily computed as

$$\frac{d}{d\phi} L_I(\phi) = \frac{\partial}{\partial\phi}\mathrm{E}_{q_\phi(x, z)}\left[\log q_\phi(x, z) - \log p_\theta(x, z)\right] - \frac{\partial}{\partial\phi}\mathrm{E}_{q_\phi(x, z)}\left[\log q_\phi(x, z) - f_\alpha(x, z)\right]. \quad (16)$$

We may also learn qφ(z|x) by minimizing

$$L_I(\phi) = \mathrm{KL}(q_\phi \,\|\, p_\theta) + \mathrm{KL}(q_\phi \,\|\, \pi_\alpha), \quad (17)$$

where we let qφ(z|x) be close to both pθ(z|x) and πα(z|x) in variational approximation.

4.6. Algorithm

Algorithm 1 Joint training for VAE and latent EBM

Require: training images {xi}, i = 1, ..., n; number of learning iterations T; α, θ, φ ← initialized network parameters.
Ensure: estimated parameters {α, θ, φ}; generated samples {xi}, i = 1, ..., n.

1: Let t ← 0.
2: repeat
3:   Synthesis sampling for (zi, xi), i = 1, ..., M, using Eqn. 18.
4:   Inference sampling for (xi, zi), i = 1, ..., M, using Eqn. 18.
5:   Learn latent EBM: given {zi, xi} and {xi, zi}, update α ← α − ηα L′_E(α) using Eqn. 19, with learning rate ηα.
6:   Learn inference model: given {xi, zi}, update φ ← φ − ηφ L′_I(φ), with learning rate ηφ, using Eqn. 20.
7:   Learn generator network: given {zi, xi} and {xi, zi}, update θ ← θ − ηθ L′_G(θ), with learning rate ηθ, using Eqn. 21.
8:   Let t ← t + 1.
9: until t = T

The latent EBM, generator and inference model can be jointly trained using stochastic gradient descent based on Eqn. 12, Eqn. 14 and Eqn. 16. In practice, we use the sample average to approximate the expectation.

Synthesis and inference sampling. The expectations for gradient computation are based on the generator density pθ(x, z) and the data density qφ(x, z). To approximate the expectation over the generator density, Epθ(x,z)[·], we perform synthesis sampling through z ∼ p(z), x ∼ pθ(x|z) to get M samples (zi, xi). To approximate the expectation over the data density, Eqφ(x,z)[·], we perform inference sampling through x ∼ qdata(x), z ∼ qφ(z|x) to get M samples (xi, zi). Both pθ(x|z) and qφ(z|x) are assumed to be Gaussian, therefore we have

$$x = g_\theta(z) + \sigma e_1, \quad e_1 \sim \mathrm{N}(0, I_D); \qquad z = \mu_\phi(x) + V_\phi(x)^{1/2} e_2, \quad e_2 \sim \mathrm{N}(0, I_d), \quad (18)$$

where gθ(z) is the top-down deconvolutional network of the generator model (see Sec. 3.1), and µφ(x) and Vφ(x) are bottom-up convolutional networks for the mean vector and the diagonal variance-covariance matrix of the inference model (see Sec. 3.2). We follow the common practice [11] of taking x directly from the generator network, i.e., x ≈ gθ(z). Note that the synthesis sample (z, x) and the inference sample (x, z) are functions of the generator parameter θ and the inference parameter φ respectively, which ensures gradient back-propagation.
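Eqn. 18 in code, assuming the illustrative Generator and InferenceNet sketched in Sections 3.1 and 3.2; following the text, the synthesis sample takes x ≈ gθ(z) directly, and both samples remain differentiable with respect to θ and φ respectively.

```python
import torch

def synthesis_sample(g, M, z_dim=128):
    """(z_i, x_i) ~ p_theta(x, z): z ~ p(z), x = g_theta(z) (noise term omitted, as in the text)."""
    z = torch.randn(M, z_dim)
    x = g(z)                      # differentiable in theta
    return z, x

def inference_sample(enc, x_data):
    """(x_i, z_i) ~ q_phi(x, z): x ~ q_data(x), z = mu_phi(x) + V_phi(x)^{1/2} e, e ~ N(0, I)."""
    mu, log_var = enc(x_data)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # differentiable in phi
    return x_data, z
```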

Model learning. The obtained synthesis samples and inference samples can be used to approximate the expectations in model learning. Specifically, for latent EBM learning, the gradient in Eqn. 12 can be approximated by

$$\frac{d}{d\alpha} L_E(\alpha) \approx -\frac{1}{M}\sum_{i=1}^M \left[\frac{\partial}{\partial\alpha} f_\alpha(x_i, z_i)\right] + \frac{1}{M}\sum_{i=1}^M \left[\frac{\partial}{\partial\alpha} f_\alpha(x_i, z_i)\right]. \quad (19)$$

For inference model, the gradient in Eqn.16 can be approx-imated by:

d

dφLI(φ) ≈

∂φ

1

M

M∑i=1

[log qφ(xi, zi)− log pθ(xi, zi)]

− ∂

∂φ

1

M

M∑i=1

[log qφ(xi, zi)− fα(xi, zi)] .

(20)

For generator model, the gradient in Eqn.14 can be approx-imated by:

d

dθLG(θ) ≈ −

1

M

M∑i=1

[∂

∂θlog pθ(xi|zi)

]

− ∂

∂θ

1

M

M∑i=1

[fα(xi, zi)] . (21)


Notice that the gradients in Eqn. 20 and Eqn. 21 on the synthesis samples (zi, xi) and inference samples (xi, zi) can be easily back-propagated using Eqn. 18. The detailed training procedure is presented in Algorithm 1.
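Putting Eqns. 18–21 together, here is a compact sketch of one iteration of Algorithm 1, reusing the illustrative modules above; σ, the optimizers passed in from outside, and the closed-form Gaussian expressions for log qφ(z|x), log p(z) and log pθ(x|z) (up to additive constants) are assumptions consistent with Section 3 rather than the paper's exact implementation.

```python
import math
import torch

def train_step(x_data, g, enc, ebm, opt_g, opt_enc, opt_ebm, sigma=0.3, z_dim=128):
    """One iteration of Algorithm 1 (a sketch; sigma and z_dim are assumed hyper-parameters)."""
    M = x_data.size(0)

    def inf_sample():
        mu, log_var = enc(x_data)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # Eqn. 18, differentiable in phi
        return mu, log_var, z

    def syn_sample():
        z = torch.randn(M, z_dim)
        return z, g(z)                                          # Eqn. 18 with x = g_theta(z), differentiable in theta

    # Step 5 / Eqn. 19: latent EBM update (samples detached so only alpha is trained here).
    _, _, z_inf = inf_sample()
    z_syn, x_syn = syn_sample()
    loss_e = -ebm(x_data, z_inf.detach()).mean() + ebm(x_syn.detach(), z_syn).mean()
    opt_ebm.zero_grad()
    loss_e.backward()
    opt_ebm.step()

    # Step 6 / Eqn. 20: inference model update; log q_data(x) is constant in phi and omitted.
    mu, log_var, z_inf = inf_sample()
    log_q = -0.5 * (((z_inf - mu) ** 2) / log_var.exp() + log_var + math.log(2 * math.pi)).sum(1)
    log_p = (-0.5 * (z_inf ** 2).sum(1)                                          # log p(z), up to a constant
             - ((x_data - g(z_inf)) ** 2).flatten(1).sum(1) / (2 * sigma ** 2))  # log p_theta(x|z), up to a constant
    loss_i = (log_q - log_p).mean() - (log_q - ebm(x_data, z_inf)).mean()
    opt_enc.zero_grad()
    loss_i.backward()
    opt_enc.step()

    # Step 7 / Eqn. 21: generator update on fresh samples; gradients flow only into theta.
    _, _, z_inf = inf_sample()
    z_syn, x_syn = syn_sample()
    recon = ((x_data - g(z_inf.detach())) ** 2).flatten(1).sum(1) / (2 * sigma ** 2)  # -log p_theta(x|z), up to a constant
    loss_g = recon.mean() - ebm(x_syn, z_syn).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

Each optimizer's zero_grad call clears any gradients that leaked into its parameters from the other two losses, so the three updates do not contaminate one another.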

5. Experiments

In this section, we evaluate the proposed model on four tasks: image generation, test image reconstruction, out-of-distribution generalization, and anomaly detection. The learning of the inference model is based on Eqn. 15, and we also tested the alternative way to train the inference model using Eqn. 17 for generation and reconstruction. We mainly consider four datasets: CIFAR-10, CelebA [27], the Large-scale Scene Understanding (LSUN) dataset [39], and MNIST. We describe the datasets in more detail in the relevant subsections below. All training images are resized and scaled to [−1, 1] with no further pre-processing. All network parameters are initialized with a zero-mean Gaussian with standard deviation 0.02 and optimized using Adam [21]. We adopt a deconvolutional network structure similar to [34] for the generator model and the "mirror" convolutional structure for the inference model. Both structures involve batch normalization [19]. For the joint energy model πα(x, z), we use multiple layers of convolution to transform the observation x and the latent factor z, then concatenate them at a higher layer, which is similar to [26]. Spectral normalization is used as suggested in [29]. We refer to our project page (https://hthth0801.github.io/jointLearning/) for details.

5.1. Image generation

In this experiment, we evaluate the visual quality of the generated samples. A well-learned generator network pθ should generate samples that are realistic and share visual similarities with the training data. We mainly consider three commonly used datasets, CIFAR-10, CelebA [27] and LSUN [39], for generation and reconstruction evaluation. CIFAR-10 contains 60,000 color images of size 32 × 32, of which 50,000 images are for training and 10,000 are for testing. For the CelebA dataset, we resize the images to 64 × 64 and randomly select 10,000 images, of which 9,000 are for training and 1,000 are for testing. For the LSUN dataset, we select the bedroom category, which contains roughly 3 million images, and resize them to 64 × 64. We separate 10,000 images for testing and use the rest for training. The qualitative results are shown in Figure 1.

We further evaluate our model quantitatively using the Fréchet Inception Distance (FID) [28] in Table 1. We compare with baseline models including VAE [23], DCGAN [34], WGAN [2], CoopNet [38], ALICE [26], SVAE [4] and SNGAN [29]. The FID scores are taken from the relevant papers, and for missing evaluations, we re-evaluate them by utilizing the released code or re-implementing with similar structures and the optimal parameters indicated in those papers. From Table 1, our model achieves competitive generation performance compared to the listed baseline models. Further, compared to [14], which reports a 7.23 Inception Score (IS) on CIFAR-10 and 31.9 FID on CelebA, our model obtains 7.17 IS and 24.7 FID respectively. This shows that our joint training can greatly improve the synthesis quality compared to VAE alone. Note that SNGAN [29] obtains better generation on CIFAR-10, which has relatively small resolution, while on the other datasets, which have relatively high resolution and diverse patterns, our model obtains more favorable results and has more stable training.

5.2. Testing image reconstruction

In this experiment, we evaluate the accuracy of the learned inference model by test image reconstruction. A well-trained inference model should not only help to learn the latent EBM but also learn to match the true posterior pθ(z|x) of the generator model. Therefore, in practice, a well-learned inference model can be balanced to render both realistic generation, as shown in the previous section, and faithful reconstruction of testing images.

We evaluate the model on the hold-out testing sets of CIFAR-10, CelebA and LSUN-bedroom. Specifically, we use the 10,000 testing images of CIFAR-10, and 1,000 and 10,000 hold-out testing images for CelebA and LSUN-bedroom respectively. The testing images and the corresponding reconstructions are shown in Figure 2. We also quantitatively compare with baseline models (ALI [9], ALICE [26], SVAE [4]) using the root mean square error (RMSE). Note that for this experiment, we only compare with the relevant baseline models that contain a joint discriminator on (x, z) and can achieve decent generation quality. Besides, we do not consider GANs and their variants because they have no inference model involved and are infeasible for image reconstruction. Table 2 shows the results. VAE is naturally integrated into our probabilistic model for joint learning. However, using VAE alone can be extremely ineffective on complex datasets. Our model instead achieves both high generation quality and accurate reconstruction.

5.3. Out-of-distribution generalization

In this experiment, we evaluate out-of-distribution (OOD) detection using the learned latent EBM πα(x, z). If the energy model is well-learned, then a training image, together with its inferred latent factor, should form a local energy minimum. Unseen images from distributions other than the training one should be assigned relatively higher energies. This is closely related to the model of associative memory as observed by Hopfield [18].

We learn the proposed model on CIFAR-10 training images, then utilize the learned energy model to classify the CIFAR-10 testing images against other OOD images using the energy value (i.e., the negative log-likelihood). We use the area under the ROC curve (AUROC) score as our OOD metric following [15], and we use Textures [5], uniform noise, SVHN [32] and CelebA images as OOD distributions (Figure 4 provides CIFAR-10 test images and examples of OOD images). We compare with ALICE [26], SVAE [4] and the recent EBM [8] as our baseline models. The CIFAR-10 training for ALICE and SVAE follows their network structures and hyper-parameters, and the scores for EBM are taken directly from [8]. Table 3 shows the AUROC scores. We also provide histograms of relative likelihoods for the OOD distributions in Figure 3, which further verify that images from OOD distributions are assigned relatively low log-likelihood (i.e., high energy) compared to the training distribution. Our latent EBM thus learns to assign low energy to the training distribution and high energy to data that comes from OOD distributions.
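As a sketch of how such an OOD score can be computed and evaluated: up to the constant log Z(α), the unnormalized log-likelihood of x with an inferred latent is fα(x, z). Here we take z = µφ(x) from the inference model, which is one plausible reading of "inferred latent factor" and an assumption on our part, and compute AUROC with scikit-learn.

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def ood_scores(ebm, enc, x):
    """Higher f_alpha(x, z) = lower energy = more in-distribution."""
    mu, _ = enc(x)
    return ebm(x, mu)

def ood_auroc(ebm, enc, x_in, x_out):
    """AUROC with in-distribution (CIFAR-10 test) labeled 1 and OOD images labeled 0."""
    s_in, s_out = ood_scores(ebm, enc, x_in), ood_scores(ebm, enc, x_out)
    labels = torch.cat([torch.ones(len(s_in)), torch.zeros(len(s_out))])
    scores = torch.cat([s_in, s_out])
    return roc_auc_score(labels.numpy(), scores.numpy())
```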

Figure 1: Generated samples. Left: CIFAR-10 generation. Middle: CelebA generation. Right: LSUN bedroom generation.

Model      VAE    DCGAN  WGAN  CoopNet  ALICE  SVAE  SNGAN  Ours(+)  Ours
CIFAR-10   109.5  37.7   40.2  33.6     48.6   43.5  29.3   33.3     30.1
CelebA     99.09  38.4   36.4  56.6     46.1   40.7  50.4   29.5     24.7
LSUN       175.2  70.4   67.7  35.4     72     -     67.8   31.4     27.3

Table 1: Sample quality evaluation using FID scores on various datasets. Ours(+) denotes our proposed method with the inference model trained using Eqn. 17.

Figure 2: Test image reconstruction. Top: CIFAR-10. Bottom: CelebA. Left: test images. Right: reconstructed images.


Model     CIFAR-10  CelebA  LSUN-bedroom
VAE       0.192     0.197   0.164
ALI       0.558     0.720   -
ALICE     0.185     0.214   0.181
SVAE      0.258     0.209   -
Ours(+)   0.184     0.208   0.169
Ours      0.177     0.190   0.169

Table 2: Testing image reconstruction evaluation using RMSE. Ours(+) denotes our proposed method with the inference model trained using Eqn. 17.



Figure 3: Histograms of log-likelihood (unnormalized) for various datasets. We provide the histogram comparison between the CIFAR-10 test set and CelebA, uniform random noise, SVHN, Texture, and the CIFAR-10 train set respectively.

Figure 4: Illustration of images from CIFAR-10 test, SVHN, uniform random noise, Texture and CelebA. The last four are considered to be OOD distributions.

Model   SVHN  Uniform  Texture  CelebA
EBM     0.63  1.0      0.48     -
ALICE   0.29  0.0      0.40     0.48
SVAE    0.42  0.29     0.5      0.52
Ours    0.68  1.0      0.56     0.56

Table 3: AUROC scores of OOD classification on various image datasets. All models are learned on the CIFAR-10 train set.

5.4. Anomaly detection

In this experiment, we take a closer and more general view of the learned latent EBM, with an application to anomaly detection. Unsupervised anomaly detection is one of the most important problems in machine learning and offers great potential in many areas, including cyber-security, medical analysis and surveillance. It is similar to the out-of-distribution detection discussed before, but can be more challenging in practice because the anomalous data may come from a distribution that is similar to, and not entirely apart from, the training distribution. We evaluate our model on the MNIST benchmark dataset.

The MNIST dataset contains 60,000 gray-scale images of size 28 × 28 depicting handwritten digits. Following the same experimental setting as [24, 40], we make each digit class an anomaly and consider the remaining 9 digits as normal examples. Our model is trained with only normal data and tested with both normal and anomalous data. We use the energy function as our decision function and compare with the BiGAN-based anomaly detection model [40], the recent MEG [24] and the VAE model, using the area under the precision-recall curve (AUPRC) as in [40]. Table 4 shows the results.

Holdout  VAE    MEG            BiGAN-σ        Ours
1        0.063  0.281 ± 0.035  0.287 ± 0.023  0.297 ± 0.033
4        0.337  0.401 ± 0.061  0.443 ± 0.029  0.723 ± 0.042
5        0.325  0.402 ± 0.062  0.514 ± 0.029  0.676 ± 0.041
7        0.148  0.290 ± 0.040  0.347 ± 0.017  0.490 ± 0.041
9        0.104  0.342 ± 0.034  0.307 ± 0.028  0.383 ± 0.025

Table 4: AUPRC scores for unsupervised anomaly detection on MNIST. Numbers are taken from [24] and results for our model are averaged over the last 10 epochs to account for variance.

6. Conclusion

This paper proposes a joint training method to learn both the VAE and the latent EBM simultaneously, where the VAE serves as an actor and the latent EBM serves as a critic. The objective function is of a simple and compact form of divergence triangle that involves three KL-divergences between three joint densities on the latent vector and the image. This objective function integrates both variational learning and adversarial learning. Our experiments show that the joint training improves the synthesis quality of VAE, and it learns a reasonable energy function that is capable of anomaly detection.

Learning a well-formed energy landscape remains a challenging problem, and our experience suggests that the learned energy function can be sensitive to the settings of hyper-parameters within the training algorithm. In future work, we shall further improve the learning of the energy function. We shall also explore joint training of models with multiple layers of latent variables in the styles of the Helmholtz machine and the Boltzmann machine.

Acknowledgment

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063.


References

[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[2] Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
[4] Liqun Chen, Shuyang Dai, Yunchen Pu, Erjin Zhou, Chunyuan Li, Qinliang Su, Changyou Chen, and Lawrence Carin. Symmetric variational autoencoder and connections to adversarial learning. In International Conference on Artificial Intelligence and Statistics, pages 661–669, 2018.
[5] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
[6] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[7] Jeff Donahue, Philipp Krahenbuhl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[8] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
[9] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[12] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[13] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In AAAI, volume 3, page 13, 2017.
[14] Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8670–8679, 2019.
[15] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[16] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[17] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[18] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[23] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[24] Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
[25] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[26] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pages 5495–5503, 2017.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[28] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.
[29] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[30] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
[31] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[32] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[33] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019.
[34] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[35] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[36] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 693–700, 2010.
[37] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
[38] Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408, 2016.
[39] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[40] Houssam Zenati, Chuan Sheng Foo, Bruno Lecouat, Gaurav Manek, and Vijay Ramaseshan Chandrasekhar. Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222, 2018.
[41] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.