
Variational Generative Stochastic Networks with Collaborative Shaping

Philip Bachman and Doina Precup
McGill University, School of Computer Science


What do we want from our model?

We want to develop a generative model that can:

Shape its distribution G to match a target distribution D.

Generate local random walks through G.

Generate independent samples from G.

Provide an efficient estimate of log G(x).


What tools will we use?

We’ll approach this problem by combining recent work on:

Denoising auto-encoders as generative models [BYAV13, BTLAY14]

- This will give us local random walks.

Variational inference for deep generative models [KW14, RMW14]

- This will give us independent samples and log G(x).

Approximate Bayesian Computation [GDKC14a, GDKC14b, GPAM+14]

- This will keep the random walks near D.


Useful Terminology

Some definitions we’ll use to present our work:

x ∈ X will indicate "observable" variables.

z ∈ Z will indicate "latent" variables.

q_φ(z|x) – this is the corruption process.

p_θ(x|z) – this is the reconstruction distribution.

p*(z) – this is the prior distribution.

G/D – these are the model/target distributions.


Denoising auto-encoders and Markov chains

A denoising auto-encoder trains p_θ(x|z) to match the conditionals observed in pairs (x_i, z_i) generated by sampling x_i ∼ D and then z_i ∼ q_φ(z|x_i).

Noise typically interpreted as imposed on x can be more generally thought of as noise in the encoder q_φ(z|x_i).

Given p_θ(x|z) and q_φ(z|x), we can start with x_0 ∼ D and then iterate between sampling z_i ∼ q_φ(z|x_i) and x_{i+1} ∼ p_θ(x|z_i).

This defines a Markov chain over x ∈ X with transition operator T_θ(x_{t+1}|x_t) ∝ Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t).
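The q/p chain can be sketched in a toy 1-D setting where D = N(0, 1), the corruption q adds Gaussian noise, and the reconstruction p is the exact posterior under D. All specifics here (σ², step counts, seed) are illustrative assumptions, not the authors' setup — with an exact reconstruction distribution, the chain's stationary distribution is D itself:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.5  # corruption variance (assumed for this toy example)

def corrupt(x):
    # q(z|x): additive Gaussian noise
    return x + rng.normal(0.0, np.sqrt(sigma2))

def reconstruct(z):
    # exact p(x|z) when D = N(0, 1): the Gaussian posterior over x given z
    post_mean = z / (1.0 + sigma2)
    post_var = sigma2 / (1.0 + sigma2)
    return post_mean + rng.normal(0.0, np.sqrt(post_var))

# Unroll the chain: x0 ~ D, then alternate z ~ q(z|x) and x ~ p(x|z).
x = rng.normal()  # x0 ~ D = N(0, 1)
samples = []
for t in range(20000):
    z = corrupt(x)
    x = reconstruct(z)
    samples.append(x)

s = np.array(samples[1000:])  # drop burn-in
print(s.mean(), s.var())      # stationary distribution ≈ N(0, 1)
```

When p_θ is only a learned approximation of the true conditional, the chain's stationary distribution drifts away from D — which is exactly the failure mode the shaping techniques below address.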


GSNs and Simple GSNs

GSNs extend the standard denoising auto-encoder by changing q_φ(z|x_i) to q_φ(z|x_i, z_{i-1}).

This permits additional "hidden state" (in z_{i-1}), which may improve Markov chain sampling behavior.

Simple GSNs are GSNs where q_φ(z|x_i, z_{i-1}) is independent of z_{i-1}.

Any GSN trained using walkback samples is equivalent to a Simple GSN with a corruption process W(z|x; p_θ, q_φ) defined by a procedural wrapper around p_θ and q_φ.

- But... what's walkback?


Controlling dynamics of the q/p chain

Problem 1: if q_φ(z|x) is too local, the q/p chain won't mix well between separated modes of D.

Walkback reduces locality by iteratively applying q/p when generating each corrupt/reconstruct pair (x, z).¹

- We obtain a similar effect by minimizing a modified VFE.

Problem 2: when unrolled, the q/p chain may visit z not seen while training, causing it to wander away from D.

Walkback removes spurious attractors from the q/p chain by repeatedly unrolling it and pulling it back to D.

- We obtain a similar effect using techniques from ABC.

¹ E.g., if q adds small Gaussian noise (μ = 0), then p will be Gaussian and walkback will (roughly) perform iterated Gaussian convolution.
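The footnote's claim can be checked numerically: composing k rounds of additive N(0, σ²) corruption is one Gaussian convolution per round, so the variances add. The values of σ and k below are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.4  # per-step corruption noise (assumed)
k = 5        # number of walkback-style corruption rounds (assumed)

# Apply k rounds of additive Gaussian corruption to many copies of x = 0.
x = np.zeros(200000)
for _ in range(k):
    x += rng.normal(0.0, sigma, size=x.shape)

# Iterated Gaussian convolution: variances add, so Var[x] ≈ k * sigma**2.
print(x.var(), k * sigma**2)  # both ≈ 0.8
```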


Balancing local vs. non-local corruption in GSNs

Figure: (a) the corruption process is more local, making p(x|z) simple but causing slow mixing between modes of D in the q/p chain. (b) the corruption process is less local, making p(x|z) tricky but causing fast mixing between modes of D in the q/p chain.


Variational Simple GSNs

We train a VAE comprising q_φ and p_θ by minimizing²:

E_{x∼D} E_{z∼q_φ(z|x)} [ −log p_θ(x|z) + λ KL(q_φ(z|x) ‖ p*(z)) ]

We run a Markov chain by feeding the VAE back into itself.

λ lets us control E_{(x_1,x_2)∼D} KL(q_φ(z|x_2) ‖ q_φ(z|x_1)), which, roughly speaking, corresponds to the locality of q_φ.

But we still need better control over the dynamics of the unrolled chain (see example videos).

We'll use Approximate Bayesian Computation to shape the distribution G emitted by our unrolled, self-looped VAE.

² We also penalize "squared KL above the mean" – see our paper for details.
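The re-weighted free energy above might be computed as follows. The Bernoulli decoder, diagonal-Gaussian encoder, and all function names are assumptions of this sketch, not details fixed by the slides:

```python
import numpy as np

def vae_free_energy(x, x_recon_logits, mu, log_var, lam=1.0):
    """Per-example variational free energy with a re-weighted KL term:
       E_q[-log p(x|z)] + lam * KL(q(z|x) || N(0, I)).
    Bernoulli reconstruction and a diagonal-Gaussian q are modeling
    assumptions of this sketch."""
    # Bernoulli negative log-likelihood from logits (one MC sample of z).
    nll = np.sum(np.logaddexp(0.0, x_recon_logits) - x * x_recon_logits, axis=1)
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return nll + lam * kl

# Tiny usage example with made-up numbers:
x = np.array([[1.0, 0.0]])
logits = np.array([[2.0, -2.0]])
mu = np.zeros((1, 3))
log_var = np.zeros((1, 3))
print(vae_free_energy(x, logits, mu, log_var, lam=4.0))  # KL term is 0 here
```

With mu = 0 and log_var = 0 the encoder already matches the prior, so only the reconstruction term contributes; raising λ penalizes any departure of q_φ(z|x) from p*(z) more strongly, pushing q_φ toward a less local corruption process.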


Shaping Markov chain dynamics with ABC

Approximate Bayesian Computation is (roughly) based on training a generative model by minimizing some measure of dissimilarity between G and D.

Examples include moment matching (e.g. MMD) and classification-based methods (e.g. GANs by Goodfellow et al. and related work by Gutmann et al.).

For our models, we:

Train a guide function f to estimate log(D/G).

Move mass in G towards increasing f when f < 0.

Don't move G-mass emitted in regions where f > 0.

Show a global minimum occurs iff D(x) = G(x) for all x.³

³ The first three points, though vague, are sufficient for this to hold.
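One concrete (hypothetical) instantiation of the guide: a logistic classifier trained to separate samples of D from samples of G. With balanced classes, the optimal classifier's logit equals log(D(x)/G(x)). The Gaussian stand-ins for D and G, the feature map, and all hyperparameters below are illustrative, not the authors' choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for illustration: "data" D = N(0, 1) and "model" G = N(1.5, 1).
xd = rng.normal(0.0, 1.0, 4000)
xg = rng.normal(1.5, 1.0, 4000)

def feats(x):
    # simple polynomial features; sufficient for ratios of Gaussians
    return np.stack([np.ones_like(x), x, x**2], axis=1)

X = np.vstack([feats(xd), feats(xg)])
y = np.concatenate([np.ones(4000), np.zeros(4000)])  # 1 = came from D
w = np.zeros(3)
for _ in range(2000):  # plain full-batch logistic-regression gradient ascent
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p) / len(y)

def f(x):
    # guide: logit of P(D|x), which approximates log D(x)/G(x)
    return (feats(np.atleast_1d(float(x))) @ w)[0]

print(f(0.0) > 0, f(1.5) < 0)  # D-heavy region vs. G-heavy region
```

The sign of f then tells the shaping step which G-mass to move: f < 0 marks regions where G overshoots D, matching the second and third bullets above.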


Summarizing our model

Figure: (a) the top panel gives a schematic for optimizing variational free-energy; the bottom panel shows valid ABC loss functions for q/p chain samples based on f ≈ log(D/G). (b) the full graph for our model.


Results

We now briefly look at:

How re-weighting the free-energy KL-divergence term affects G.

How collaborative unrolling affects chain dynamics.

How our models perform on different quantitative tests.


MNIST – KLd comparison, Side-by-side

Figure: Side-by-side comparison of independent samples from VAEs trained with varying strengths of KL-divergence penalty: (a) KL weight λ = 1, squared-KL penalty 0; (b) λ = 4, penalty 0.1; (c) λ = 24, penalty 0.1. Scores on the GPDE test were (L-to-R): 220, 265, and 330.


Chain behavior with and without guided unrolling

Figure: Comparing chains generated by models learned with and without collaborative shaping. The samples in (a) were generated by a corrupt-reconstruct pair q_φ/p_θ trained for 100k updates as a variational auto-encoder (VAE), and then 100k updates as a 6-step unrolled, collaboratively-guided chain. Samples in (b) are from the same model but with 200k updates of standard VAE training.


Visualizing Markov chain dynamics – MNIST

(Figure: chain samples illustrating shaping and chain diversity, comparing no shaping vs. 6 steps of shaping.)


Visualizing Markov chain dynamics – TFD

(Figure: chain samples under the standard free-energy vs. strong KL regularization.)


Conclusion

Problem 1: if q_φ(z|x) is too local, the chain generated by T_θ(x_{t+1}|x_t) = (1/C) Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t) won't mix well between separated modes of D.

- We address this using modified KL terms in the VFE.

Problem 2: when unrolled, the q/p chain may visit z not seen while training, causing it to wander away from D.

- We address this using collaborative unrolling.

Problem 3: when q_φ(z|x) is non-local, p_θ(x|z) may need to capture sophisticated structure, e.g. multi-modality.

- Future work – construct q/p peu à peu – see our poster/talk at EWRL or our poster at the DL workshop.

Questions?


References I

[BTLAY14] Yoshua Bengio, Eric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. arXiv:1306.1091v5 [cs.LG], 2014.

[BYAV13] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. Advances in Neural Information Processing Systems (NIPS), 2013.

[GDKC14a] Michael U. Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Classifier ABC. MCMSki IV (posters), 2014.

[GDKC14b] Michael U. Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Likelihood-free inference via classification. arXiv:1407.4981v1 [stat.CO], 2014.


References II

[GPAM+14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS), 2014.

[KW14] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[RMW14] Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (ICML), 2014.


Graphical models for DAEs, GSNs, and Simple GSNs

Figure: (a) the Markov chain for a Generalized DAE: x̃_0 ∼ q_φ(x̃|x_0), x_1 ∼ p_θ(x|x̃_0), and so on. (b) the Markov chain for a GSN: z_0 ∼ q_φ(z), x_0 ∼ p_θ(x|z_0), z_1 ∼ q_φ(z|x_0, z_0), and so on. (c) the Markov chain for a Simple GSN using a corruption process W(z|x; p_θ, q_φ) formed via the walkback procedure: x_0 ∼ D(x), z_0 ∼ W(z|x_0; p_θ, q_φ), x_1 ∼ p_θ(x|z_0), and so on.


Wrapping p_θ and q_φ via Walkback

Input: data sample x, corruptor q_φ, reconstructor p_θ
Initialize an empty training-pair list P_xz = {}
Set z to some initial vector in Z.
for i = 1 to k_burn-in do
    Sample z ∼ q_φ(z|x, z) (update z with x held fixed at the data sample).
end for
for i = 1 to k_roll-out do
    Sample z ∼ q_φ(z|x, z), then sample x ∼ p_θ(x|z).
    Add pair (x, z) to P_xz.
end for
Return: P_xz
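A minimal Python rendering of the procedure above, with toy Gaussian stand-ins for q_φ and p_θ (in the real model these would be learned networks, and x, z would be vectors rather than scalars):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for q_phi(z|x, z_prev) and p_theta(x|z); illustrative only.
def q_sample(x, z_prev):
    return 0.5 * (x + z_prev) + rng.normal(0.0, 0.3)

def p_sample(z):
    return z + rng.normal(0.0, 0.3)

def walkback_pairs(x, q_sample, p_sample, k_burn_in=3, k_roll_out=5):
    """Wrap q/p into a corruption process via walkback: burn in z around
    the clamped input x, then unroll the free-running chain, recording
    (x, z) training pairs along the way."""
    pairs = []
    z = 0.0  # some initial vector in Z
    for _ in range(k_burn_in):   # burn-in: x stays clamped to the data
        z = q_sample(x, z)
    for _ in range(k_roll_out):  # roll-out: the chain runs freely
        z = q_sample(x, z)
        x = p_sample(z)
        pairs.append((x, z))
    return pairs

pairs = walkback_pairs(1.0, q_sample, p_sample)
print(len(pairs))  # 5
```

The returned pairs come from states the unrolled chain actually visits, which is what lets walkback training remove spurious attractors away from D.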


GSN theory – 1

Theorem

If p_θ(x|z) is a consistent estimator of the true conditional distribution P(x|z), and the transition operator T_θ(x_{t+1}|x_t) that samples z_t from q_φ(z_t|x_t) and then samples x_{t+1} from p_θ(x_{t+1}|z_t) defines an ergodic Markov chain, then as the number of examples used to train p_θ(x|z) goes to infinity (i.e. as p_θ(x|z) converges to P(x|z)), the asymptotic distribution of the Markov chain with transition operator T_θ(x_{t+1}|x_t) converges to the target distribution D.

– Theorem modified slightly from "Generalized Denoising Auto-encoders as Generative Models" by Bengio et al., NIPS 2013.


GSN theory – 2

Corollary

Let X be a set in which every pair of points is connected by a finite-length path contained in X. Suppose that for each x ∈ X there exists a "shell" set S_x ⊆ X such that all paths between x and any point in X \ S_x pass through some point in S_x whose shortest path to x has length > 0. Suppose that for all x and for all x′ ∈ S_x ∪ {x}, there exists z_xx′ such that q_φ(z_xx′|x) > 0 and p_θ(x′|z_xx′) > 0. Then the Markov chain with transition operator T_θ(x_{t+1}|x_t) = Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t) is ergodic.

– Corollary modified slightly from "Deep Generative Stochastic Networks Trainable by Backprop" by Bengio et al., ICML 2014.


Variational Free-energy – 1

Assume we have the distributions p_θ(x|z), p*(z), and q_φ(z|x). Then we define the following derived distributions:

p_θ(x; p*) = Σ_z p_θ(x|z) p*(z)    (1)

p_θ(z|x; p*) = p_θ(x|z) p*(z) / p_θ(x; p*)    (2)

p_θ(x, z; p*) = p_θ(x|z) p*(z) = p_θ(z|x; p*) p_θ(x; p*)    (3)


Variational Free-energy – 2.0

log p_θ(x; p*) = Σ_z q_φ(z|x) log p_θ(x; p*)    (4)

= Σ_z q_φ(z|x) log [ p_θ(z|x; p*) p_θ(x; p*) / p_θ(z|x; p*) ]    (5)

= Σ_z q_φ(z|x) log [ p_θ(x, z; p*) / p_θ(z|x; p*) ]    (6)

= Σ_z q_φ(z|x) ( log p_θ(x, z; p*) − log q_φ(z|x) + log q_φ(z|x) − log p_θ(z|x; p*) )    (7)
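Grouping the terms in (7) gives the usual decomposition into a variational free-energy plus a non-negative KL gap. This completing step (and the symbol F for the free-energy) is implied by the surrounding slides rather than shown explicitly:

```latex
\log p_\theta(x; p^*)
  = \underbrace{\textstyle\sum_z q_\phi(z|x)\,
      \big(\log p_\theta(x,z;p^*) - \log q_\phi(z|x)\big)}_{-\,\mathcal{F}(x;\,\theta,\phi)}
  \;+\; \underbrace{\mathrm{KL}\!\big(q_\phi(z|x)\,\big\|\,p_\theta(z|x;p^*)\big)}_{\geq\, 0}
```

Since p_θ(x, z; p*) = p_θ(x|z) p*(z) by (3), the free-energy expands as F(x; θ, φ) = E_{z∼q_φ(z|x)}[−log p_θ(x|z)] + KL(q_φ(z|x) ‖ p*(z)), i.e. the λ = 1 case of the training objective above, so minimizing F maximizes a lower bound on log p_θ(x; p*).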


MNIST – Weak KLd penalty

Figure: Models trained with KL weight λ = 1 and squared-KL penalty 0.0. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.


MNIST – Medium KLd penalty

Figure: Models trained with KL weight λ = 4 and squared-KL penalty 0.1. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.


MNIST – Strong KLd penalty

Figure: Models trained with KL weight λ = 24 and squared-KL penalty 0.1. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.


Training progress comparison – MNIST

(Samples shown after 30k, 60k, and 120k training updates.)


Training progress comparison – TFD

(Samples shown after 50k, 100k, and 150k training updates.)


Multi-test – TFD



Multi-test – MNIST

