
Variational Generative Stochastic Networks with Collaborative Shaping

Philip Bachman and Doina Precup
McGill University, School of Computer Science


What do we want from our model?

We want to develop a generative model that can:

Shape its distribution G to match a target distribution D.

Generate local random walks through G.

Generate independent samples from G.

Provide an efficient estimate of log G(x).


What tools will we use?

We’ll approach this problem by combining recent work on:

Denoising auto-encoders as generative models [BYAV13, BTLAY14]

- This will give us local random walks.

Variational inference for deep generative models [KW14, RMW14]

- This will give us independent samples and log G(x).

Approximate Bayesian Computation [GDKC14a, GDKC14b, GPAM+14]

- This will keep the random walks near D.


Useful Terminology

Some definitions we’ll use to present our work:

x ∈ X will indicate "observable" variables.

z ∈ Z will indicate "latent" variables.

q_φ(z|x) – this is the corruption process.

p_θ(x|z) – this is the reconstruction distribution.

p*(z) – this is the prior distribution.

G/D – these are the model/target distributions.


Denoising auto-encoders and Markov chains

A denoising auto-encoder trains p_θ(x|z) to match the conditionals observed in pairs (x_i, z_i) generated by sampling x_i ∼ D and then z_i ∼ q_φ(z|x_i).

Noise typically interpreted as imposed on x can be more generally thought of as noise in the encoder q_φ(z|x_i).

Given p_θ(x|z) and q_φ(z|x), we can start with x_0 ∼ D and then iterate between sampling z_i ∼ q_φ(z|x_i) and x_{i+1} ∼ p_θ(x|z_i).

This defines a Markov chain over x ∈ X with transition operator T_θ(x_{t+1}|x_t) ∝ Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t).
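The q/p chain can be sketched in a toy 1-D setting where D = N(0, 1), the corruption q adds Gaussian noise, and the reconstruction p is the exact posterior under D. All specifics here (σ², step counts, seed) are illustrative assumptions, not the authors' setup — with an exact reconstruction distribution, the chain's stationary distribution is D itself:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.5  # corruption variance (assumed for this toy example)

def corrupt(x):
    # q(z|x): additive Gaussian noise
    return x + rng.normal(0.0, np.sqrt(sigma2))

def reconstruct(z):
    # exact p(x|z) when D = N(0, 1): the Gaussian posterior over x given z
    post_mean = z / (1.0 + sigma2)
    post_var = sigma2 / (1.0 + sigma2)
    return post_mean + rng.normal(0.0, np.sqrt(post_var))

# Unroll the chain: x0 ~ D, then alternate z ~ q(z|x) and x ~ p(x|z).
x = rng.normal()  # x0 ~ D = N(0, 1)
samples = []
for t in range(20000):
    z = corrupt(x)
    x = reconstruct(z)
    samples.append(x)

s = np.array(samples[1000:])  # drop burn-in
print(s.mean(), s.var())      # stationary distribution ≈ N(0, 1)
```

When p_θ is only a learned approximation of the true conditional, the chain's stationary distribution drifts away from D — which is exactly the failure mode the shaping techniques below address.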


GSNs and Simple GSNs

GSNs extend the standard denoising auto-encoder by changing q_φ(z|x_i) to q_φ(z|x_i, z_{i-1}).

This permits additional "hidden state" (in z_{i-1}), which may improve Markov chain sampling behavior.

Simple GSNs are GSNs where q_φ(z|x_i, z_{i-1}) is independent of z_{i-1}.

Any GSN trained using walkback samples is equivalent to a Simple GSN with a corruption process W(z|x; p_θ, q_φ) defined by a procedural wrapper around p_θ and q_φ.

- But... what's walkback?


Controlling dynamics of the q/p chain

Problem 1: if q_φ(z|x) is too local, the q/p chain won't mix well between separated modes of D.

Walkback reduces locality by iteratively applying q/p when generating each corrupt/reconstruct pair (x, z).¹

- We obtain a similar effect by minimizing a modified VFE.

Problem 2: when unrolled, the q/p chain may visit z not seen while training, causing it to wander away from D.

Walkback removes spurious attractors from the q/p chain by repeatedly unrolling it and pulling it back to D.

- We obtain a similar effect using techniques from ABC.

¹ E.g., if q adds small Gaussian noise (μ = 0), then p will be Gaussian and walkback will (roughly) perform iterated Gaussian convolution.
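The footnote's claim can be checked numerically: composing k rounds of additive N(0, σ²) corruption is one Gaussian convolution per round, so the variances add. The values of σ and k below are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.4  # per-step corruption noise (assumed)
k = 5        # number of walkback-style corruption rounds (assumed)

# Apply k rounds of additive Gaussian corruption to many copies of x = 0.
x = np.zeros(200000)
for _ in range(k):
    x += rng.normal(0.0, sigma, size=x.shape)

# Iterated Gaussian convolution: variances add, so Var[x] ≈ k * sigma**2.
print(x.var(), k * sigma**2)  # both ≈ 0.8
```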


Balancing local vs. non-local corruption in GSNs

Figure: (a) the corruption process is more local, making p(x|z) simple but causing slow mixing between modes of D in the q/p chain. (b) the corruption process is less local, making p(x|z) tricky but causing fast mixing between modes of D in the q/p chain.


Variational Simple GSNs

We train a VAE comprising q_φ and p_θ by minimizing²:

E_{x∼D} E_{z∼q_φ(z|x)} [ −log p_θ(x|z) + λ KL(q_φ(z|x) ‖ p*(z)) ]

We run a Markov chain by feeding the VAE back into itself.

λ lets us control E_{(x_1,x_2)∼D} KL(q_φ(z|x_2) ‖ q_φ(z|x_1)), which, roughly speaking, corresponds to the locality of q_φ.

But we still need better control over the dynamics of the unrolled chain (see example videos).

We'll use Approximate Bayesian Computation to shape the distribution G emitted by our unrolled, self-looped VAE.

² We also penalize "squared KL above the mean" – see our paper for details.
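The re-weighted free energy above might be computed as follows. The Bernoulli decoder, diagonal-Gaussian encoder, and all function names are assumptions of this sketch, not details fixed by the slides:

```python
import numpy as np

def vae_free_energy(x, x_recon_logits, mu, log_var, lam=1.0):
    """Per-example variational free energy with a re-weighted KL term:
       E_q[-log p(x|z)] + lam * KL(q(z|x) || N(0, I)).
    Bernoulli reconstruction and a diagonal-Gaussian q are modeling
    assumptions of this sketch."""
    # Bernoulli negative log-likelihood from logits (one MC sample of z).
    nll = np.sum(np.logaddexp(0.0, x_recon_logits) - x * x_recon_logits, axis=1)
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return nll + lam * kl

# Tiny usage example with made-up numbers:
x = np.array([[1.0, 0.0]])
logits = np.array([[2.0, -2.0]])
mu = np.zeros((1, 3))
log_var = np.zeros((1, 3))
print(vae_free_energy(x, logits, mu, log_var, lam=4.0))  # KL term is 0 here
```

With mu = 0 and log_var = 0 the encoder already matches the prior, so only the reconstruction term contributes; raising λ penalizes any departure of q_φ(z|x) from p*(z) more strongly, pushing q_φ toward a less local corruption process.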


Shaping Markov chain dynamics with ABC

Approximate Bayesian Computation is (roughly) based on training a generative model by minimizing some measure of dissimilarity between G and D.

Examples include moment matching (e.g. MMD) and classification-based methods (e.g. GANs by Goodfellow et al. and related work by Gutmann et al.).

For our models, we:

Train a guide function f to estimate log(D/G).

Move mass in G towards increasing f when f < 0.

Don't move G-mass emitted in regions where f > 0.

Show a global minimum occurs iff D(x) = G(x) for all x.³

³ The first three points, though vague, are sufficient for this to hold.
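One concrete (hypothetical) instantiation of the guide: a logistic classifier trained to separate samples of D from samples of G. With balanced classes, the optimal classifier's logit equals log(D(x)/G(x)). The Gaussian stand-ins for D and G, the feature map, and all hyperparameters below are illustrative, not the authors' choices:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for illustration: "data" D = N(0, 1) and "model" G = N(1.5, 1).
xd = rng.normal(0.0, 1.0, 4000)
xg = rng.normal(1.5, 1.0, 4000)

def feats(x):
    # simple polynomial features; sufficient for ratios of Gaussians
    return np.stack([np.ones_like(x), x, x**2], axis=1)

X = np.vstack([feats(xd), feats(xg)])
y = np.concatenate([np.ones(4000), np.zeros(4000)])  # 1 = came from D
w = np.zeros(3)
for _ in range(2000):  # plain full-batch logistic-regression gradient ascent
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p) / len(y)

def f(x):
    # guide: logit of P(D|x), which approximates log D(x)/G(x)
    return (feats(np.atleast_1d(float(x))) @ w)[0]

print(f(0.0) > 0, f(1.5) < 0)  # D-heavy region vs. G-heavy region
```

The sign of f then tells the shaping step which G-mass to move: f < 0 marks regions where G overshoots D, matching the second and third bullets above.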


Summarizing our model

Figure: (a) the top panel gives a schematic for optimizing variational free-energy; the bottom panel shows valid ABC loss functions for q/p chain samples based on f ≈ log(D/G). (b) the full graph for our model.


Results

We now briefly look at:

How re-weighting the free-energy KL-divergence term affects G.

How collaborative unrolling affects chain dynamics.

How our models perform on different quantitative tests.


MNIST – KLd comparison, Side-by-side

Figure: Side-by-side comparison of independent samples from VAEs trained with varying strengths of KL-divergence penalty: (a) KL weight λ = 1, squared-KL penalty 0; (b) λ = 4, penalty 0.1; (c) λ = 24, penalty 0.1. Scores on the GPDE test were (L-to-R): 220, 265, and 330.


Chain behavior with and without guided unrolling

Figure: Comparing chains generated by models learned with and without collaborative shaping. The samples in (a) were generated by a corrupt-reconstruct pair q_φ/p_θ trained for 100k updates as a variational auto-encoder (VAE), and then 100k updates as a 6-step unrolled, collaboratively-guided chain. Samples in (b) are from the same model but with 200k updates of standard VAE training.


Visualizing Markov chain dynamics – MNIST

(Figure: chain samples illustrating shaping and chain diversity, comparing no shaping vs. 6 steps of shaping.)


Visualizing Markov chain dynamics – TFD

(Figure: chain samples under the standard free-energy vs. strong KL regularization.)


Conclusion

Problem 1: if q_φ(z|x) is too local, the chain generated by T_θ(x_{t+1}|x_t) = (1/C) Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t) won't mix well between separated modes of D.

- We address this using modified KL terms in the VFE.

Problem 2: when unrolled, the q/p chain may visit z not seen while training, causing it to wander away from D.

- We address this using collaborative unrolling.

Problem 3: when q_φ(z|x) is non-local, p_θ(x|z) may need to capture sophisticated structure, e.g. multi-modality.

- Future work – construct q/p peu à peu – see our poster/talk at EWRL or our poster at the DL workshop.

Questions?


References I

[BTLAY14] Yoshua Bengio, Eric Thibodeau-Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. arXiv:1306.1091v5 [cs.LG], 2014.

[BYAV13] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. Advances in Neural Information Processing Systems (NIPS), 2013.

[GDKC14a] Michael U. Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Classifier ABC. MCMSki IV (posters), 2014.

[GDKC14b] Michael U. Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Likelihood-free inference via classification. arXiv:1407.4981v1 [stat.CO], 2014.


References II

[GPAM+14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS), 2014.

[KW14] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[RMW14] Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (ICML), 2014.


Graphical models for DAEs, GSNs, and Simple GSNs

Figure: (a) the Markov chain for a Generalized DAE: x̃_0 ∼ q_φ(x̃|x_0), x_1 ∼ p_θ(x|x̃_0), and so on. (b) the Markov chain for a GSN: z_0 ∼ q_φ(z), x_0 ∼ p_θ(x|z_0), z_1 ∼ q_φ(z|x_0, z_0), and so on. (c) the Markov chain for a Simple GSN using a corruption process W(z|x; p_θ, q_φ) formed via the walkback procedure: x_0 ∼ D(x), z_0 ∼ W(z|x_0; p_θ, q_φ), x_1 ∼ p_θ(x|z_0), and so on.


Wrapping p_θ and q_φ via Walkback

Input: data sample x, corruptor q_φ, reconstructor p_θ
Initialize an empty training-pair list P_xz = {}
Set z to some initial vector in Z.
for i = 1 to k_burn-in do
    Sample z ∼ q_φ(z|x, z) (update z with x held fixed at the data sample).
end for
for i = 1 to k_roll-out do
    Sample z ∼ q_φ(z|x, z), then sample x ∼ p_θ(x|z).
    Add pair (x, z) to P_xz.
end for
Return: P_xz
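A minimal Python rendering of the procedure above, with toy Gaussian stand-ins for q_φ and p_θ (in the real model these would be learned networks, and x, z would be vectors rather than scalars):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for q_phi(z|x, z_prev) and p_theta(x|z); illustrative only.
def q_sample(x, z_prev):
    return 0.5 * (x + z_prev) + rng.normal(0.0, 0.3)

def p_sample(z):
    return z + rng.normal(0.0, 0.3)

def walkback_pairs(x, q_sample, p_sample, k_burn_in=3, k_roll_out=5):
    """Wrap q/p into a corruption process via walkback: burn in z around
    the clamped input x, then unroll the free-running chain, recording
    (x, z) training pairs along the way."""
    pairs = []
    z = 0.0  # some initial vector in Z
    for _ in range(k_burn_in):   # burn-in: x stays clamped to the data
        z = q_sample(x, z)
    for _ in range(k_roll_out):  # roll-out: the chain runs freely
        z = q_sample(x, z)
        x = p_sample(z)
        pairs.append((x, z))
    return pairs

pairs = walkback_pairs(1.0, q_sample, p_sample)
print(len(pairs))  # 5
```

The returned pairs come from states the unrolled chain actually visits, which is what lets walkback training remove spurious attractors away from D.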


GSN theory – 1

Theorem

If p_θ(x|z) is a consistent estimator of the true conditional distribution P(x|z), and the transition operator T_θ(x_{t+1}|x_t) that samples z_t from q_φ(z_t|x_t) and then samples x_{t+1} from p_θ(x_{t+1}|z_t) defines an ergodic Markov chain, then as the number of examples used to train p_θ(x|z) goes to infinity (i.e. as p_θ(x|z) converges to P(x|z)), the asymptotic distribution of the Markov chain with transition operator T_θ(x_{t+1}|x_t) converges to the target distribution D.

– Theorem modified slightly from "Generalized Denoising Auto-encoders as Generative Models" by Bengio et al., NIPS 2013.


GSN theory – 2

Corollary

Let X be a set in which every pair of points is connected by a finite-length path contained in X. Suppose that for each x ∈ X there exists a "shell" set S_x ⊆ X such that all paths between x and any point in X \ S_x pass through some point in S_x whose shortest path to x has length > 0. Suppose that for all x and for all x′ ∈ S_x ∪ {x}, there exists z_xx′ such that q_φ(z_xx′|x) > 0 and p_θ(x′|z_xx′) > 0. Then the Markov chain with transition operator T_θ(x_{t+1}|x_t) = Σ_z p_θ(x_{t+1}|z) q_φ(z|x_t) is ergodic.

– Corollary modified slightly from "Deep Generative Stochastic Networks Trainable by Backprop" by Bengio et al., ICML 2014.


Variational Free-energy – 1

Assume we have the distributions p_θ(x|z), p*(z), and q_φ(z|x). Then we define the following derived distributions:

p_θ(x; p*) = Σ_z p_θ(x|z) p*(z)    (1)

p_θ(z|x; p*) = p_θ(x|z) p*(z) / p_θ(x; p*)    (2)

p_θ(x, z; p*) = p_θ(x|z) p*(z) = p_θ(z|x; p*) p_θ(x; p*)    (3)


Variational Free-energy – 2.0

log p_θ(x; p*) = Σ_z q_φ(z|x) log p_θ(x; p*)    (4)

= Σ_z q_φ(z|x) log [ p_θ(z|x; p*) p_θ(x; p*) / p_θ(z|x; p*) ]    (5)

= Σ_z q_φ(z|x) log [ p_θ(x, z; p*) / p_θ(z|x; p*) ]    (6)

= Σ_z q_φ(z|x) ( log p_θ(x, z; p*) − log q_φ(z|x) + log q_φ(z|x) − log p_θ(z|x; p*) )    (7)
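Grouping the terms in (7) gives the usual decomposition into a variational free-energy plus a non-negative KL gap. This completing step (and the symbol F for the free-energy) is implied by the surrounding slides rather than shown explicitly:

```latex
\log p_\theta(x; p^*)
  = \underbrace{\textstyle\sum_z q_\phi(z|x)\,
      \big(\log p_\theta(x,z;p^*) - \log q_\phi(z|x)\big)}_{-\,\mathcal{F}(x;\,\theta,\phi)}
  \;+\; \underbrace{\mathrm{KL}\!\big(q_\phi(z|x)\,\big\|\,p_\theta(z|x;p^*)\big)}_{\geq\, 0}
```

Since p_θ(x, z; p*) = p_θ(x|z) p*(z) by (3), the free-energy expands as F(x; θ, φ) = E_{z∼q_φ(z|x)}[−log p_θ(x|z)] + KL(q_φ(z|x) ‖ p*(z)), i.e. the λ = 1 case of the training objective above, so minimizing F maximizes a lower bound on log p_θ(x; p*).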


MNIST – Weak KLd penalty

Figure: Models trained with KL weight λ = 1 and squared-KL penalty 0.0. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.


MNIST – Medium KLd penalty

Figure: Models trained with KL weight λ = 4 and squared-KL penalty 0.1. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.


MNIST – Strong KLd penalty

Figure: Models trained with KL weight λ = 24 and squared-KL penalty 0.1. The model in (a) was trained only as a VAE, and the model in (b) was trained as a VAE and then as an unrolled Markov chain with collaborative guidance.


Training progress comparison – MNIST

(Samples shown after 30k, 60k, and 120k training updates.)


Training progress comparison – TFD

(Samples shown after 50k, 100k, and 150k training updates.)


Multi-test – TFD



Multi-test – MNIST

