TRANSCRIPT
Deep Generative Models I: Variational Autoencoders – Part II
Machine Perception
Otmar Hilliges
23 April 2020
Last week(s)
• Intro to generative modelling
• VAE
• Derivation of the ELBO
4/23/2020 2
This week
More generative models
Step I: Variational autoencoders
• Training
• Representation learning & disentanglement
• Applications
Step II: Generative adversarial networks
Generative Modelling
Given training data, generate new samples drawn from the "same" distribution.
We want p_model(x) to be similar to p_data(x).
Training samples ~ p_data(x)    Generated samples ~ p_model(x)
Variational autoencoder
[Diagram: inference model maps x to latent code z; generative model maps z back to x]
Recap: ELBO

log p_θ(x^(i)) ≥ E_z[ log p_θ(x^(i) | z) ] − D_KL( q_φ(z | x^(i)) || p_θ(z) ) = L(x^(i), θ, φ)

[Diagram: encoder network q_φ(z|x) outputs μ_{z|x}, Σ_{z|x}; sample z|x ~ N(μ_{z|x}, Σ_{z|x}); decoder network p_θ(x|z) outputs μ_{x|z}, Σ_{x|z}; sample x̂|z ~ N(μ_{x|z}, Σ_{x|z})]

KL term: make the approximate posterior similar to the prior.
Likelihood term: maximize the reconstruction likelihood.
Training VAEs
Training of VAEs is just backprop.
But wait, how do gradients flow through π§?
Or any other random operation (requiring sampling)?
(see blackboard)
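The blackboard step is the reparameterization trick: rewrite z ~ N(μ, σ²) as z = μ + σ·ε with ε ~ N(0, I), so the sample becomes a deterministic function of μ and σ and gradients can flow through it. A minimal numpy sketch with a made-up toy loss L(z) = z², comparing the analytic gradient with a finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 0.5, -1.0
eps = rng.standard_normal()  # noise drawn once, outside the "computation graph"

def z_of(mu):
    # Reparameterization: z = mu + sigma * eps is deterministic in mu,
    # so dz/dmu = 1 and gradients flow through the sampled z.
    return mu + np.exp(0.5 * log_var) * eps

# Toy loss L(z) = z^2  =>  dL/dmu = 2 z * dz/dmu = 2 z
z = z_of(mu)
grad_analytic = 2.0 * z

# Central finite differences with eps held fixed should agree
h = 1e-6
grad_numeric = (z_of(mu + h) ** 2 - z_of(mu - h) ** 2) / (2 * h)
```

Sampling ε outside the graph is exactly what makes backprop through the stochastic layer possible; sampling z directly would leave no differentiable path to μ and σ.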
Generating Data
[Diagram: sample z ~ N(0, I); decoder network p_θ(x|z) outputs μ_{x|z}, Σ_{x|z}; sample x̂|z ~ N(μ_{x|z}, Σ_{x|z})]

At runtime only use the decoder network; sample z from the prior: z ~ N(0, I).
[Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014]
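The test-time pipeline can be sketched in a few lines; the tiny affine-plus-sigmoid map below is a made-up stand-in for a trained decoder network p_θ(x|z), not an actual trained model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "trained" decoder: affine map + sigmoid, producing the
# mean of p(x|z), e.g. 784 pixel intensities for a 28x28 image.
W = rng.standard_normal((784, 2)) * 0.1
b = np.zeros(784)

def decode(z):
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))

# At runtime the encoder is discarded: sample z from the prior N(0, I)
z = rng.standard_normal(2)
x_mean = decode(z)  # a generated sample (decoder mean)
```

Different draws of z from the prior produce different samples, which is all that generation requires once training is done.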
Cross-modal Deep Variational Hand Pose Estimation
Adrian Spurr, Jie Song, Seonwook Park, Otmar Hilliges
CVPR 2018
RGB hand pose estimation
Assumption: Manifold of valid hand poses
Adapted from [Tagliasacchi et al. 2015]
Cross-modal latent space β Inference model
Cross-modal latent space β Generative model
[Spurr et al.: CVPR '18]
Cross-modal training objective
The original VAE considers only one modality → we need to modify it to consider multiple modalities.
Consider x_i and x_t, samples from different modalities (e.g. RGB and 3D joint positions).
Important: both inherently describe the same thing, in our case the hand pose.
Task: maximize log p_θ(x_t) under x_i.
[Spurr et al.: CVPR β18]
Cross-modal training objective derivation
log p(x_t) = ∫_z q(z|x_i) log p(x_t) dz
           = ∫_z q(z|x_i) log [ p(x_t|z) p(z) / p(z|x_t) ] dz                                              (Bayes' rule)
           = ∫_z q(z|x_i) log [ q(z|x_i) / p(z|x_t) ] dz + ∫_z q(z|x_i) log [ p(x_t|z) p(z) / q(z|x_i) ] dz   (logarithms)
           = D_KL( q(z|x_i) || p(z|x_t) ) + ∫_z q(z|x_i) log p(x_t|z) dz − ∫_z q(z|x_i) log [ q(z|x_i) / p(z) ] dz   (definition of KL)
           ≥ ∫_z q(z|x_i) log p(x_t|z) dz − ∫_z q(z|x_i) log [ q(z|x_i) / p(z) ] dz                         (KL ≥ 0; log x/y = −log y/x)
           = E_{z~q(z|x_i)}[ log p(x_t|z) ] − D_KL( q(z|x_i) || p(z) )                                      (E over z, definition of KL)
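The bound can be sanity-checked numerically on a toy linear-Gaussian model where log p(x_t) has a closed form; all distributions and numbers here are illustrative choices, not values from the paper:

```python
import numpy as np

def log_normal(x, mean, var):
    # log density of a univariate Gaussian N(mean, var) at x
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model: z ~ N(0, 1), x_t | z ~ N(z, 1)  =>  marginal p(x_t) = N(0, 2)
x_t = 1.3
log_px = log_normal(x_t, 0.0, 2.0)

# Encoder of the *other* modality: q(z|x_i) = N(m, s2), chosen arbitrarily
m, s2 = 0.4, 0.7

# E_q[log p(x_t|z)] in closed form for the unit-variance Gaussian likelihood:
# E[(x_t - z)^2] = (x_t - m)^2 + s2
recon = -0.5 * (np.log(2 * np.pi) + (x_t - m) ** 2 + s2)
# KL( N(m, s2) || N(0, 1) ) in closed form
kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))

elbo = recon - kl  # must lower-bound log p(x_t)
```

Because q conditions on the other modality x_i, the same check would hold for any valid q; the bound only tightens when q(z|x_i) approaches the true posterior p(z|x_t).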
[Spurr et al.: CVPR β18]
t-SNE embedding comparisons: trained individually vs. trained cross-modal [Spurr et al.: CVPR '18]
Posterior estimates: RGB (real) to 3D joints. RGB input / ground truth / prediction [Spurr et al.: CVPR '18]
Synthetic hands: Latent Space Walk
[Synthetic samples RGB] [Synthetic samples depth]
[Spurr et al.: CVPR '18]
Learning useful representations
Goal: Learn features that capture similarities and dissimilarities.
Requirement: An objective that defines a notion of utility (task-dependent).
Autoencoders
[Diagram: X → f → Z → g → X̂]
Notion of utility: Ability to reconstruct pixels from code (features are a compressed representation of original data)
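A minimal sketch of this idea, with hypothetical linear maps standing in for the encoder f and decoder g (untrained; just the shapes and the objective):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear autoencoder: f compresses 8-D inputs to a 2-D code Z,
# g tries to reconstruct the input from the code.
X = rng.standard_normal((100, 8))
W_enc = rng.standard_normal((8, 2)) * 0.1   # encoder f
W_dec = rng.standard_normal((2, 8)) * 0.1   # decoder g

Z = X @ W_enc        # code: compressed representation of the data
X_hat = Z @ W_dec    # reconstruction from the code

# Training would minimize this reconstruction error over W_enc, W_dec
recon_error = np.mean((X - X_hat) ** 2)
```

The bottleneck (2 dimensions for 8-D data) is what forces the code to be a compressed representation; without it, f and g could simply copy the input.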
Autoencoders
[Diagram: X → f → Z]
Entangled Representation: Individual dimensions in code encode some unknown combination of features in the data.
Learned representation
Goal: Disentangled Representations
[Diagram: input → encoder → factorized code, with separate dimensions for digit and style]
Goal: Learn features that correspond to distinct factors of variation.
One notion of utility: statistical independence.
Regular vs Variational Autoencoders
AE (z-dim 50, t-SNE)    VAE (z-dim 50, t-SNE)
Issue: the KL divergence regularizes the values of z, but representations are still entangled.
Unsupervised vs Semi-supervised
[image source: Babak Esmaeli]
[Kingma et al. Semi-Supervised Learning with Deep Generative Models. NIPS 2014]
[Diagram: latent code factorized into digit label and style]
Separate the label y from the "nuisance" variable z.
Learning factorized representations (unsupervised)
[image source: Emile Mathieu]
β-VAE
Goal: Learn disentangled representation without supervision
Idea: Provide a framework for the automated discovery of interpretable, factorised latent representations.
Approach: A modification of the VAE that introduces an adjustable hyperparameter β, balancing latent channel capacity and independence constraints against reconstruction accuracy.
[Higgins et al., "Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR 2017]
β-VAE [Higgins et al., "Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR 2017]
β-VAE
[Higgins et al., "Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR 2017]
Assume that samples x are generated by controlling the generative factors v (conditionally independent) and w (conditionally dependent).
β-VAE – Training Objective [Higgins et al., "Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR 2017]

max_{φ,θ} E_{x~D} E_{z~q_φ(z|x)}[ log p_θ(x|z) ]
subject to D_KL( q_φ(z|x) || p(z) ) < δ
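In practice this constrained problem is optimized through a β-weighted loss. A small numpy sketch of that loss, using the standard closed-form KL between a diagonal Gaussian and N(0, I); the μ, log-variance and likelihood values below are arbitrary illustrations:

```python
import numpy as np

def beta_vae_loss(recon_log_lik, mu, log_var, beta):
    """Negative beta-VAE objective: -E[log p(x|z)] + beta * KL(q(z|x) || N(0, I)).

    KL between N(mu, diag(exp(log_var))) and N(0, I) in closed form.
    """
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return -recon_log_lik + beta * kl

# beta = 1 recovers the vanilla VAE objective; beta > 1 strengthens the
# pressure toward the factorized unit-Gaussian prior at the cost of
# reconstruction accuracy.
mu = np.array([0.5, -0.2])
log_var = np.array([-0.1, 0.3])
loss_vae = beta_vae_loss(-10.0, mu, log_var, beta=1.0)
loss_beta = beta_vae_loss(-10.0, mu, log_var, beta=4.0)
```

Since the KL term is non-negative, increasing β can only increase the loss for a fixed encoder output, which is exactly the extra regularization pressure β-VAE trades for disentanglement.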
β-VAE – Training Objective [Higgins et al., "Learning Basic Visual Concepts with a Constrained Variational Framework", ICLR 2017]

Rewriting the constrained objective as a Lagrangian gives the β-VAE loss:
L(θ, φ; x, z, β) = E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − β D_KL( q_φ(z|x) || p(z) )
Entangled versus Disentangled Representations
β-VAE vs. vanilla VAE
Disentangling Latent Hands for Image Synthesis and Pose Estimation
Linlin Yang, Angela Yao
CVPR 2019
Disentangling latent factors
[Yang et al.: CVPR '19]
Disentangling latent factors
[Yang et al.: CVPR '19]
L(θ_x, θ_y1, θ_y2, φ_x, φ_y1, φ_y2) = ELBO(x, y1, y2, φ_y1, φ_y2, θ_x, θ_y1, θ_y2) + ELBO(x, y1, y2, φ_x)

where

log p(x, y1, y2) ≥ ELBO(x, y1, y2, φ_y1, φ_y2, θ_x, θ_y1, θ_y2)
  = λ_x E_{z~q_{φ_y1,φ_y2}}[ log p_{θ_x}(x|z) ]
  + λ_y1 E_{z_y1~q_{φ_y1}}[ log p_{θ_y1}(y1|z_y1) ]
  + λ_y2 E_{z_y2~q_{φ_y2}}[ log p_{θ_y2}(y2|z_y2) ]
  − β D_KL( q_{φ_y1,φ_y2}(z|y1, y2) || p(z) )

and

ELBO(x, y1, y2, φ_x)
  = λ'_x E_{z~q_{φ_x}}[ log p_{θ_x}(x|z) ]
  + λ'_y1 E_{z_y1~q_{φ_x}}[ log p_{θ_y1}(y1|z_y1) ]
  + λ'_y2 E_{z_y2~q_{φ_x}}[ log p_{θ_y2}(y2|z_y2) ]
  − β' D_KL( q_{φ_x}(z|x) || p(z) )
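The weighting structure of this objective reduces to simple arithmetic: per-modality λ weights on the reconstruction log-likelihoods minus a β-weighted KL to the prior. A sketch in which every number and weight is made up for illustration:

```python
def weighted_elbo(log_lik_terms, lambdas, kl, beta):
    # Weighted sum of reconstruction log-likelihoods, one per modality,
    # minus a beta-weighted KL between the encoder posterior and the prior.
    recon = sum(lam * ll for lam, ll in zip(lambdas, log_lik_terms))
    return recon - beta * kl

# Toy numbers: image (x) and two label modalities (y1, y2)
elbo = weighted_elbo(
    log_lik_terms=[-120.0, -3.0, -5.0],  # log p(x|z), log p(y1|z_y1), log p(y2|z_y2)
    lambdas=[1.0, 0.1, 0.1],
    kl=2.5,
    beta=4.0,
)
```

The λ weights rebalance modalities whose likelihoods live on very different scales (pixels vs. a handful of joint coordinates), while β plays the same disentangling role as in β-VAE.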
Latent space walk with disentangled z
Latent space walk with disentangled z
Limitations of VAEs
Tendency to generate blurry images.
Believed to be due to injected noise and weak inference models (the Gaussian assumption on latent samples is too simplistic, the model capacity too weak).
More expressive models → substantially better results (e.g. Kingma et al. 2016, "Improving Variational Inference with Inverse Autoregressive Flow").
Next
Deep Generative Modelling II:
Generative Adversarial Networks (GANs)