TRANSCRIPT
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli. Proceedings of the 32nd International Conference on Machine Learning, 2015
Tran Quoc Hoan, Paper alert @ 2015-11-16
Introduction
• Abstract: "…The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data…"
Outline
• Motivation
– The promise of deep unsupervised learning
• Physical intuition
– Diffusion processes and time reversal
• Diffusion probabilistic model
– Derivation and experimental results
Deep Unsupervised Learning
• Unknown features/labels
– Novel modalities
– Exploratory data analysis
• Expensive labels
• Unpredictable tasks / one-shot learning
Physical Intuition
• Diffusion processes and time reversal
– Destroy structure in data
– Carefully characterize the destruction
– Learn how to reverse time
Observation 1: Diffusion Destroys Structure
[Figure: forward diffusion carries the data distribution to a uniform distribution; reversing the dynamics recovers structure]
Recover the data distribution by starting from the uniform distribution and running the dynamics backwards.
Observation 2: Microscopic Diffusion
• Time reversible
• Brownian motion
• Position updates are small Gaussians (both forwards and backwards in time)
https://www.youtube.com/watch?v=cDcprgWiQEY
Diffusion-based Probabilistic Models
• Destroy all structure in the data distribution using a diffusion process
• Learn the reversal of the diffusion process
– Estimate a function for the mean and covariance of each step in the reverse diffusion process (e.g. the binomial rate for binary data)
• The reverse diffusion process is the model of the data (see the sampling sketch below)
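A minimal sketch of what sampling from such a model could look like, assuming a Gaussian diffusion; `mu_model` (the trained reverse-mean function) and `sigmas` (per-step standard deviations) are hypothetical placeholders, not the paper's code:

```python
# Minimal sketch, not the paper's implementation: generate data by running
# the learned reverse diffusion chain from noise back towards data.
import numpy as np

def sample(mu_model, sigmas, shape, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # start at the noise distribution: x_T ~ N(0, I)
    for t in range(T, 0, -1):
        mean = mu_model(x, t)           # learned drift (reverse mean) for step t
        x = mean + sigmas[t] * rng.standard_normal(shape)
    return x                            # approximate draw from the data distribution
```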
Diffusion-based Probabilistic Models
• Algorithm
• Multiplying distributions: imputation, denoising, computing posteriors
• Deep convolutional network: universal function approximator
Destroy by Diffusion Process
[Figure: forward diffusion carries the data distribution to the noise distribution; the temporal diffusion rate is βt]
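Reconstructed from the paper's definitions: starting from the data distribution q(x^(0)), the forward trajectory applies a Markov diffusion kernel repeatedly,

\[
q\left(x^{(0 \cdots T)}\right) = q\left(x^{(0)}\right) \prod_{t=1}^{T} q\left(x^{(t)} \mid x^{(t-1)}\right)
\]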
Destroy by Gaussian Diffusion Process
[Figure: forward Gaussian diffusion from the data distribution to the noise distribution]
Each step decays the signal towards the origin and adds a small amount of noise.
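For the Gaussian case, the paper's forward kernel shrinks the signal towards the origin by a factor of √(1 − βt) and adds noise of variance βt:

\[
q\left(x^{(t)} \mid x^{(t-1)}\right) = \mathcal{N}\left(x^{(t)};\; x^{(t-1)}\sqrt{1-\beta_t},\; \beta_t \mathbf{I}\right)
\]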
Reverse Gaussian Diffusion Process
[Figure: reverse diffusion from the noise distribution back to the data distribution]
The drift and covariance functions of the reverse process are learned.
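The reverse kernel has the same functional form as the forward one; its mean and covariance are given by the learned functions:

\[
p\left(x^{(t-1)} \mid x^{(t)}\right) = \mathcal{N}\left(x^{(t-1)};\; f_\mu\left(x^{(t)}, t\right),\; f_\Sigma\left(x^{(t)}, t\right)\right)
\]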
Training the Reverse Diffusion
• Model probability: evaluated by averaging over forward trajectories, via annealed importance sampling / the Jarzynski equality (reconstructed below).
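Reconstructed from the paper: the model probability integrates over all trajectories, and rewriting it as an average over forward trajectories makes it tractable, the same trick used in annealed importance sampling and the Jarzynski equality:

\[
p\left(x^{(0)}\right) = \int dx^{(1 \cdots T)}\, q\left(x^{(1 \cdots T)} \mid x^{(0)}\right)\, p\left(x^{(T)}\right) \prod_{t=1}^{T} \frac{p\left(x^{(t-1)} \mid x^{(t)}\right)}{q\left(x^{(t)} \mid x^{(t-1)}\right)}
\]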
Training the Reverse Diffusion
…for the Gaussian diffusion process…
• Training turns unsupervised learning into regression: each reverse kernel is Gaussian, so fitting the model reduces to regressing the learned mean and covariance functions against the tractable forward posterior at every step (a sketch follows below).
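A minimal NumPy sketch of the regression view, assuming a linear β schedule (an assumption, not the paper's learned schedule); `mu_model` is a hypothetical learned mean function, and the closed forms follow from the Gaussian forward kernel:

```python
# Minimal sketch: the per-step regression target for the reverse mean.
import numpy as np

T = 1000
betas = np.concatenate(([0.0], np.linspace(1e-4, 0.02, T)))  # betas[0] is a dummy
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # alpha_bar[t] = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0), the closed form of the forward Gaussian chain."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def q_posterior_mean(x0, xt, t):
    """Mean of q(x_{t-1} | x_t, x_0): the regression target for the learned mean."""
    coef0 = np.sqrt(alpha_bar[t - 1]) * betas[t] / (1.0 - alpha_bar[t])
    coeft = np.sqrt(alphas[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    return coef0 * x0 + coeft * xt

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 32))      # a toy data batch
t = int(rng.integers(1, T + 1))         # random step in [1, T]
xt = q_sample(x0, t, rng)
target = q_posterior_mean(x0, xt, t)
# A trained model would minimize np.mean((mu_model(xt, t) - target) ** 2):
# plain regression, one diffusion step at a time.
```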
Training the Reverse Diffusion
Setting the diffusion rate βt
• For Gaussian diffusion: β1 is fixed to a small constant (to prevent over-fitting), and the remaining βt are trained by gradient ascent on the log-likelihood bound.
• For binomial diffusion: erase a constant fraction of the stimulus variance each step; in the paper, βt = 1/(T - t + 1) (a quick numerical check follows below).
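A small numerical check (a sketch, not from the slides) that this schedule removes a constant fraction of the remaining structure and destroys everything by step T:

```python
# beta_t = 1/(T - t + 1): the product of (1 - beta_t) telescopes to zero.
T = 10
betas = [1.0 / (T - t + 1) for t in range(1, T + 1)]
remaining = 1.0
for b in betas:
    remaining *= 1.0 - b    # (T-1)/T * (T-2)/(T-1) * ... * 0/1
print(betas[0], betas[-1], remaining)   # 0.1 1.0 0.0
```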
Multiplying Distributions
Interested in the product distribution p̃(x^(0)) ∝ p(x^(0)) r(x^(0)).
• Required to compute posterior distributions
– Missing data (inpainting)
– Corrupted data (denoising)
• Difficult and expensive with competing techniques (e.g. VAEs, GSNs, NADEs, most graphical models)
The second factor r acts as a small perturbation to the diffusion process.
Multiplying Distributions (cont.)
• Modified marginal distributions: each intermediate distribution in the chain is multiplied by a corresponding factor, p̃(x^(t)) ∝ p(x^(t)) r(x^(t)), so that r acts as a small perturbation to the diffusion process.
Multiplying Distributions (cont.)
• Modified diffusion steps: the perturbed reverse kernel must satisfy the equilibrium condition p̃(x^(t)) = ∫ dx^(t+1) p̃(x^(t) | x^(t+1)) p̃(x^(t+1)); the corresponding normalized distribution is reconstructed below.
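Reconstructed from the paper's treatment, with Z̃ denoting the normalizing constants: the modified marginals and the corresponding normalized reverse kernel are

\[
\tilde{p}\left(x^{(t)}\right) = \frac{1}{\tilde{Z}_t}\, p\left(x^{(t)}\right) r\left(x^{(t)}\right), \qquad
\tilde{p}\left(x^{(t)} \mid x^{(t+1)}\right) = \frac{1}{\tilde{Z}_t\left(x^{(t+1)}\right)}\, p\left(x^{(t)} \mid x^{(t+1)}\right) r\left(x^{(t)}\right)
\]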
Multiplying Distributions: Reverse Gaussian Diffusion Process
For the Gaussian reverse diffusion, a sufficiently smooth perturbation r affects only the mean of each reverse step, not the covariance (reconstructed below).
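Reconstructed from the paper: writing μ and Σ for the mean and covariance of an unperturbed reverse step, a sufficiently smooth r shifts only the mean,

\[
\tilde{\mu} = \mu + \Sigma \left.\frac{\partial \log r\left(x^{(t)}\right)}{\partial x^{(t)}}\right|_{x^{(t)} = \mu}
\]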
Applied to CIFAR-10
[Figure: training data; samples from a generative adversarial network [Goodfellow et al, 2014]; samples from the diffusion model]
Applied to CIFAR-10 (cont.)
[Figure: samples from DRAW [Gregor et al, 2015]; samples from a generative adversarial network [Goodfellow et al, 2014]; samples from the diffusion model]
Applied to Dead Leaves
[Figure: training data; samples from [Theis et al, 2012], log likelihood 1.24 bits/pixel; samples from the diffusion model, log likelihood 1.49 bits/pixel]