TRANSCRIPT
A Review of Deep Models for Density Estimation
Andrew McHutchon & Mark van der Wilk
Thursday 13 March, 2014
Introduction
- Justifying deep architectures (intuition and a theorem)
- Where it started: Deep RBMs
- The Bayesian approach: Deep GPs
- A different approach: Deep Density Models
- A hybrid: Deep Latent Gaussian Models
- Tying it together
Justifying deep architectures
- Some functions cannot be "efficiently" represented by shallow architectures
- Shallow architectures generalise locally, while deep representations allow more global generalisation
- Multi-task learning: low-level features can be useful for several tasks
- Empirical performance
[Bengio, 2009]
Efficiency: Linear threshold networks
Theorem [Hastad and Goldmann, 1991]. A monotone weighted threshold circuit of depth $k-1$ computing a function $f_k \in \mathcal{F}_{k,N}$ has size at least $2^{cN}$ for some constant $c > 0$ and $N > N_0$. A depth-$k$ circuit needs only polynomial size.

- The proof is limited to one type of NN and one class of functions... but it does hint that there can be some advantage to using deep architectures.
- Is this relevant to "infinite capacity" non-parametric models (GPs)?
Generalisation
Bengio: Shallow architectures (kernel learning) generalise locally, while deep representations allow more global generalisation.

To a certain extent, this is a simplification of something we already know.

- Bengio's point: the Squared Exponential kernel is very widespread in use, but provides only local generalisation.
- Good features can greatly improve generalisation.
- A feature mapping can be incorporated into the kernel (like the periodic kernel; see the sketch below).
- Finding good features/kernels is hard.
- Learning deep representations may be a way of doing this.
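As a concrete illustration of how a feature mapping can be folded into a kernel: the periodic kernel is exactly the Squared Exponential kernel applied to a sine/cosine feature map. A minimal numpy sketch; function names, lengthscales and inputs are illustrative, not from the talk:

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel on (possibly feature-mapped) inputs of shape (N, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def periodic_features(x, period=1.0):
    """Map scalar inputs onto the unit circle; SE on this map gives the periodic kernel."""
    u = 2 * np.pi * x / period
    return np.column_stack([np.sin(u), np.cos(u)])

x = np.linspace(0, 3, 5)
K_local = se_kernel(x[:, None], x[:, None])                          # plain SE: local generalisation
K_periodic = se_kernel(periodic_features(x), periodic_features(x))   # SE composed with a feature map
```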
Deep Belief Networks
Belief net:
- Directed acyclic graph
- Sigmoid belief net [Neal (1992)]
- Binary nodes
- $p(\mathbf{y}, \mathbf{h}) = \prod_{i=1}^{D} \mathrm{Ber}\big(y_i \mid \sigma(\mathbf{w}_i^T \mathbf{h})\big) \prod_{j=1}^{E} \mathrm{Ber}(h_j \mid w_{2j})$

Restricted Boltzmann machine:
- Undirected bipartite graph
- Binary nodes
- $p(\mathbf{y}, \mathbf{h}) = \frac{1}{Z} \exp\big(\mathbf{h}^T W \mathbf{y} + \mathbf{a}^T \mathbf{y} + \mathbf{b}^T \mathbf{h}\big)$ (a sketch follows below)
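A minimal numpy sketch of the RBM joint above and of one block-Gibbs sweep, which the bipartite structure makes cheap because each conditional factorises; the dimensions, random weights and function names are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 6, 4                      # visible and hidden dimensions (illustrative)
W = rng.normal(size=(E, D))      # weights
a = rng.normal(size=D)           # visible biases
b = rng.normal(size=E)           # hidden biases

def unnormalised_log_p(y, h):
    """log of the numerator exp(h^T W y + a^T y + b^T h); Z is intractable in general."""
    return h @ W @ y + a @ y + b @ h

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One block-Gibbs step, exploiting the bipartite structure:
y = rng.integers(0, 2, size=D)
h = (rng.random(E) < sigmoid(W @ y + b)).astype(int)    # p(h | y) factorises
y = (rng.random(D) < sigmoid(W.T @ h + a)).astype(int)  # p(y | h) factorises
```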
Deep Belief Networks
Deep directed net: training is very difficult due to the "explaining away" effect.

- Posterior on hidden nodes does not factorise
- MCMC: slow
- Variational: too approximate
Deep Belief Networks
Deep Belief net:
- Hinton, Osindero & Teh, Neural Computation (2006)
- An RBM acts as a "complementary prior" to remove "explaining away"
Deep Belief Networks
Training:
1. Iteratively train each layer in a greedy manner (sketched below)
2. Fine tuning with an up-down algorithm on a subset of the data
3. Cross-validation step
4. Further training
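A rough sketch of step 1 only (greedy layer-wise pre-training), using CD-1 updates with biases omitted for brevity. This is an illustrative assumption about the procedure, not the authors' code, and the up-down fine-tuning steps are not shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm_cd1(data, n_hidden, n_epochs=10, lr=0.05, rng=np.random.default_rng(0)):
    """Contrastive-divergence (CD-1) training of one RBM layer; a sketch, not tuned."""
    n, d = data.shape
    W = 0.01 * rng.normal(size=(d, n_hidden))
    for _ in range(n_epochs):
        h_prob = sigmoid(data @ W)                               # up pass
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_samp @ W.T)                          # down pass (reconstruction)
        h_recon = sigmoid(v_recon @ W)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n    # CD-1 weight update
    return W, sigmoid(data @ W)

def greedy_pretrain(data, layer_sizes):
    """Step 1 of the recipe: train each layer greedily on the previous layer's activities."""
    weights, activities = [], data
    for size in layer_sizes:
        W, activities = train_rbm_cd1(activities, size)
        weights.append(W)
    return weights           # fine-tuning (up-down algorithm) would follow
```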
Deep Gaussian Processes
- Gaussian Process Latent Variable Model [Lawrence (2004)]
- Bayesian Gaussian Process Latent Variable Models [Titsias and Lawrence (2010)]
- Deep Gaussian Processes [Damianou and Lawrence (2013)]
Deep Gaussian Processes
[Graphical model: $Z \to F^X \to X \to F^Y \to Y$, with a GP mapping plus noise ($\epsilon_x$, $\epsilon_y$) at each layer]

$F^X \sim \mathcal{GP}(0, k_x(Z, Z))$   (1)
$x_n = f^x_n + \epsilon^x_n$   (2)
$F^Y \sim \mathcal{GP}(0, k_x(X, X))$   (3)
$y_n = f^y_n + \epsilon^y_n$   (4)
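A small numpy sketch of drawing one sample from the two-layer prior in equations (1)-(4); the Squared Exponential kernel, noise levels and input grid are chosen arbitrarily for illustration:

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on inputs of shape (N, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

rng = np.random.default_rng(0)
N = 50
Z = np.linspace(-3, 3, N)[:, None]           # top-layer inputs

# Layer 1: F^X ~ GP(0, k_x(Z, Z)),  x_n = f^x_n + eps^x_n
K_x = se_kernel(Z, Z) + 1e-8 * np.eye(N)
f_x = rng.multivariate_normal(np.zeros(N), K_x)
X = (f_x + 0.05 * rng.normal(size=N))[:, None]

# Layer 2: F^Y ~ GP(0, k(X, X)),  y_n = f^y_n + eps^y_n
K_y = se_kernel(X, X) + 1e-8 * np.eye(N)
f_y = rng.multivariate_normal(np.zeros(N), K_y)
Y = f_y + 0.05 * rng.normal(size=N)
```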
Deep Gaussian Processes
$\log p(Y) = \log \int p(Y \mid F^Y)\, p(F^Y \mid X)\, p(X \mid F^X)\, p(F^X \mid Z)\, p(Z)$   (5)

$\mathcal{F} = \int Q \log \frac{p(Y \mid F^Y)\, p(F^Y \mid X)\, p(X \mid F^X)\, p(F^X \mid Z)\, p(Z)}{Q}$   (6)

$\mathcal{F} \leq \log p(Y)$   (7)
Deep Gaussian Processes
[Graphical model: as on the previous slide, augmented with inducing variables $U^X$ and $U^Y$ for the two GP layers]

$p(F^X \mid Z) = p(F^X \mid U^X, Z)\, p(U^X \mid Z)$   (8)
$p(F^Y \mid X) = p(F^Y \mid U^Y, X)\, p(U^Y \mid X)$   (9)
Deep Gaussian Processes
$Q = p(F^Y \mid U^Y, X)\, q(U^Y)\, q(X)\, p(F^X \mid U^X, Z)\, q(U^X)\, q(Z)$   (10)

$\mathcal{F} = \int Q \log \frac{p(Y \mid F^Y)\, p(F^Y \mid X)\, p(X \mid F^X)\, p(F^X \mid Z)\, p(Z)}{p(F^Y \mid U^Y, X)\, q(U^Y)\, q(X)\, p(F^X \mid U^X, Z)\, q(U^X)\, q(Z)}$
$\phantom{\mathcal{F}} = \int Q \log \frac{p(Y \mid F^Y)\, p(U^Y)\, p(X \mid F^X)\, p(U^X)\, p(Z)}{q(U^Y)\, q(X)\, q(U^X)\, q(Z)}$
$\phantom{\mathcal{F}} = g_Y + r_X + H(q(X)) - \mathrm{KL}(q(Z) \,\|\, p(Z))$   (11)
Deep Gaussian Processes
Parameters
- $M$ $D$-dimensional pseudo points per layer
- $D + 2$ hyperparameters per layer
- $D + 1$ variational parameters per layer

Complexity
- $O(NM^2L)$ training
- $O(M^2L)$ generative (with precomputations)
Deep Density Model (DDM)
Introduced by Rippel and Adams [2013].
Main idea:
- Find an invertible mapping $\mathbf{y} = f(\mathbf{x})$ such that $p(\mathbf{x})$ has independent dimensions with given marginals: $p(\mathbf{x}) = \prod_{k=1}^{K} p_{X_k}(x_k)$
- i.e. transform the data such that the distribution in the "representation space" is known.

Consequences:
- This makes the density over $\mathbf{y}$ easy to calculate (a toy sketch follows below):
  $p(\mathbf{y}) = \prod_{k=1}^{K} p_{X_k}\!\big([f^{-1}(\mathbf{y})]_k\big)\, \left|\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right|^{-1}_{\mathbf{x} = f^{-1}(\mathbf{y})}$
- Normalised and fast to compute $p(\mathbf{y}^*)$!
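A toy sketch of the change-of-variables computation above, using a linear map in place of the deep network so the Jacobian is trivial, and standard normal marginals for the $p_{X_k}$; all names and numbers are illustrative:

```python
import numpy as np
from scipy.stats import norm

# Toy invertible map y = f(x) = A x + b (the DDM uses a deep net; a linear map keeps
# the Jacobian constant).  Marginals p_{X_k} are chosen as standard normals.
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])
b = np.array([1.0, -1.0])
A_inv = np.linalg.inv(A)
log_abs_det_jac = np.log(abs(np.linalg.det(A)))   # |df/dx| is constant for a linear map

def log_p_y(y):
    """Change of variables: log p(y) = sum_k log p_{X_k}([f^{-1}(y)]_k) - log|df/dx|."""
    x = A_inv @ (y - b)                            # f^{-1}(y)
    return norm.logpdf(x).sum() - log_abs_det_jac

print(log_p_y(np.array([1.2, -0.3])))              # normalised density, cheap to evaluate
```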
DDM: Parameterisation
The decoder $f(\mathbf{x})$ and its inverse $g(\mathbf{y}) \approx f^{-1}(\mathbf{y}) = \mathbf{x}$ are both parameterised by deep neural networks:

$f_\Theta(\mathbf{x}) = \sigma(\Omega_M \,\cdot + \omega_M) \circ \sigma(\Omega_{M-1} \,\cdot + \omega_{M-1}) \circ \cdots \circ \sigma(\Omega_1 \mathbf{x} + \omega_1)$
$g_\Psi(\mathbf{y}) = \sigma(\Gamma_M \,\cdot + \gamma_M) \circ \sigma(\Gamma_{M-1} \,\cdot + \gamma_{M-1}) \circ \cdots \circ \sigma(\Gamma_1 \mathbf{y} + \gamma_1)$

We find $g \approx f^{-1}$ separately because this has regularisation benefits (discussed shortly).
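A minimal numpy sketch of the two stacks above; the layer sizes, random initialisation and the choice of a logistic $\sigma$ are illustrative assumptions, not details from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_stack(shapes, rng):
    """One (Omega_m, omega_m) pair per layer; sizes are illustrative."""
    return [(0.1 * rng.normal(size=(dout, din)), np.zeros(dout)) for din, dout in shapes]

def apply_stack(layers, v):
    """sigma(Omega_M . + omega_M) o ... o sigma(Omega_1 v + omega_1)."""
    for Omega, omega in layers:
        v = sigmoid(Omega @ v + omega)
    return v

rng = np.random.default_rng(0)
f_theta = make_stack([(3, 8), (8, 3)], rng)   # decoder f_Theta
g_psi   = make_stack([(3, 8), (8, 3)], rng)   # approximate inverse g_Psi, fit separately
y = apply_stack(f_theta, np.zeros(3))
x_recon = apply_stack(g_psi, y)               # should approximate the original input after training
```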
DDM: Objective Function
The cost function:
$C(\Theta, \Psi) = \mu_D D(\Psi) + \mu_I I(\Theta) + \mu_R R(\Theta, \Psi)$

- $D(\Psi)$: divergence penalty
- $I(\Theta)$: invertibility penalty
- $R(\Theta, \Psi)$: reconstruction loss
DDM: Divergence penalty
Forces the model to distribute the mass of the data in the representation space similarly to the desired distribution $p(\mathbf{x})$.

$D(\Psi) = \frac{1}{K} \sum_{k=1}^{K} T\big(\hat{p}_{X_k}(\cdot) \,\|\, p_{X_k}(\cdot)\big) \;+\; \frac{1}{N} \sum_{n=1}^{N} T\big(\hat{p}_{\mathbf{X}}(\mathbf{x}_n) \,\|\, p_{\mathbf{X}}(\mathbf{x}_n)\big)$

where $\hat{p}_{X_k}$ denotes the empirical marginal in representation space, $\hat{p}_{X_k}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - [\mathbf{x}_n]_k)$.
DDM: Invertibility penalty

$I(\Theta) = \frac{1}{M} \sum_{m=1}^{M} \log\!\left(\frac{\lambda_{\max}(\Omega_m)}{\lambda_{\min}(\Omega_m)}\right)$

Ensures invertibility, so $p(\mathbf{y})$ is well defined (a small sketch follows below).
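A small sketch of computing the penalty, assuming the $\lambda$'s denote the extreme singular values of each weight matrix $\Omega_m$ (an assumption about the notation; a near-singular layer makes the ratio, and hence the penalty, blow up):

```python
import numpy as np

def invertibility_penalty(Omegas):
    """I(Theta) = (1/M) sum_m log(lambda_max(Omega_m) / lambda_min(Omega_m))."""
    logs = []
    for Omega in Omegas:
        s = np.linalg.svd(Omega, compute_uv=False)   # singular values, descending
        logs.append(np.log(s.max() / s.min()))
    return np.mean(logs)

rng = np.random.default_rng(0)
print(invertibility_penalty([rng.normal(size=(5, 5)) for _ in range(3)]))
```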
DDM: Reconstruction loss
- Ensures that $g_\Psi(\mathbf{y}) \approx f^{-1}_\Theta(\mathbf{y})$, i.e. information is preserved.
- As a consequence, marginals are forced to become independent:

$H(p_{\mathbf{Y}}(\cdot)) = \sum_{k=1}^{K} H(p_{X_k}(\cdot)) \;-\; D\!\left(\prod_{k=1}^{K} p_{X_k}(\cdot) \,\Big\|\, p_{\mathbf{X}}(\cdot)\right) \;+\; \mathbb{E}\!\left[\log \left|\frac{\partial f_\Theta(\cdot)}{\partial \mathbf{y}}\right|\right]$
Deep Latent Gaussian Models
Claims to be:
- Deep: for capturing higher moments of the data
- Non-linear: allowing for complex structure in the data
- Fast: in terms of sampling fantasy data
- Scalable: in terms of the size of the training set
Also provides a nice link to autoencoders.
DLGM: The Model
Generative process:

$\xi_l \sim \mathcal{N}(\xi_l \mid 0, I)$   (12)
$h_L = G_L \xi_L$   (13)
$h_l = T_l(h_{l+1}) + G_l \xi_l$   (14)
$v \sim p(v \mid T_0(h_1))$   (15)

- Transformation $T_l$ is a multilayer perceptron
- Parameters $\theta^g$ contain the $G_l$ and the MLP parameters
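A small numpy sketch of sampling fantasy data from equations (12)-(15), with a tanh stand-in for the learned MLPs $T_l$ and a Gaussian observation model; layer sizes and functions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L, dims = 3, [4, 4, 4]                       # number of layers and latent sizes (illustrative)
G = [rng.normal(size=(d, d)) for d in dims]

def mlp(h):
    """Stand-in for the per-layer MLP T_l; the model learns these."""
    return np.tanh(h)

# Equations (12)-(15): sample top-down, then draw from the observation model p(v | T_0(h_1)).
xi = [rng.normal(size=d) for d in dims]      # xi_l ~ N(0, I)
h = G[-1] @ xi[-1]                           # h_L = G_L xi_L
for l in range(L - 2, -1, -1):
    h = mlp(h) + G[l] @ xi[l]                # h_l = T_l(h_{l+1}) + G_l xi_l
v = mlp(h) + 0.1 * rng.normal(size=dims[0])  # e.g. a Gaussian p(v | T_0(h_1))
```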
DLGM: Stochastic Back-Propagation
- Variationally integrate out $\xi$:
  $\mathcal{L}(V) = -D_{\mathrm{KL}}(q(\xi) \,\|\, p(\xi)) + \mathbb{E}_q[\log p(V \mid \xi, \theta^g)]$
- Need gradients w.r.t. $\theta^g$ and the variational parameters

The trick for the variational parameters:
- $\nabla_{\mu_l} \mathbb{E}_q[\log p(V \mid \xi, \theta^g)] = \mathbb{E}\big[\nabla_{\xi_l} \log p(v \mid h(\xi))\big]$
- Backpropagation gives the gradients given $\xi$
- Sampling can give an unbiased estimate of the gradient (see the sketch below)
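A toy sketch of this estimator: draw $\xi$ from $q$, take gradients of the log-likelihood with respect to $\xi$ (here by finite differences standing in for backpropagation), and average them to estimate the gradient with respect to the variational mean $\mu$. Everything here (the likelihood, sizes, names) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(v, xi, theta):
    """Toy likelihood log p(v | xi): Gaussian around a nonlinearity of xi."""
    mean = np.tanh(theta @ xi)
    return -0.5 * np.sum((v - mean) ** 2)

def grad_xi_log_lik(v, xi, theta, eps=1e-5):
    """Numerical gradient w.r.t. xi (backpropagation would supply this exactly)."""
    g = np.zeros_like(xi)
    for i in range(xi.size):
        d = np.zeros_like(xi); d[i] = eps
        g[i] = (log_lik(v, xi + d, theta) - log_lik(v, xi - d, theta)) / (2 * eps)
    return g

# Unbiased Monte Carlo estimate of grad_mu E_q[log p(v | xi)] using the identity
# grad_mu E_q[...] = E_q[grad_xi ...] for a Gaussian q with mean mu.
theta = rng.normal(size=(3, 3))
v = rng.normal(size=3)
mu, C_chol = np.zeros(3), np.eye(3)
samples = [mu + C_chol @ rng.normal(size=3) for _ in range(100)]
grad_mu = np.mean([grad_xi_log_lik(v, xi, theta) for xi in samples], axis=0)
```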
DLGM: Training
Training buzzwords: "Learning to learn Variational Stochastic Back-Propagation"

- Estimate $\mu_l$ and $C_l$ using the recognition model
- Sample $\xi$ for each layer so the $h$'s can be calculated
- Calculate the gradient estimates w.r.t. $\theta^g$
- Calculate the gradient estimates w.r.t. $\theta^r$
DLGM: Similarity to Autoencoders
Autoencoder objective function:
$\mathcal{L}(\theta) = \frac{1}{n} \sum_{V \sim p(V),\ \tilde{V} \sim C(\tilde{V} \mid V)} \big[\lambda_n \Omega(\theta, V, \tilde{V}) - \log P_\theta(V \mid \tilde{V})\big]$

Deep Latent Gaussian Model:
$\mathcal{L}(V) = -D_{\mathrm{KL}}(q(\xi) \,\|\, p(\xi)) + \mathbb{E}_q[\log p(V \mid \xi, \theta^g)]$

- Autoencoders aim to maximise the probability of reconstructing a corrupted sample
- The noisy sample $\tilde{V}$ is analogous to the latent Gaussians $\xi$
- The learned posterior $q(\xi \mid v)$ is analogous to the encoder
Comparison
- All: attempt to map the observed space to some latent space with an easy density
- Some proofs about why this may be good: Gaussianisation
- The latent dimensionality differs between the models
- Inference varies wildly
References
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, January 2009. doi: 10.1561/2200000006. URL http://dx.doi.org/10.1561/2200000006.

Johan Hastad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1(2):113-129, 1991. doi: 10.1007/BF01272517. URL http://dx.doi.org/10.1007/BF01272517.

Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models, 2013.