
A Review of Deep Models for Density Estimation

Andrew McHutchon & Mark van der Wilk
Thursday 13 March, 2014

Introduction

- Justifying deep architectures (intuition and a theorem)
- Where it started: Deep RBMs
- The Bayesian approach: Deep GPs
- A different approach: Deep Density Models
- A hybrid: Deep Latent Gaussian Models
- Tying it together

2 of 34

Justifying deep architectures

- Some functions cannot be "efficiently" represented by shallow architectures
- Shallow architectures generalise locally, while deep representations allow more global generalisation
- Multi-task learning: low-level features can be useful for several tasks
- Empirical performance

[Bengio, 2009]

3 of 34

Efficiency: Linear threshold networks

Theorem: A monotone weighted threshold circuit of depth k − 1 computing a function f_k ∈ F_{k,N} has size at least 2^{cN} for some constant c > 0 and N > N_0. A depth-k circuit needs only polynomial size. [Hastad and Goldmann, 1991]

- The proof is limited to one type of NN and one class of functions, but it does hint that there can be some advantage to using deep architectures.
- Is this relevant to "infinite capacity" non-parametric models (GPs)?

4 of 34

Generalisation

Bengio: Shallow architectures (kernel learning) generalise locally, while deep representations allow more global generalisation.

To a certain extent, this is a simplification of something we already know.

- Bengio's point: the squared exponential kernel is very widespread in use, but provides only local generalisation.
- Good features can greatly improve generalisation.
- A feature mapping can be incorporated into the kernel (like the periodic kernel); see the sketch after this slide.
- Finding good features/kernels is hard.
- Learning deep representations may be a way of doing this.

5 of 34
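To illustrate the feature-mapping point above: applying a purely local squared exponential kernel after the warping u(x) = (sin(2πx/p), cos(2πx/p)) gives a periodic kernel that generalises across all periods. A minimal numpy sketch; the warping, lengthscale, and period values are illustrative choices, not taken from the slides.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between row-vector inputs A (n x d) and B (m x d)."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def periodic_kernel(x, z, period=1.0, lengthscale=1.0):
    """SE kernel applied after the feature map u(x) = (sin, cos): yields a periodic kernel."""
    u = lambda t: np.column_stack([np.sin(2 * np.pi * t / period),
                                   np.cos(2 * np.pi * t / period)])
    return se_kernel(u(x), u(z), lengthscale=lengthscale)

x = np.linspace(0, 3, 5)
print(periodic_kernel(x, x, period=1.0))  # inputs a whole period apart are perfectly correlated
```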


Deep Belief Networks

Belief net:
- Directed acyclic graph
- Sigmoid belief net [Neal (1992)]
- Binary nodes
- p(y, h) = ∏_{i=1}^{D} Ber(y_i | σ(w_i^T h)) ∏_{j=1}^{E} Ber(h_j | w_{2j})

Restricted Boltzmann machine:
- Undirected bipartite graph
- Binary nodes
- p(y, h) = exp(h^T W y + a^T y + b^T h) / Z

(A small numpy sketch follows this slide.)

6 of 34
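A minimal numpy sketch of the RBM's unnormalised joint probability and the conditional p(h | y); the dimensions and random parameters are illustrative placeholders. Because the graph is bipartite, the hidden units are conditionally independent given y, which is what makes block Gibbs sampling cheap.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 6, 4                              # visible and hidden dimensions
W = rng.normal(scale=0.1, size=(E, D))   # pairwise weights
a = np.zeros(D)                          # visible biases
b = np.zeros(E)                          # hidden biases

def unnormalised_p(y, h):
    """exp(h^T W y + a^T y + b^T h); the partition function Z is intractable in general."""
    return np.exp(h @ W @ y + a @ y + b @ h)

def p_h_given_y(y):
    """Bipartite structure: the hiddens are conditionally independent given y."""
    return 1.0 / (1.0 + np.exp(-(W @ y + b)))   # elementwise sigmoid

y = rng.integers(0, 2, size=D)
h = rng.integers(0, 2, size=E)
print(unnormalised_p(y, h), p_h_given_y(y))
```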

Deep Belief Networks

Deep directed net: training is very difficult due to the "explaining away" effect.

- Posterior on hidden nodes does not factorise
- MCMC: slow
- Variational: too approximate

7 of 34

Deep Belief Networks

Deep Boltzmann machine: Salakhutdinov & Hinton, AISTATS (2009)

8 of 34

Deep Belief Networks

Deep Belief net:
- Hinton, Osindero & Teh, Neural Computation (2006)
- The RBM acts as a "complementary prior" to remove "explaining away"

9 of 34

Deep Belief Networks

[Figure: layers connected by tied weights W, W^T, W — with this construction the posterior on the hidden nodes factorises]

10 of 34


Deep Belief Networks

Training:
1. Iteratively train each layer in a greedy manner (a sketch follows this slide)
2. Fine-tune with an up-down algorithm on a subset of the data
3. Cross-validation step
4. Further training

11 of 34
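A minimal sketch of step 1: greedy layer-wise pretraining of a stack of binary RBMs with one step of contrastive divergence (CD-1) per epoch. The layer sizes, learning rate, and helper names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """CD-1 training of a single binary RBM; returns weights and biases."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(v0 @ W + b)                      # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + a)                    # one Gibbs step down...
        ph1 = sigmoid(pv1 @ W + b)                     # ...and back up
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(data)
        a += lr * (v0 - pv1).mean(0)
        b += lr * (ph0 - ph1).mean(0)
    return W, a, b

def greedy_pretrain(data, layer_sizes):
    """Train a stack of RBMs, feeding each layer's hidden activations to the next."""
    stack, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        stack.append((W, a, b))
        x = sigmoid(x @ W + b)                         # deterministic up-pass
    return stack

stack = greedy_pretrain(rng.integers(0, 2, size=(100, 20)).astype(float), [16, 8])
```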

Deep Gaussian Processes

- Gaussian Process Latent Variable Model [Lawrence (2004)]
- Bayesian Gaussian Process Latent Variable Models [Titsias and Lawrence (2010)]
- Deep Gaussian Processes [Damianou and Lawrence (2013)]

12 of 34

Deep Gaussian Processes

[Diagram: a deep GP stacks GP mappings Z → X_{H−1} → … → X_1 → Y, with a GP between each pair of layers; the two-layer special case Z → X → Y is used in what follows]

13 of 34

Deep Gaussian Processes

[Diagram: Z → F^X → X → F^Y → Y, a GP mapping plus additive noise ε^x, ε^y at each stage]

F^X ~ GP(0, k_x(Z, Z))    (1)
x_n = f^X_n + ε^x_n    (2)
F^Y ~ GP(0, k_y(X, X))    (3)
y_n = f^Y_n + ε^y_n    (4)

(A sketch of sampling from this two-layer prior follows this slide.)

14 of 34
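To make equations (1)–(4) concrete, a sketch that draws one sample from the two-layer prior on a grid of inputs; the squared exponential kernel, the noise levels, and the one-dimensional layers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel for 1-D inputs a (n,) and b (m,)."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def sample_gp(inputs, noise_var=0.01):
    """Draw f ~ GP(0, k(inputs, inputs)) and add Gaussian noise, as in eqs (1)-(4)."""
    K = se_kernel(inputs, inputs) + 1e-8 * np.eye(len(inputs))
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(inputs))
    return f + np.sqrt(noise_var) * rng.standard_normal(len(inputs))

Z = np.linspace(-2, 2, 50)   # top-layer inputs
X = sample_gp(Z)             # hidden layer: X = F^X(Z) + eps_x
Y = sample_gp(X)             # observed layer: Y = F^Y(X) + eps_y
```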

Deep Gaussian Processes

log p(Y) = log ∫ p(Y | F^Y) p(F^Y | X) p(X | F^X) p(F^X | Z) p(Z)    (5)

F = ∫ Q log [ p(Y | F^Y) p(F^Y | X) p(X | F^X) p(F^X | Z) p(Z) / Q ]    (6)

F ≤ log p(Y)    (7)

15 of 34


Deep Gaussian Processes

[Diagram: the same model augmented with inducing variables U^X and U^Y at each layer]

p(F^X | Z) = p(F^X | U^X, Z, Z̃) p(U^X | Z̃)    (8)
p(F^Y | X) = p(F^Y | U^Y, X, X̃) p(U^Y | X̃)    (9)

where Z̃ and X̃ denote the inducing input locations.

16 of 34

Deep Gaussian Processes

Q = p(F^Y | U^Y, X) q(U^Y) q(X) p(F^X | U^X, Z) q(U^X) q(Z)    (10)

F = ∫ Q log [ p(Y | F^Y) p(F^Y | X) p(X | F^X) p(F^X | Z) p(Z) / ( p(F^Y | U^Y, X) q(U^Y) q(X) p(F^X | U^X, Z) q(U^X) q(Z) ) ]

  = ∫ Q log [ p(Y | F^Y) p(U^Y) p(X | F^X) p(U^X) p(Z) / ( q(U^Y) q(X) q(U^X) q(Z) ) ]

  = g_Y + r_X + H(q(X)) − KL(q(Z) ‖ p(Z))    (11)

17 of 34


Deep Gaussian Processes

- USPS digits 0, 1, 6
- 5 layers
- Only 150 training data points!

18 of 34

Deep Gaussian Processes

Parameters:
- M D-dimensional pseudo points per layer
- D + 2 hyperparameters per layer
- D + 1 variational parameters per layer

Complexity:
- O(NM²L) training
- O(M²L) generative (with precomputations)

19 of 34

Deep Density Model (DDM)

Introduced by Rippel and Adams [2013].

Main idea:
- Find an invertible mapping y = f(x) such that p(x) has independent dimensions with given marginals: p(x) = ∏_{k=1}^{K} p_{X_k}(x_k)
- i.e. transform the data such that the distribution in the "representation space" is known.

Consequences:
- This makes the density over y easy to calculate (see the sketch after this slide):
  p(y) = ∏_{k=1}^{K} p_{X_k}([f⁻¹(y)]_k) · |∂f(x)/∂x|⁻¹ evaluated at x = f⁻¹(y)
- Normalised and fast to compute p(y*)!

20 of 34
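A minimal sketch of that change-of-variables computation for a toy invertible map (an elementwise tanh warp); the map, the standard-normal marginals p_{X_k}, and the dimensions are illustrative assumptions, not the deep networks used in the paper.

```python
import numpy as np
from scipy.stats import norm

def f(x):
    """Toy invertible map y = f(x): elementwise and strictly monotone."""
    return np.tanh(x)

def f_inv(y):
    return np.arctanh(y)

def log_p_y(y):
    """Change of variables: log p(y) = sum_k log p_Xk(x_k) - log|det df/dx| at x = f^-1(y)."""
    x = f_inv(y)
    log_px = norm.logpdf(x).sum()                     # independent standard-normal marginals
    log_det_jac = np.log(1.0 - np.tanh(x)**2).sum()   # diagonal Jacobian of tanh
    return log_px - log_det_jac

print(log_p_y(np.array([0.1, -0.3, 0.5])))
```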


DDM: Parameterisation

The decoder f(x) and its approximate inverse g(y) ≈ f⁻¹(y) are both parameterised by deep neural networks:

f_Θ(x) = σ(Ω_M · + ω_M) ∘ σ(Ω_{M−1} · + ω_{M−1}) ∘ ⋯ ∘ σ(Ω_1 x + ω_1)
g_Ψ(y) = σ(Γ_M · + γ_M) ∘ σ(Γ_{M−1} · + γ_{M−1}) ∘ ⋯ ∘ σ(Γ_1 y + γ_1)

We find g ≈ f⁻¹ separately because this has regularisation benefits (discussed shortly). A small sketch of the composed parameterisation follows this slide.

21 of 34
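A sketch of the composed parameterisation above as plain numpy functions; the number of layers, the layer sizes, and the random parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # elementwise sigmoid

def make_net(sizes):
    """Return the composition x -> sigma(W_M(... sigma(W_1 x + w_1) ...) + w_M)."""
    params = [(rng.normal(scale=0.5, size=(n_out, n_in)), np.zeros(n_out))
              for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    def net(x):
        for W, w in params:
            x = sigma(W @ x + w)
        return x
    return net

K = 5
f_theta = make_net([K, K, K])   # decoder f_Theta
g_psi = make_net([K, K, K])     # separately parameterised approximate inverse g_Psi
x = rng.standard_normal(K)
print(g_psi(f_theta(x)))        # the reconstruction loss pushes this towards x during training
```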

DDM: Objective Function

The cost function:
C(Θ, Ψ) = μ_D D(Ψ) + μ_I I(Θ) + μ_R R(Θ, Ψ)

- D(Ψ): divergence penalty
- I(Θ): invertibility penalty
- R(Θ, Ψ): reconstruction loss

22 of 34

DDM: Divergence penalty

Forces the model to distribute the mass of the data in the representation space similarly to the desired distribution p(x).

D(Ψ) = (1/K) ∑_{k=1}^{K} T(p̂_{X_k}(·) ‖ p_{X_k}(·)) + (1/N) ∑_{n=1}^{N} T(p̂_X(x_n) ‖ p_X(x_n))

p̂_{X_k}(x) = (1/N) ∑_{n=1}^{N} δ(x − [x_n]_k)

where p̂ denotes the empirical distribution of the encoded data.

23 of 34

DDM: Invertability penalty

I(Θ) = (1/M) ∑_{m=1}^{M} log( λ_max(Ω_m) / λ_min(Ω_m) )

Ensures invertibility, so p(y) is well defined. (A small sketch follows this slide.)

24 of 34
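A minimal sketch of this penalty, reading λ_max and λ_min as the largest and smallest singular values of each weight matrix Ω_m (an assumption on my part; the slide does not say which spectrum is meant).

```python
import numpy as np

def invertibility_penalty(weight_matrices):
    """Mean log condition number of the layer weight matrices, as in I(Theta)."""
    total = 0.0
    for Omega in weight_matrices:
        s = np.linalg.svd(Omega, compute_uv=False)   # singular values, descending
        total += np.log(s[0] / s[-1])                # log(lambda_max / lambda_min)
    return total / len(weight_matrices)

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)) for _ in range(3)]
print(invertibility_penalty(layers))
```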

DDM: Reconstruction loss

- Ensures that g_Ψ(y) ≈ f_Θ⁻¹(y): information is preserved.
- As a consequence, the marginals are forced to become independent:

H(p_Y(·)) = ∑_{k=1}^{K} H(p_{X_k}(·)) − D(∏_{k=1}^{K} p_{X_k}(·) ‖ p_X(·)) + E[ log |∂f_Θ(·)/∂y| ]

25 of 34

DDM: Results

- 1.6% error rate on MNIST

26 of 34

Deep Latent Gaussian Models

Claims to be:
- Deep: for capturing higher moments of the data
- Non-linear: allowing for complex structure in the data
- Fast: in terms of sampling fantasy data
- Scalable: in terms of the size of the training set

Also provides a nice link to autoencoders.

27 of 34

DLGM: The Model

Generative process

ξ_l ~ N(ξ_l | 0, I)    (12)
h_L = G_L ξ_L    (13)
h_l = T_l(h_{l+1}) + G_l ξ_l    (14)
v ~ p(v | T_0(h_1))    (15)

- The transformation T_l is a multilayer perceptron
- The parameters θ^g contain the G_l and the MLP parameters

(A sketch of sampling from this process follows this slide.)

28 of 34
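A minimal sketch of drawing a fantasy sample from the generative process, equations (12)–(15), with two latent layers; the layer sizes, the random G_l, and the tanh MLPs standing in for T_l are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
L, latent_dim, hidden_dim, data_dim = 2, 4, 8, 16

# Random stand-ins for the model parameters theta_g
G = [rng.normal(scale=0.5, size=(hidden_dim, latent_dim)) for _ in range(L)]
def make_mlp(out_dim, in_dim):
    W = rng.normal(scale=0.5, size=(out_dim, in_dim))
    return lambda h: np.tanh(W @ h)
T = [make_mlp(data_dim, hidden_dim)] + [make_mlp(hidden_dim, hidden_dim) for _ in range(1, L)]

# Generative process, top layer down (code index l corresponds to math layer l+1)
xi = [rng.standard_normal(latent_dim) for _ in range(L)]   # xi_l ~ N(0, I)                  (12)
h = G[L - 1] @ xi[L - 1]                                   # h_L = G_L xi_L                  (13)
for l in range(L - 2, -1, -1):
    h = T[l + 1](h) + G[l] @ xi[l]                         # h_l = T_l(h_{l+1}) + G_l xi_l   (14)
mean_v = T[0](h)                                           # v ~ p(v | T_0(h_1))             (15)
v = mean_v + 0.1 * rng.standard_normal(data_dim)           # e.g. a Gaussian observation model
```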

DLGM: Stochastic Back-Propagation

- Variationally integrate out ξ:
  L(V) = −D_KL(q(ξ) ‖ p(ξ)) + E_q[log p(V | ξ, θ^g)]
- Need gradients w.r.t. θ^g and the variational parameters

The trick for the variational parameters (see the sketch after this slide):

- ∇_{μ_l} E_q[log p(V | ξ, θ^g)] = E_q[∇_{ξ_l} log p(v | h(ξ))]
- Backpropagation gives the gradients given ξ
- Sampling can give an unbiased estimate of the gradient

29 of 34
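A minimal numeric check of that identity for a toy one-layer case, q(ξ) = N(μ, I) and a quadratic log-likelihood whose gradient we can write down by hand; everything here, including log p, is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.5, -1.0, 2.0])
A = np.diag([1.0, 2.0, 0.5])

def grad_log_p(xi):
    """Gradient of a toy log-likelihood log p(v | xi) = -0.5 * xi^T A xi."""
    return -A @ xi

# Monte Carlo estimate of grad_mu E_q[log p] via E_q[grad_xi log p], sampling xi = mu + eps
samples = mu + rng.standard_normal((20000, 3))
mc_grad = np.mean([grad_log_p(xi) for xi in samples], axis=0)

exact_grad = -A @ mu         # for this quadratic case the expectation is available in closed form
print(mc_grad, exact_grad)   # the two agree up to Monte Carlo noise
```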


DLGM: Training

Training buzzwords: "Learning to learn Variational Stochastic Back-Propagation"

- Estimate μ_l and C_l using the recognition model
- Sample ξ for each layer so the h's can be calculated
- Calculate the gradient estimates w.r.t. θ^g
- Calculate the gradient estimates w.r.t. θ^r

30 of 34

DLGM: Similarity to Autoencoders

Autoencoder objective function:
L(θ) = (1/n) ∑_{V ~ p(V), Ṽ ~ C(Ṽ | V)} [ λ_n Ω(θ, V, Ṽ) − log P_θ(V | Ṽ) ]

Deep Latent Gaussian Model:
L(V) = −D_KL(q(ξ) ‖ p(ξ)) + E_q[log p(V | ξ, θ^g)]

- Autoencoders aim to maximise the probability of reconstructing a corrupted sample
- The noisy sample Ṽ is analogous to the latent Gaussians ξ
- The learned posterior q(ξ | v) is analogous to the encoder

31 of 34

Comparison

- All: attempt to map the observed space to some latent space with an easy density
- Some proofs about why this may be good: Gaussianisation
- The latent dimension is different for each model
- Inference varies wildly

32 of 34

What would be cool

33 of 34

References

Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, January 2009. doi: 10.1561/2200000006. URL http://dx.doi.org/10.1561/2200000006.

Johan Hastad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1(2):113–129, 1991. doi: 10.1007/BF01272517. URL http://dx.doi.org/10.1007/BF01272517.

Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models, 2013.

34 of 34