

Variational Dropout and the Local Reparameterization Trick

Diederik Kingma, Tim Salimans, Max Welling

Presented by: Changwei Hu

Jan 29, 2016

1 / 16


Main Idea

When the variance of the gradients is large, stochastic gradient descent may fail

Propose an SGVB estimator whose variance is inversely proportional to the minibatch size

Use the local reparameterization trick to make the estimator computationally efficient

Propose variational dropout within the framework of variational inference

The dropout rate is learned instead of being fixed

2 / 16


Contents

1 Background

2 Local Reparameterization Trick

3 Variational Dropout

4 Experimental Results

3 / 16


Variational Inference

Optimize the variational parameters φ of some parameterized model q_φ(w) such that q_φ(w) is a close approximation to the true posterior p(w | D).
w: parameters (weights) of the model; D: data

In practice, maximize the variational lower bound L(φ) of the marginal likelihood of the data:

L(φ) = −D_KL(q_φ(w) || p(w)) + L_D(φ)    (1)

where L_D(φ) = Σ_{(x,y)∈D} E_{q_φ(w)}[ log p(y | x, w) ]    (2)

L_D(φ): expected log-likelihood; (x, y) ∈ D: observed input-output pairs

4 / 16


Stochastic Gradient Variational Bayes (SGVB)

SGVB parameterizes the random weights w ∼ q_φ(w) as w = f(ε, φ).
f(·): a differentiable function; ε ∼ p(ε): a random noise variable

An unbiased minibatch-based Monte Carlo estimator of the expected log-likelihood can be formed:

L_D(φ) ≃ L_D^SGVB(φ) = (N/M) Σ_{i=1}^{M} log p(y^i | x^i, w = f(ε, φ))    (3)

where (x^i, y^i)_{i=1}^{M} is a minibatch of M random datapoints and N is the dataset size.
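As a rough illustration of estimator (3) (not the paper's code), the sketch below assumes a factorized Gaussian posterior with parameters φ = (µ, log σ) and a hypothetical per-datapoint log-likelihood function log_lik:

```python
import numpy as np

def sgvb_estimate(x_batch, y_batch, mu, log_sigma, log_lik, N):
    """Single-sample SGVB estimate of the expected log-likelihood L_D(phi).

    mu, log_sigma : variational parameters phi of a factorized Gaussian q_phi(w)
    log_lik       : hypothetical function returning log p(y | x, w) for one datapoint
    N             : total dataset size; M = len(x_batch) is the minibatch size
    """
    M = len(x_batch)
    eps = np.random.randn(*mu.shape)   # eps ~ p(eps) = N(0, I)
    w = mu + np.exp(log_sigma) * eps   # w = f(eps, phi): the reparameterization
    # Scale the minibatch sum by N / M so the estimate is unbiased for the full-data sum.
    return (N / M) * sum(log_lik(x, y, w) for x, y in zip(x_batch, y_batch))
```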

5 / 16


Variance of SGVB Estimator

Define L_i = log p(y^i | x^i, w = f(ε^i, φ)), so that L_D^SGVB(φ) = (N/M) Σ_{i=1}^{M} L_i.

The variance of L_D^SGVB(φ) is given by

Var[L_D^SGVB(φ)] = (N²/M²) ( Σ_{i=1}^{M} Var[L_i] + 2 Σ_{i=1}^{M} Σ_{j=i+1}^{M} Cov[L_i, L_j] )    (4)

                 = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] )    (5)

where (5) assumes the variances and covariances are equal across datapoints.

The contribution of Var[L_i] to the total variance is inversely proportional to the minibatch size M.

The contribution of the covariances does not decrease with M, so the variance of L_D^SGVB(φ) can be dominated by the covariances even for moderately large M (illustrated by the toy simulation below).
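A toy numpy simulation (my own illustration, not from the paper) of this effect: when one noise sample is shared across the whole minibatch the estimator variance plateaus, whereas independent per-example noise gives variance that shrinks roughly as 1/M:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_estimator_variance(M, shared_noise, n_trials=20000):
    """Variance of a toy minibatch estimator (1/M) * sum_i L_i with L_i = x_i * (1 + eps).

    shared_noise=True  -> one eps per minibatch, so the L_i are correlated
    shared_noise=False -> a fresh eps per example, so the L_i are independent
    """
    x = rng.normal(1.0, 1.0, size=(n_trials, M))
    noise_shape = (n_trials, 1) if shared_noise else (n_trials, M)
    eps = rng.normal(0.0, 1.0, size=noise_shape)
    estimates = np.mean(x * (1.0 + eps), axis=1)
    return estimates.var()

for M in (1, 10, 100):
    print(M, toy_estimator_variance(M, True), toy_estimator_variance(M, False))
```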

6 / 16


Local Reparameterization Trick

Var[L_D^SGVB(φ)] = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] )

What this paper does:

propose an estimator for which Cov[L_i, L_j] = 0, so that the variance scales as 1/M

make the estimator computationally efficient by not sampling ε directly, but only sampling the intermediate variables f(ε) through which ε influences L_D^SGVB(φ)

7 / 16


Local Reparameterization Trick

Example:

A standard fully connected neural network containing a hidden layer of 1000 neurons.

The hidden layer receives an M × 1000 input feature matrix A, which is multiplied by a 1000 × 1000 weight matrix W, i.e. B = AW (before the nonlinearity is applied).

Specify the posterior on W to be Gaussian: q_φ(w_{i,j}) = N(µ_{i,j}, σ²_{i,j}), i.e. w_{i,j} = µ_{i,j} + σ_{i,j} ε_{i,j} with ε_{i,j} ∼ N(0, 1).

To ensure Cov[L_i, L_j] = 0:

sample a separate weight matrix W for each example in the minibatch

this is not computationally efficient: it requires sampling M × 1000 × 1000, i.e. M million, random numbers for this single layer (sketched below)
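A hedged numpy sketch of this naive decorrelation strategy (illustrative shapes and values only; the minibatch is kept small so the example runs quickly, but the sample count still scales as M × 1000 × 1000):

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, J = 4, 1000, 1000              # minibatch size (kept small here), input dim, output dim
A = rng.normal(size=(M, K))          # input activations for the minibatch
mu = 0.01 * rng.normal(size=(K, J))  # posterior means (illustrative values)
sigma = np.full((K, J), 0.1)         # posterior standard deviations (illustrative values)

# Naive decorrelation: draw a separate weight matrix per example,
# i.e. M * K * J Gaussian samples ("M million" for K = J = 1000) for this one layer.
eps = rng.normal(size=(M, K, J))                     # eps_{m,i,j} ~ N(0, 1)
W_per_example = mu + sigma * eps                     # shape (M, K, J)
B_naive = np.einsum('mk,mkj->mj', A, W_per_example)  # per-example matrix product
```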

8 / 16


Local Reparameterization Trick

Solution: the local reparameterization trick, ε → f(ε)

The weights (and therefore ε) only influence the expected log-likelihood through the neuron activations B

Sample B directly, instead of sampling W or ε

Example: for a factorized Gaussian posterior on the weights, the posterior for the activations is also a factorized Gaussian.

q_φ(w_{i,j}) = N(µ_{i,j}, σ²_{i,j}) ∀ w_{i,j} ∈ W  ⇒  q_φ(b_{m,j} | A) = N(γ_{m,j}, δ_{m,j}),

with γ_{m,j} = Σ_{i=1}^{1000} a_{m,i} µ_{i,j},   δ_{m,j} = Σ_{i=1}^{1000} a²_{m,i} σ²_{i,j}    (6)

Computational cost: only M × 1000 random samples, a thousand-fold saving, because

b_{m,j} = γ_{m,j} + √δ_{m,j} ζ_{m,j},   ζ_{m,j} ∼ N(0, 1)

The local reparameterization trick leads to an estimator with lower variance (see the sketch below).
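The same layer with the local reparameterization trick, as a hedged numpy sketch of equation (6) and the sampling step above:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_reparam_layer(A, mu, sigma):
    """Sample the pre-activations B directly from the factorized Gaussian
    q(b_mj | A) = N(gamma_mj, delta_mj) implied by q(w_ij) = N(mu_ij, sigma_ij^2)."""
    gamma = A @ mu                        # gamma_mj = sum_i a_mi * mu_ij
    delta = (A ** 2) @ (sigma ** 2)       # delta_mj = sum_i a_mi^2 * sigma_ij^2
    zeta = rng.normal(size=gamma.shape)   # zeta_mj ~ N(0, 1)
    return gamma + np.sqrt(delta) * zeta  # b_mj = gamma_mj + sqrt(delta_mj) * zeta_mj

# With A of shape (M, 1000) and mu, sigma of shape (1000, 1000),
# only M * 1000 random numbers are drawn instead of M * 1000 * 1000.
```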

9 / 16


Dropout

For a fully connected neural network, dropout corresponds to

B = (A ◦ ξ)θ, with ξ_{i,j} ∼ p(ξ_{i,j})

A: M × K matrix of input features
θ: K × L weight matrix
B: M × L output matrix for the current layer (before the nonlinearity)
ξ: M × K matrix of independent noise variables

ξ can be Bernoulli distributed

Gaussian distribution N(1, α) for ξ works as well or better
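A minimal numpy sketch of this formulation (my own illustration; the 1/(1 − p) rescaling of kept units in the binary case is a common implementation choice, not something stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(A, theta, alpha=None, p=0.5):
    """Dropout as multiplicative input noise: B = (A ∘ xi) theta.

    alpha is None -> binary dropout, xi ~ Bernoulli(1 - p), kept units rescaled by 1/(1 - p)
    alpha given   -> Gaussian dropout, xi ~ N(1, alpha)
    """
    if alpha is None:
        xi = rng.binomial(1, 1.0 - p, size=A.shape) / (1.0 - p)
    else:
        xi = rng.normal(1.0, np.sqrt(alpha), size=A.shape)
    return (A * xi) @ theta
```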

10 / 16


Variational Dropout

With independent weight noise: if the elements of ξ are drawn independently from N(1, α), then each b_{m,j} ∈ B is Gaussian:

q_φ(b_{m,j} | A) = N(γ_{m,j}, δ_{m,j}),

with γ_{m,j} = Σ_{i=1}^{K} a_{m,i} θ_{i,j}, and δ_{m,j} = α Σ_{i=1}^{K} a²_{m,i} θ²_{i,j}    (7)

Equation (7) can be interpreted as B = AW, where q_φ(w_{i,j}) = N(θ_{i,j}, α θ²_{i,j}).

With correlated weight noise:

B = (A ◦ ξ)θ, ξ_{i,j} ∼ N(1, α)  ⇔  b_m = a_m W, with

W = (w′_1, w′_2, · · · , w′_K)′,   w_i = s_i θ_i,   with q_φ(s_i) = N(1, α)    (8)
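Hedged numpy sketches of both noise types, using the type A / type B naming from the experiments slide; alpha is assumed to be a scalar here:

```python
import numpy as np

rng = np.random.default_rng(0)

def vd_independent_weight_noise(A, theta, alpha):
    """Type B (independent weight noise): sample the pre-activations directly
    from equation (7), q(b_mj | A) = N(gamma_mj, delta_mj)."""
    gamma = A @ theta                          # gamma_mj = sum_i a_mi * theta_ij
    delta = alpha * ((A ** 2) @ (theta ** 2))  # delta_mj = alpha * sum_i a_mi^2 * theta_ij^2
    return gamma + np.sqrt(delta) * rng.normal(size=gamma.shape)

def vd_correlated_weight_noise(A, theta, alpha):
    """Type A (correlated weight noise): one noise value per input element of A,
    shared across all output units, i.e. B = (A ∘ xi) theta with xi ~ N(1, alpha)."""
    s = rng.normal(1.0, np.sqrt(alpha), size=A.shape)
    return (A * s) @ theta
```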

11 / 16


Scale-invariant Prior and Variational Objective

Scale-invariant Prior:

In dropout training, θ is adapted to maximize the expected log-likelihood.

To be consistent with optimization of the variational lower bound, choose the prior p(w) such that D_KL(q_φ(w) || p(w)) does not depend on θ:

p(log |w_{i,j}|) ∝ c

This is the scale-invariant log-uniform prior.

Variational Objective:

q_φ(W) can be decomposed into a parameter θ, which captures the mean, and a multiplicative noise term determined by the parameter α.

Dropout maximizes the following variational lower bound:

E_{q_α}[L_D(θ)] − D_KL(q_α(w) || p(w))    (9)

12 / 16


Adaptive Dropout Rate

−D_KL(q_φ(w) || p(w)) is not analytically tractable, but can be approximated by

−D_KL(q_φ(w_i) || p(w_i)) ≈ constant + 0.5 log(α) + c_1 α + c_2 α² + c_3 α³

Maximize the variational lower bound with respect to α, instead of fixing it by hand.
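As a minimal sketch, the approximation above can be coded directly as a function of α; the coefficients c1, c2, c3 are not hard-coded here and should be taken from the fitted values reported in the paper:

```python
import numpy as np

def neg_kl_approx(alpha, c1, c2, c3):
    """Polynomial approximation of -D_KL(q_phi(w_i) || p(w_i)) as a function of alpha,
    up to an additive constant (the constant does not affect gradients w.r.t. alpha).
    c1, c2, c3: fitted coefficients from the paper (passed in, not assumed here)."""
    return 0.5 * np.log(alpha) + c1 * alpha + c2 * alpha ** 2 + c3 * alpha ** 3
```

This term is then added to the expected log-likelihood, and the resulting lower bound is maximized over α (for example by gradient ascent on log α) jointly with θ.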

13 / 16


Experiments

Datasets:

MNIST

CIFAR-10

Compare with three methods:

standard binary dropout

Gaussian dropout type A: correlated weight noise

Gaussian dropout type B: independent weight uncertainty

14 / 16


Experiments

Variance of gradient

Table 1: Average empirical variance of minibatch stochastic gradient estimates(1000 examples) for a fully connected neural network, regularized by variationaldropout with independent weight noise.

Speed:

Without the local reparameterization trick: 1635 seconds per epoch
With the local reparameterization trick: 7.4 seconds per epoch

15 / 16


Experiments

Figure 1: (a) Comparison of various dropout methods when applied to fully connected neural networks for classification. Shown is the classification error of networks with 3 hidden layers, averaged over 5 runs. The variational versions of Gaussian dropout perform equal to or better than their non-adaptive counterparts. (b) Comparison of dropout methods when applied to a convolutional net for different settings of the network size k. The network has two convolutional layers with 32k and 64k feature maps respectively, followed by two fully connected layers with 128k hidden units each.

16 / 16