

Variational Dropout and the Local Reparameterization Trick

Diederik Kingma, Tim Salimans, Max Welling

Presented by: Changwei Hu

Jan 29, 2016

1 / 16


Main Idea

When the variance of the gradients is large, stochastic gradient descent may fail

Propose an SGVB estimator whose variance is inversely proportional to the minibatch size

Use the local reparameterization trick to make the estimator computationally efficient

Propose variational dropout within the framework of variational inference

The dropout rate is learned instead of being fixed

2 / 16


Contents

1 Background

2 Local Reparameterization Trick

3 Variational Dropout

4 Experimental Results

3 / 16


Variational Inference

Optimize the variational parameters φ of some parameterized model q_φ(w) such that q_φ(w) is a close approximation to the true posterior p(w | D).
w: parameters (weights) of the model; D: data

In practice, maximize the variational lower bound L(φ) of the marginal likelihood of the data:

L(φ) = −D_KL(q_φ(w) || p(w)) + L_D(φ)    (1)

where L_D(φ) = Σ_{(x,y)∈D} E_{q_φ(w)}[ log p(y | x, w) ]    (2)

L_D(φ): expected log-likelihood; (x, y) ∈ D: observed input-output pairs

4 / 16


Stochastic Gradient Variational Bayes (SGVB)

SGVB parameterizes the random weights w ∼ q_φ(w) as w = f(ε, φ).
f(·): a differentiable function; ε ∼ p(ε): a random noise variable

An unbiased minibatch-based Monte Carlo estimator of the expected log-likelihood can be formed:

L_D(φ) ≃ L_D^SGVB(φ) = (N/M) Σ_{i=1}^{M} log p(y^i | x^i, w = f(ε, φ))    (3)

where (x^i, y^i)_{i=1}^{M} is a minibatch of M random datapoints and N is the dataset size.
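As a rough illustration of estimator (3) (not the paper's code), the sketch below assumes a factorized Gaussian posterior with parameters φ = (µ, log σ) and a hypothetical per-datapoint log-likelihood function log_lik:

```python
import numpy as np

def sgvb_estimate(x_batch, y_batch, mu, log_sigma, log_lik, N):
    """Single-sample SGVB estimate of the expected log-likelihood L_D(phi).

    mu, log_sigma : variational parameters phi of a factorized Gaussian q_phi(w)
    log_lik       : hypothetical function returning log p(y | x, w) for one datapoint
    N             : total dataset size; M = len(x_batch) is the minibatch size
    """
    M = len(x_batch)
    eps = np.random.randn(*mu.shape)   # eps ~ p(eps) = N(0, I)
    w = mu + np.exp(log_sigma) * eps   # w = f(eps, phi): the reparameterization
    # Scale the minibatch sum by N / M so the estimate is unbiased for the full-data sum.
    return (N / M) * sum(log_lik(x, y, w) for x, y in zip(x_batch, y_batch))
```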

5 / 16


Variance of SGVB Estimator

Define L_i = log p(y^i | x^i, w = f(ε^i, φ)), so that L_D^SGVB(φ) = (N/M) Σ_{i=1}^{M} L_i.

The variance of L_D^SGVB(φ) is given by

Var[L_D^SGVB(φ)] = (N²/M²) ( Σ_{i=1}^{M} Var[L_i] + 2 Σ_{i=1}^{M} Σ_{j=i+1}^{M} Cov[L_i, L_j] )    (4)

                 = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] )    (5)

where (5) assumes the variances and covariances are equal across datapoints.

The contribution of Var[L_i] to the total variance is inversely proportional to the minibatch size M.

The contribution of the covariances does not decrease with M, so the variance of L_D^SGVB(φ) can be dominated by the covariances even for moderately large M (illustrated by the toy simulation below).
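A toy numpy simulation (my own illustration, not from the paper) of this effect: when one noise sample is shared across the whole minibatch the estimator variance plateaus, whereas independent per-example noise gives variance that shrinks roughly as 1/M:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_estimator_variance(M, shared_noise, n_trials=20000):
    """Variance of a toy minibatch estimator (1/M) * sum_i L_i with L_i = x_i * (1 + eps).

    shared_noise=True  -> one eps per minibatch, so the L_i are correlated
    shared_noise=False -> a fresh eps per example, so the L_i are independent
    """
    x = rng.normal(1.0, 1.0, size=(n_trials, M))
    noise_shape = (n_trials, 1) if shared_noise else (n_trials, M)
    eps = rng.normal(0.0, 1.0, size=noise_shape)
    estimates = np.mean(x * (1.0 + eps), axis=1)
    return estimates.var()

for M in (1, 10, 100):
    print(M, toy_estimator_variance(M, True), toy_estimator_variance(M, False))
```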

6 / 16


Local Reparameterization Trick

Var[L_D^SGVB(φ)] = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] )

What this paper does:

propose an estimator for which Cov[L_i, L_j] = 0, so that the variance scales as 1/M

make the estimator computationally efficient by not sampling ε directly, but only sampling the intermediate variables f(ε) through which ε influences L_D^SGVB(φ)

7 / 16


Local Reparameterization Trick

Example:

A standard fully connected neural network containing a hidden layer of 1000 neurons.

The hidden layer receives an M × 1000 input feature matrix A, which is multiplied by a 1000 × 1000 weight matrix W, i.e. B = AW (before the nonlinearity is applied).

Specify the posterior on W to be Gaussian: q_φ(w_{i,j}) = N(µ_{i,j}, σ²_{i,j}), i.e. w_{i,j} = µ_{i,j} + σ_{i,j} ε_{i,j} with ε_{i,j} ∼ N(0, 1).

To ensure Cov[L_i, L_j] = 0:

sample a separate weight matrix W for each example in the minibatch

this is not computationally efficient: it requires sampling M × 1000 × 1000, i.e. M million, random numbers for this single layer (sketched below)
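A hedged numpy sketch of this naive decorrelation strategy (illustrative shapes and values only; the minibatch is kept small so the example runs quickly, but the sample count still scales as M × 1000 × 1000):

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, J = 4, 1000, 1000              # minibatch size (kept small here), input dim, output dim
A = rng.normal(size=(M, K))          # input activations for the minibatch
mu = 0.01 * rng.normal(size=(K, J))  # posterior means (illustrative values)
sigma = np.full((K, J), 0.1)         # posterior standard deviations (illustrative values)

# Naive decorrelation: draw a separate weight matrix per example,
# i.e. M * K * J Gaussian samples ("M million" for K = J = 1000) for this one layer.
eps = rng.normal(size=(M, K, J))                     # eps_{m,i,j} ~ N(0, 1)
W_per_example = mu + sigma * eps                     # shape (M, K, J)
B_naive = np.einsum('mk,mkj->mj', A, W_per_example)  # per-example matrix product
```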

8 / 16


Local Reparameterization Trick

Solution: the local reparameterization trick, ε → f(ε)

The weights (and therefore ε) only influence the expected log-likelihood through the neuron activations B

Sample B directly, instead of sampling W or ε

Example: for a factorized Gaussian posterior on the weights, the posterior for the activations is also a factorized Gaussian.

q_φ(w_{i,j}) = N(µ_{i,j}, σ²_{i,j}) ∀ w_{i,j} ∈ W  ⇒  q_φ(b_{m,j} | A) = N(γ_{m,j}, δ_{m,j}),

with γ_{m,j} = Σ_{i=1}^{1000} a_{m,i} µ_{i,j},   δ_{m,j} = Σ_{i=1}^{1000} a²_{m,i} σ²_{i,j}    (6)

Computational cost: only M × 1000 random samples, a thousand-fold saving, because

b_{m,j} = γ_{m,j} + √δ_{m,j} ζ_{m,j},   ζ_{m,j} ∼ N(0, 1)

The local reparameterization trick leads to an estimator with lower variance (see the sketch below).
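The same layer with the local reparameterization trick, as a hedged numpy sketch of equation (6) and the sampling step above:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_reparam_layer(A, mu, sigma):
    """Sample the pre-activations B directly from the factorized Gaussian
    q(b_mj | A) = N(gamma_mj, delta_mj) implied by q(w_ij) = N(mu_ij, sigma_ij^2)."""
    gamma = A @ mu                        # gamma_mj = sum_i a_mi * mu_ij
    delta = (A ** 2) @ (sigma ** 2)       # delta_mj = sum_i a_mi^2 * sigma_ij^2
    zeta = rng.normal(size=gamma.shape)   # zeta_mj ~ N(0, 1)
    return gamma + np.sqrt(delta) * zeta  # b_mj = gamma_mj + sqrt(delta_mj) * zeta_mj

# With A of shape (M, 1000) and mu, sigma of shape (1000, 1000),
# only M * 1000 random numbers are drawn instead of M * 1000 * 1000.
```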

9 / 16


Dropout

For a fully connected neural network, dropout corresponds to

B = (A ◦ ξ)θ, with ξ_{i,j} ∼ p(ξ_{i,j})

A: M × K matrix of input features
θ: K × L weight matrix
B: M × L output matrix for the current layer (before the nonlinearity)
ξ: M × K matrix of independent noise variables

ξ can be Bernoulli distributed

Gaussian distribution N(1, α) for ξ works as well or better
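A minimal numpy sketch of this formulation (my own illustration; the 1/(1 − p) rescaling of kept units in the binary case is a common implementation choice, not something stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(A, theta, alpha=None, p=0.5):
    """Dropout as multiplicative input noise: B = (A ∘ xi) theta.

    alpha is None -> binary dropout, xi ~ Bernoulli(1 - p), kept units rescaled by 1/(1 - p)
    alpha given   -> Gaussian dropout, xi ~ N(1, alpha)
    """
    if alpha is None:
        xi = rng.binomial(1, 1.0 - p, size=A.shape) / (1.0 - p)
    else:
        xi = rng.normal(1.0, np.sqrt(alpha), size=A.shape)
    return (A * xi) @ theta
```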

10 / 16


Variational Dropout

With independent weight noise: if the elements of ξ are drawn independently from N(1, α), then each b_{m,j} ∈ B is Gaussian:

q_φ(b_{m,j} | A) = N(γ_{m,j}, δ_{m,j}),

with γ_{m,j} = Σ_{i=1}^{K} a_{m,i} θ_{i,j}, and δ_{m,j} = α Σ_{i=1}^{K} a²_{m,i} θ²_{i,j}    (7)

Equation (7) can be interpreted as B = AW, where q_φ(w_{i,j}) = N(θ_{i,j}, α θ²_{i,j}).

With correlated weight noise:

B = (A ◦ ξ)θ, ξ_{i,j} ∼ N(1, α)  ⇔  b_m = a_m W, with

W = (w′_1, w′_2, · · · , w′_K)′,   w_i = s_i θ_i,   with q_φ(s_i) = N(1, α)    (8)
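Hedged numpy sketches of both noise types, using the type A / type B naming from the experiments slide; alpha is assumed to be a scalar here:

```python
import numpy as np

rng = np.random.default_rng(0)

def vd_independent_weight_noise(A, theta, alpha):
    """Type B (independent weight noise): sample the pre-activations directly
    from equation (7), q(b_mj | A) = N(gamma_mj, delta_mj)."""
    gamma = A @ theta                          # gamma_mj = sum_i a_mi * theta_ij
    delta = alpha * ((A ** 2) @ (theta ** 2))  # delta_mj = alpha * sum_i a_mi^2 * theta_ij^2
    return gamma + np.sqrt(delta) * rng.normal(size=gamma.shape)

def vd_correlated_weight_noise(A, theta, alpha):
    """Type A (correlated weight noise): one noise value per input element of A,
    shared across all output units, i.e. B = (A ∘ xi) theta with xi ~ N(1, alpha)."""
    s = rng.normal(1.0, np.sqrt(alpha), size=A.shape)
    return (A * s) @ theta
```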

11 / 16


Scale-invariant Prior and Variational Objective

Scale-invariant Prior:

In dropout training, θ is adapted to maximize the expected log-likelihood.

To be consistent with optimization of the variational lower bound, choose the prior p(w) such that D_KL(q_φ(w) || p(w)) does not depend on θ:

p(log |w_{i,j}|) ∝ c

This is the scale-invariant log-uniform prior.

Variational Objective:

q_φ(W) can be decomposed into a parameter θ, which captures the mean, and a multiplicative noise term determined by the parameter α.

Dropout maximizes the following variational lower bound:

E_{q_α}[L_D(θ)] − D_KL(q_α(w) || p(w))    (9)

12 / 16


Adaptive Dropout Rate

−D_KL(q_φ(w) || p(w)) is not analytically tractable, but can be approximated by

−D_KL(q_φ(w_i) || p(w_i)) ≈ constant + 0.5 log(α) + c_1 α + c_2 α² + c_3 α³

Maximize the variational lower bound with respect to α, instead of fixing it by hand.
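As a minimal sketch, the approximation above can be coded directly as a function of α; the coefficients c1, c2, c3 are not hard-coded here and should be taken from the fitted values reported in the paper:

```python
import numpy as np

def neg_kl_approx(alpha, c1, c2, c3):
    """Polynomial approximation of -D_KL(q_phi(w_i) || p(w_i)) as a function of alpha,
    up to an additive constant (the constant does not affect gradients w.r.t. alpha).
    c1, c2, c3: fitted coefficients from the paper (passed in, not assumed here)."""
    return 0.5 * np.log(alpha) + c1 * alpha + c2 * alpha ** 2 + c3 * alpha ** 3
```

This term is then added to the expected log-likelihood, and the resulting lower bound is maximized over α (for example by gradient ascent on log α) jointly with θ.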

13 / 16


Experiments

Datasets:

MNIST

CIFAR-10

Compare with three methods:

standard binary dropout

Gaussian dropout type A: correlated weight noise

Gaussian dropout type B: independent weight uncertainty

14 / 16


Experiments

Variance of gradient

Table 1: Average empirical variance of minibatch stochastic gradient estimates(1000 examples) for a fully connected neural network, regularized by variationaldropout with independent weight noise.

Speed:

Without the local reparameterization trick: 1635 seconds per epoch
With the local reparameterization trick: 7.4 seconds per epoch

15 / 16


Experiments

Figure 1: (a) Comparison of various dropout methods when applied to fully connected neural networks for classification. Shown is the classification error of networks with 3 hidden layers, averaged over 5 runs. The variational versions of Gaussian dropout perform equal to or better than their non-adaptive counterparts. (b) Comparison of dropout methods when applied to a convolutional net for different settings of the network size k. The network has two convolutional layers with 32k and 64k feature maps respectively, followed by two fully connected layers with 128k hidden units each.

16 / 16