(研究会輪読) weight uncertainty in neural networks

WeightUncertainty inNeuralNetworks04/11/2015

MASAHIROSUZUKI

PaperInformationTitle:WeightUncertaintyinNeuralNetworks(ICML2015)

Authors:CharlesBlundell,Jullen Cornebise,Koray Kavukcuoglu,DaanWierstra◦ GoogleDeepMind

TheyproposedBayesbyBackprop

Motivation◦ I’dliketoknowhowtotreatthemodel “uncertainty’’indeeplearningapproach.

◦ IlikeBayesian approach.

Overfitting andUncertaintyOverfitting◦ Plainfeedforward neuralnetworksarepronetooverfitting.Uncertainty◦ NNoftenincapableofcorrectlyassessingtheuncertainty inthetrainingdata.

→Overlyconfidentdecisions

(a) Softmax input as a function of data x: f(x) (b) Softmax output as a function of data x: �(f(x))

Figure 1: A sketch of softmax input and output for an idealised binary classification problem.Training data is given between the dashed grey lines. Function point estimate is shown with a solidline. Function uncertainty is shown with a shaded area. Marked with a dashed red line is a pointx

⇤ far from the training data. Ignoring function uncertainty, point x⇤ is classified as class 1 withprobability 1.have made use of NNs for Q-value function approximation. These are functions that estimate thequality of different actions an agent can make. Epsilon greedy search is often used where the agentselects its best action with some probability and explores otherwise. With uncertainty estimates overthe agent’s Q-value function, techniques such as Thompson sampling [11] can be used to learn muchfaster.

Bayesian probability theory offers us mathematically grounded tools to reason about model uncer-tainty, but these usually come with a prohibitive computational cost. It is perhaps surprising thenthat it is possible to cast recent deep learning tools as Bayesian models – without changing eitherthe models or the optimisation. We show that the use of dropout (and its variants) in NNs can beinterpreted as a Bayesian approximation of a well known probabilistic model: the Gaussian process(GP) [12]. Dropout is used in many models in deep learning as a way to avoid over-fitting [13],and our interpretation suggests that dropout approximately integrates over the models’ weights. Wedevelop tools for representing model uncertainty of existing dropout NNs – extracting informationthat has been thrown away so far. This mitigates the problem of representing model uncertainty indeep learning without sacrificing either computational complexity or test accuracy.

In this paper we give a complete theoretical treatment of the link between Gaussian processes anddropout, and develop the tools necessary to represent uncertainty in deep learning. We perform anextensive exploratory assessment of the properties of the uncertainty obtained from dropout NNs andconvnets on the tasks of regression and classification. We compare the uncertainty obtained fromdifferent model architectures and non-linearities in regression, and show that model uncertaintyis indispensable for classification tasks, using MNIST as a concrete example. We then show aconsiderable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods. Lastly we give a quantitative assessment of model uncertainty in the setting ofreinforcement learning, on a practical task similar to that used in deep reinforcement learning [14].

2 Related Research

It has long been known that infinitely wide (single hidden layer) NNs with distributions placed overtheir weights converge to Gaussian processes [15, 16]. This known relation is through a limit argu-ment that does not allow us to translate properties from the Gaussian process to finite NNs easily.Finite NNs with distributions placed over the weights have been studied extensively as Bayesianneural networks [15, 17]. These offer robustness to over-fitting as well, but with challenging infer-ence and additional computational costs. Variational inference has been applied to these models,but with limited success [18, 19, 20]. Recent advances in variational inference introduced newtechniques into the field such as sampling-based variational inference and stochastic variational in-ference [21, 22, 23, 24, 25]. These have been used to obtain new approximations for Bayesian neuralnetworks that perform as well as dropout [26]. However these models come with a prohibitive com-putational cost. To represent uncertainty, the number of parameters in these models is doubled forthe same network size. Further, they require more time to converge and do not improve on existingtechniques. Given that good uncertainty estimates can be cheaply obtained from common dropoutmodels, this results in unnecessary additional computation. An alternative approach to variationalinference makes use of expectation propagation [27] and has improved considerably in RMSE anduncertainty estimation on VI approaches such as [20]. In the results section we compare dropout tothese approaches and show a significant improvement in both RMSE and uncertainty estimation.

2

Trainingdata

Functionpoint

estimate

Withoutuncertainty:class1withprobability1

Withuncertainty:betterreflectsclassificationuncertainty

[Galet.al2015]

Functionuncertainty

HowtopreventoverfittingVariousregularizationschemeshavebeenproposed.◦ Earlystopping◦ Weightdecay◦ Dropout

Thispaperaddressesthisproblembyusingvariational Bayesianlearning tointroduceuncertaintyintheweightsofthenetwork.

↓

BayesbyBackprop

ContributionTheyproposedBayesbyBackprop◦ Asimpleapproximatelearningalgorithmsimilartobackpropagation.◦ Allweightarerepresentedbyprobabilitydistributionsoverpossiblevalues.

Itachievesgoodresultsinseveraldomains.◦ Classification◦ Regression◦ Banditproblem

Weight Uncertainty in Neural Networks

H1 H2 H3 1

X 1

Y

0.5 0.1 0.7 1.3

1.40.3

1.2

0.10.1 0.2

H1 H2 H3 1

X 1

Y

Figure 1. Left: each weight has a fixed value, as provided by clas-sical backpropagation. Right: each weight is assigned a distribu-tion, as provided by Bayes by Backprop.

is related to recent methods in deep, generative modelling(Kingma and Welling, 2014; Rezende et al., 2014; Gregoret al., 2014), where variational inference has been appliedto stochastic hidden units of an autoencoder. Whilst thenumber of stochastic hidden units might be in the order ofthousands, the number of weights in a neural network iseasily two orders of magnitude larger, making the optimisa-tion problem much larger scale. Uncertainty in the hiddenunits allows the expression of uncertainty about a particularobservation, uncertainty in the weights is complementaryin that it captures uncertainty about which neural networkis appropriate, leading to regularisation of the weights andmodel averaging.

This uncertainty can be used to drive exploration in contex-tual bandit problems using Thompson sampling (Thomp-son, 1933; Chapelle and Li, 2011; Agrawal and Goyal,2012; May et al., 2012). Weights with greater uncertaintyintroduce more variability into the decisions made by thenetwork, leading naturally to exploration. As more data areobserved, the uncertainty can decrease, allowing the deci-sions made by the network to become more deterministicas the environment is better understood.

The remainder of the paper is organised as follows: Sec-tion 2 introduces notation and standard learning in neuralnetworks, Section 3 describes variational Bayesian learn-ing for neural networks and our contributions, Section 4describes the application to contextual bandit problems,whilst Section 5 contains empirical results on a classifica-tion, a regression and a bandit problem. We conclude witha brief discussion in Section 6.

2. Point Estimates of Neural Networks

We view a neural network as a probabilistic modelP (y|x,w): given an input x 2 Rp a neural network as-signs a probability to each possible output y 2 Y , usingthe set of parameters or weights w. For classification, Y isa set of classes and P (y|x,w) is a categorical distribution –this corresponds to the cross-entropy or softmax loss, when

the parameters of the categorical distribution are passedthrough the exponential function then re-normalised. Forregression Y is R and P (y|x,w) is a Gaussian distribution– this corresponds to a squared loss.

Inputs x are mapped onto the parameters of a distribu-tion on Y by several successive layers of linear transforma-tion (given by w) interleaved with element-wise non-lineartransforms.

The weights can be learnt by maximum likelihood estima-tion (MLE): given a set of training examples D = (x

i

,y

i

)

i

,the MLE weights wMLE are given by:

w

MLE= argmax

wlogP (D|w)

= argmax

w

X

i

logP (y

i

|xi

,w).

This is typically achieved by gradient descent (e.g., back-propagation), where we assume that logP (D|w) is differ-entiable in w.

Regularisation can be introduced by placing a prior uponthe weights w and finding the maximum a posteriori(MAP) weights wMAP:

w

MAP= argmax

wlogP (w|D)

= argmax

wlogP (D|w) + logP (w).

If w are given a Gaussian prior, this yields L2 regularisa-tion (or weight decay). If w are given a Laplace prior, thenL1 regularisation is recovered.

3. Being Bayesian by Backpropagation

Bayesian inference for neural networks calculates the pos-terior distribution of the weights given the training data,P (w|D). This distribution answers predictive queriesabout unseen data by taking expectations: the predictivedistribution of an unknown label ˆ

y of a test data item ˆ

x,is given by P (

ˆ

y|ˆx) = EP (w|D)[P (

ˆ

y|ˆx,w)]. Each pos-sible configuration of the weights, weighted according tothe posterior distribution, makes a prediction about the un-known label given the test data item ˆ

x. Thus taking anexpectation under the posterior distribution on weights isequivalent to using an ensemble of an uncountably infi-nite number of neural networks. Unfortunately, this is in-tractable for neural networks of any practical size.

Previously Hinton and Van Camp (1993) and Graves(2011) suggested finding a variational approximation to theBayesian posterior distribution on the weights. Variationallearning finds the parameters ✓ of a distribution on theweights q(w|✓) that minimises the Kullback-Leibler (KL)

Classicalbackpropagation BayesbyBackprop

RelatedWorksVariational approximation◦ [Graves2011]→ thegradientsofthiscanbemadeunbiasedandthismethodcanbeusedwithnon-Gaussianpriors.

Uncertaintyinthehiddenunits◦ [Kingma andWelling,2014][Rezende etal.,2014][Gregor etal.,2014]◦ Variational autoencoder→ thenumberofweightsinaneuralnetworkiseasilytwoordersofmagnitudelarger

ContextualbanditproblemsusingThompsonsampling◦ [Thompson,1933][Chapelle andLi,2011][AgrawalandGoyal,2012][Mayetal.,2012]

→Weightswithgreateruncertaintyintroducemorevariabilityintothedecisionsmadebythenetwork,leadingnaturallytoexploration.

PointEstimatesofNNNeuralnetwork:𝑝(𝑦|𝑥,𝑤)◦ Input:𝑥 ∈ ℝ+

◦ Output :𝑦 ∈ 𝒴◦ Thesetofparameters:𝑤◦ Cross-entropy(categoricaldistiribution), squaredloss(Gaussiandistribution)

Learning𝐷 = (𝑥/, 𝑦/)◦ MLE:

𝑤012 = argmax8

log𝑝 𝐷 𝑤 = argmax8

;log𝑝(𝑦/ |𝑥/,𝑤)/

◦ MAP:𝑤0<= = argmax

8log𝑝 𝑤 𝐷 = argmax

8log𝑝(𝐷|𝑤) + log 𝑝(𝑤)

BeingBayesianThepredictivedistribution:𝑝(𝑦?|𝑥?)◦ anunknown label:𝑦?◦ atestdataitem:𝑥?

𝑝 𝑦? 𝑥? = @𝑝 𝑦? 𝑥?,𝑤 𝑝 𝑤 𝐷 𝑑𝑤 = 𝔼+(8|C)[𝑝 𝑦? 𝑥?,𝑤 ]

Takingexpectation=anensembleofanuncountably infinitenumberofNN↓

Intractable

Variational LearningTheBasyan posteriordistributionontheweight :𝑞(𝑤|𝜃)◦ parameters: 𝜃

Theposteriordistributiongiventhetrainingdata:𝑝 𝑤 𝐷

Findtheparameters𝜃thatminimizestheKL divergence:

𝜃∗ = arg minLℱ 𝐷,𝜃

whereℱ 𝐷, 𝜃 = 𝐾𝐿 𝑞 𝑤 𝜃 𝑝 𝑤 𝐷= 𝐾𝐿 𝑞 𝑤 𝜃 𝑝(𝑤) − 𝔼Q 𝑤 𝜃 [log𝑝(𝐷|𝑤)]

data-dependentpart(likelihoodcost)

prior-dependentpart(complexitycost)

UnbiasedMonteCarlogradientsProposition1.Let𝜀 bearandomvariablehavingaprobabilitydensitygivenby𝑞(𝜀) andlet𝑤 = 𝑡(𝜃, 𝜀)where𝑡(𝜃, 𝜀) isadeterministicfunction.Supposefurtherthatthemarginalprobabilitydensityof𝑤,𝑞(𝑤|𝜃),issuchthat𝑞(𝜀)𝑑𝜀 = 𝑞(𝑤|𝜃)𝑑𝑤.Thenforafunction𝑓 withderivativesin𝑤:

UUL 𝔼Q(8|L) 𝑓 𝑤, 𝜃 = 𝔼Q(V)

UW(8,L)UL

U8UL +

UW(8,L)UL

AgeneralizationoftheGaussianreparameterization trick

BayesbybackpropWeapproximatetheexpectedlowerboundas

ℱ 𝐷,𝜃 ≈;log𝑞 𝑤(/) 𝜃 − log𝑝(𝑤(/)) − log 𝑝(𝐷|𝑤 / )/

where𝑤(/)~𝑞 𝑤(/) 𝜃 (MonteCarlo)

◦ Everytermof thisapproximatecostdependsupon theparticularweightsdrawnfromthevariational posterior.

Gaussianvariational posterior1. Sample𝜀 ∼ 𝑁(0, 𝐼)2. Let𝑤 = 𝜇 + log(1 + exp(𝜌)) ∘ 𝜀3. Let𝜃 = (𝜇, 𝜌)

4. Let𝑓(𝑤, 𝜃) = log𝑞 𝑤 𝜃 − log𝑝 𝑤 𝑝(𝐷|𝑤)5. Calculatethegradientwithrespecttothemean

∆e = 𝜕𝑓(𝑤, 𝜃)𝜕𝑤 +

𝜕𝑓(𝑤, 𝜃)𝜕𝜇

6. Calculatethegradientwithrespecttothestandarddeviationparameter𝜌∆g=

𝜕𝑓(𝑤, 𝜃)𝜕𝑤

𝜀1+ exp(−𝜌) +

𝜕𝑓(𝑤, 𝜃)𝜕𝜌

7. Updatethevariational parameters:𝜇 ← 𝜇 − 𝛼∆e𝜌 ← 𝜌 − 𝛼∆g

ScalarmixturepriorTheyproposedusingscalemixtureoftwoGaussiandensities astheprior◦ CombinedwithadiagonalGaussianposterior◦ Twodegreesoffreedomperweightonlyincreasesthenumberofparameterstooptimizebyafactoroftwo.

𝑝(𝑤) = ∏ 𝜋𝑁(𝑤l |0,𝜎n) + (1 − 𝜋)𝑁(𝑤l|0, 𝜎o)l

where 𝜎n > 𝜎o,𝜎o ≪ 1

Empiricallytheyfoundoptimizingtheparametersofaprior𝑝(𝑤) tonotbeuseful,andyieldworseresults.◦ Itcanbeeasiertochangethepriorparametersthantheposteriorparameters.◦ Thepriorparameterstrytocapturetheempiricaldistributionoftheweightsatthebeginningoflearning.

→ pick afixed-formpriorand don’t adjust its hyperparamater

Minibatches andKLre-weightingTheminibatch costforminibatch i =1,2,...,M:

𝐹/u(𝐷/, 𝜃) = 𝜋/𝐾𝐿[𝑞(𝑤|𝜃)||𝑃(𝑤)] − 𝔼Q 𝑤 𝜃 [log𝑃 𝐷/ 𝑤 ]whereπ∈ [0,1]0and∑ 𝜋/ = 10

/xn

◦ Then𝔼0 ∑ 𝐹/u 𝐷/ ,𝜃0/xn = 𝐹(𝐷,𝜃)

𝜋 = oyz{

oy|nworkswell◦ thefirstfewminibatches areheavilyinfluenced bythecomplexitycost◦ thelaterminibatches arelargelyinfluencedbythedata

ContextualBanditsContextualBandits◦ Simple reinforcementlearningproblems

Example:ClinicalDecisionMaking

Another Example: Clinical Decision Making

Repeatedly:

1. A patient comes to a doctor withsymptoms, medical history, test results

2. The doctor chooses a treatment

3. The patient responds to it

The doctor wants a policy for choosingtargeted treatments for individual patients.

Context𝑥

Actions𝑎 ∈ [0, . . , 𝐾]

Rewards𝑟~𝑝(𝑟|𝑥, 𝑎,𝑤)

Online trainingtheagent’smodel𝑝 𝑟 𝑥, 𝑎, 𝑤

ThompsonSamplingThompsonsampling◦ Popularmeansofpickinganactionthattrades-offbetweenexploitation andexploration.

◦ NecessitatesaBayesiantreatmentofthemodelparameters

1. Sampleanewsetofparametersforthemodel.

2. Picktheactionwiththehighestexpectedrewardaccordingtothesampledparameters.

3. Updatethemodel.Goto1.

ThompsonSamplingforNNThompsonsamplingiseasilyadaptedtoneuralnetworksusingthevariational pasterior.

1. Sampleweightsfromthevariational posterior:𝑤~𝑞(𝑤|𝜃).

2. Receivethecontext𝑥.

3. Picktheaction𝑎 thatminimises𝔼+(�|�,�,8)[𝑟]

4. Receivereward𝑟.

5. Updatevariationalparameters𝜃.Goto1.

Experiment1. ClassificationonMNIST

2. Regressioncurves

3. BanditsonMushroomTask

ClassificationonMNISTModel:◦ 28×28→(ReLU)→(400,800,1200) →(ReLU)→(400,800,1200) → (Softmax)→10


Table 1. Classification Error Rates on MNIST. ? indicates resultused an ensemble of 5 networks.

Method #U

nits/L

ayer

#W

eig

hts

Test

Error

SGD, no regularisation (Simard et al., 2003) 800 1.3m 1.6%SGD, dropout (Hinton et al., 2012) ⇡ 1.3%SGD, dropconnect (Wan et al., 2013) 800 1.3m 1.2%?

SGD 400 500k 1.83%800 1.3m 1.84%1200 2.4m 1.88%

SGD, dropout 400 500k 1.51%800 1.3m 1.33%1200 2.4m 1.36%

Bayes by Backprop, Gaussian 400 500k 1.82%800 1.3m 1.99%1200 2.4m 2.04%

Bayes by Backprop, Scale mixture 400 500k 1.36%800 1.3m 1.34%1200 2.4m 1.32%

known that variational methods under-estimate uncertainty(Minka, 2001; 2005; Bishop, 2006) which could lead tounder-exploration and premature convergence in practice,but we did not find this in practice.

5. Experiments

We present some empirical evaluation of the methods pro-posed above: on MNIST classification, on a non-linear re-gression task, and on a contextual bandits task.

5.1. Classification on MNIST

We trained networks of various sizes on the MNIST dig-its dataset (LeCun and Cortes, 1998), consisting of 60,000training and 10,000 testing pixel images of size 28 by 28.Each image is labelled with its corresponding number (be-tween zero and nine, inclusive). We preprocessed the pix-els by dividing values by 126. Many methods have beenproposed to improve results on MNIST: generative pre-training, convolutions, distortions, etc. Here we shall focuson improving the performance of an ordinary feedforwardneural network without using any of these methods. Weused a network of two hidden layers of rectified linear units(Nair and Hinton, 2010; Glorot et al., 2011), and a softmaxoutput layer with 10 units, one for each possible label.

According to Hinton et al. (2012), the best published feed-forward neural network classification result on MNIST (ex-cluding those using data set augmentation, convolutions,etc.) is 1.6% (Simard et al., 2003), whilst dropout withan L2 regulariser attains errors around 1.3%. Results fromBayes by Backprop are shown in Table 1, for various sized

0.8

1.2

1.6

2.0

0 100 200 300 400 500 600Epochs

Test

erro

r (%

)

AlgorithmBayes by BackpropDropoutVanilla SGD

Figure 2. Test error on MNIST as training progresses.

0

5

10

15

−0.2 −0.1 0.0 0.1 0.2Weight

Den

sity

AlgorithmBayes by BackpropDropoutVanilla SGD

Figure 3. Histogram of the trained weights of the neural network,for Dropout, plain SGD, and samples from Bayes by Backprop.

networks, using either a Gaussian or Gaussian scale mix-ture prior. Performance is comparable to that of dropout,perhaps slightly better, as also see on Figure 2. Note thatwe trained on 50,000 digits and used 10,000 digits as a val-idation set, whilst Hinton et al. (2012) trained on 60,000digits and did not use a validation set. We used the vali-dation set to pick the best hyperparameters (learning rate,number of gradients to average) and so we also repeatedthis protocol for dropout and SGD (Stochastic Gradient De-scent on the MLE objective in Section 2). We consideredlearning rates of 10

�3, 10�4 and 10

�5 with minibatchesof size 128. For Bayes by Backprop, we averaged over ei-ther 1, 2, 5, or 10 samples and considered ⇡ 2 { 1

4 ,12 ,

34},

� log �1 2 {0, 1, 2} and � log �2 2 {6, 7, 8}.

Figure 2 shows the learning curves on the test set for Bayesby Backprop, dropout and SGD on a network with two lay-ers of 1200 rectified linear units. As can be seen, SGDconverges the quickest, initially obtaining a low test er-ror and then overfitting. Bayes by Backprop and dropoutconverge at similar rates (although each iteration of Bayesby Backprop is more expensive than dropout – around twotimes slower). Eventually Bayes by Backprop convergeson a better test error than dropout after 600 epochs.

Figure 3 shows density estimates of the weights. The Bayesby Backprop weights are sampled from the variational pos-terior, and the dropout weights are those used at test time.Interestingly the regularised networks found by dropout

usedanensembleof5networks

ClassificationonMNISTTesterroronMNISTastrainingprogress

◦ BayesbyBackprop anddropout convergeatsimilarrates.

ClassificationonMNISTDensityestimates oftheweights◦ BayesbyBackprop :sampledfromthevariational posterior◦ Dropout:usedattesttime

◦ BayesbyBackprop usesthegreatestrangeofweights

ClassificationonMNISTThelevelofredundancyinaBayesbyBackprop.◦ Replacetheweightswithaconstantzero(fromthelowestsignal-to-noiseratio).


and Bayes by Backprop have a greater range and with fewercentred at zero than those found by SGD. Bayes by Back-prop uses the greatest range of weights.

0.0

0.2

0.4

0.6

0.8

−5.0 −2.5 0.0Signal−to−Noise Ratio (dB)

Dens

ity

0.00

0.25

0.50

0.75

1.00

−7.5 −5.0 −2.5 0.0Signal−to−Noise Ratio (dB)

CDF

Figure 4. Density and CDF of the Signal-to-Noise ratio over allweights in the network. The red line denotes the 75% cut-off.

In Table 2, we examine the effect of replacing the vari-ational posterior on some of the weights with a constantzero, so as to determine the level of redundancy in thenetwork found by Bayes by Backprop. We took a Bayesby Backprop trained network with two layers of 1200units2 and ordered the weights by their signal-to-noise ra-tio (|µ

i

|/�i

). We removed the weights with the lowest sig-nal to noise ratio. As can be seen in Table 2, even when95% of the weights are removed the network still performswell, with a significant drop in performance once 98% ofthe weights have been removed.

In Figure 4 we examined the distribution of the signal-to-noise relative to the cut-off in the network uses in Table 2.The lower plot shows the cumulative distribution of signal-to-noise ratio, whilst the top plot shows the density. Fromthe density plot we see there are two modalities of signal-to-noise ratios, and from the CDF we see that the 75%cut-off separates these two peaks. These two peaks coin-cide with a drop in performance in Table 2 from 1.24%to 1.29%, suggesting that the signal-to-noise heuristic is infact related to the test performance.

2We used a network from the end of training rather than pick-ing a network with a low validation cost found during training,hence the disparity with results in Table 1. The lowest test errorobserved was 1.12%.

Table 2. Classification Errors after Weight pruningProportion removed # Weights Test Error

0% 2.4m 1.24%

50% 1.2m 1.24%

75% 600k 1.24%

95% 120k 1.29%

98% 48k 1.39%

It is interesting to contrast this weight removal approachto obtaining a fast, smaller, sparse network for predictionafter training with the approach taken by distillation (Hin-ton et al., 2014) which requires an extra stage of trainingto obtain a compressed prediction model. As with distil-lation, our method begins with an ensemble (one for eachpossible assignment of the weights). However, unlike dis-tillation, we can simply obtain a subset of this ensemble byusing the probabilistic properties of the weight distributionslearnt to gracefully prune the ensemble down into a smallernetwork. Thus even though networks trained by Bayes byBackprop may have twice as many weights, the number ofparameters that actually need to be stored at run time can befar fewer. Graves (2011) also considered pruning weightsusing the signal to noise ratio, but demonstrated results ona network 20 times smaller and did not prune as high aproportion of weights (at most 11%) whilst still maintain-ing good test performance. The scale mixture prior usedby Bayes by Backprop encourages a broad spread of theweights. Many of these weights can be successfully prunedwithout impacting performance significantly.

5.2. Regression curves

We generated training data from the curve:

y = x+ 0.3 sin(2⇡(x+ ✏)) + 0.3 sin(4⇡(x+ ✏)) + ✏

where ✏ ⇠ N (0, 0.02). Figure 5 shows two examples offitting a neural network to these data, minimising a condi-tional Gaussian loss. Note that in the regions of the inputspace where there are no data, the ordinary neural networkreduces the variance to zero and chooses to fit a particu-lar function, even though there are many possible extrap-olations of the training data. On the left, Bayesian modelaveraging affects predictions: where there are no data, theconfidence intervals diverge, reflecting there being manypossible extrapolations. In this case Bayes by Backpropprefers to be uncertain where there are no nearby data, asopposed to a standard neural network which can be overlyconfident.

5.3. Bandits on Mushroom Task

We take the UCI Mushrooms data set (Bache and Lichman,2013), and cast it as a bandit task, similar to Guez (2015,

RegressioncurvesTrainingdata:

𝑦 = 𝑥 + 0.3 sin 2𝜋 𝑥 + 𝜀 + 0.3 sin 4𝜋 𝑥 + 𝜀 + 𝜀

where𝜀~𝑁(0,0.02)Weight Uncertainty in Neural Networks

x

xx

x

x

x

x

xx

x

x

x

x

xx

x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xxx

x

xx

x

xxxx

x

x

xx

x

xxxxx x

x

x

x

xxx

x

xx

x

x

x

x

x

x

x

x

x

x xx

x

xx

xx

x

x x

xx

x

x

x

xx

xx

x

x

x

x

x

x

xx

xx

x x

x

x

xx

x

x

x

xx

xxxx

x

x

x

x

xx

xx

xxx

x x

x

x

xx x

xx

x

xxx

xx

x

x xx

x

x

xx x

x

xx

x

xx x

x

xxxx x

xx

x

xx

xx

x

x

x

x

x

x

xx

x xx

x

x

x

x x

xxx

xxx

x

x

x

xx

x

x

xx

x

xx

xx

xxx

x

x

x

xxx

xx

x

xx

x

x

xx

x

x

x

x

x

x

x

x

xx

x

xxx

x

x

x

x

x

x

x

xx

x

xx x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xx

xx

x

xx

xxx

x

x

xx

x

x

xx x

x

xx

x

x

x x

x

xx

x

x

x

x

x

xx

xx

x

x

xx

x

x

xx

x

x

xxx

xx

xx x

x

x

xx

x

x

xx

x

x

xxx

x

xx

xxx xx

xxxx

xx

x

x

x

xx x

xx

xx

x

x x

x

x

xx

xx

x

xxxxx

xx

xxx

x xx

x

xx x

x x

x

x

xx

x

x

x

x

x

xxx

x

x

x

xx

xx

xx

x

x

x

x

xx

x

x

x

x x

x

x

x

x

x

xx

x

xx

x

xx

x

x

x

xx

x

x

xx

xxx

xx

x

x

xxx

x

xx

x

x

xxx

x

xx

x

x

x

xx

x

x

xx

x

xx

x

x

xx

xx

xxx

x

x

xx

xx

x

x

x x

x

xx

x

xx

x

xx

xx

xx

x

x

x

x

x

xx

x

xxx

x xx

xx

xxx

x

xx

x

xxxx

xx

x

xx

x

xxx

x

x x

x

xx

x

x

x

x

xx

x

x

xxx

xx

x

x

x

xx

x

xx

xx

xx

xx

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

−0.4

0.0

0.4

0.8

1.2

0.0 0.4 0.8 1.2

x

xx

x

x

x

x

xx

x

x

x

x

xx

x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xxx

x

xx

x

xxxx

x

x

xx

x

xxxxx x

x

x

x

xxx

x

xx

x

x

x

x

x

x

x

x

x

x xx

x

xx

xx

x

x x

xx

x

x

x

xx

xx

x

x

x

x

x

x

xx

xx

x x

x

x

xx

x

x

x

xx

xxxx

x

x

x

x

xx

xx

xxx

x x

x

x

xx x

xx

x

xxx

xx

x

x xx

x

x

xx x

x

xx

x

xx x

x

xxxx x

xx

x

xx

xx

x

x

x

x

x

x

xx

x xx

x

x

x

x x

xxx

xxx

x

x

x

xx

x

x

xx

x

xx

xx

xxx

x

x

x

xxx

xx

x

xx

x

x

xx

x

x

x

x

x

x

x

x

xx

x

xxx

x

x

x

x

x

x

x

xx

x

xx x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xx

xx

x

xx

xxx

x

x

xx

x

x

xx x

x

xx

x

x

x x

x

xx

x

x

x

x

x

xx

xx

x

x

xx

x

x

xx

x

x

xxx

xx

xx x

x

x

xx

x

x

xx

x

x

xxx

x

xx

xxx xx

xxxx

xx

x

x

x

xx x

xx

xx

x

x x

x

x

xx

xx

x

xxxxx

xx

xxx

x xx

x

xx x

x x

x

x

xx

x

x

x

x

x

xxx

x

x

x

xx

xx

xx

x

x

x

x

xx

x

x

x

x x

x

x

x

x

x

xx

x

xx

x

xx

x

x

x

xx

x

x

xx

xxx

xx

x

x

xxx

x

xx

x

x

xxx

x

xx

x

x

x

xx

x

x

xx

x

xx

x

x

xx

xx

xxx

x

x

xx

xx

x

x

x x

x

xx

x

xx

x

xx

xx

xx

x

x

x

x

x

xx

x

xxx

x xx

xx

xxx

x

xx

x

xxxx

xx

x

xx

x

xxx

x

x x

x

xx

x

x

x

x

xx

x

x

xxx

xx

x

x

x

xx

x

xx

xx

xx

xx

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

−0.4

0.0

0.4

0.8

1.2

0.0 0.4 0.8 1.2

Figure 5. Regression of noisy data with interquatile ranges. Blackcrosses are training samples. Red lines are median predictions.Blue/purple region is interquartile range. Left: Bayes by Back-prop neural network, Right: standard neural network.

1000

10000

0 10000 20000 30000 40000 50000Step

Cum

ulat

ive R

egre

t

5% Greedy1% GreedyGreedyBayes by Backprop

Figure 6. Comparison of cumulative regret of various agents onthe mushroom bandit task, averaged over five runs. Lower is bet-ter.

Chapter 6). Each mushroom has a set of features, which wetreat as the context for the bandit, and is labelled as edibleor poisonous. An agent can either eat or not eat a mush-room. If an agent eats an edible mushroom, then it receivesa reward of 5. If an agent eats a poisonous mushroom, thenwith probability 1

2 it receives a reward of �35, otherwisea reward of 5. If an agent elects not to eat a mushroom,it receives a reward of 0. Thus an agent expects to receivea reward of 5 for eating an edible reward, but an expectedreward of �15 for eating a poisonous mushroom.

Regret measures the difference between the reward achiev-able by an oracle and the reward received by an agent. Inthis case, an oracle will always receive a reward of 5 for anedible mushroom, or 0 for a poisonous mushroom. We takethe cumulative sum of regret of several agents and showthem in Figure 6. Each agent uses a neural network withtwo hidden layers of 100 rectified linear units. The inputto the network is a vector consisting of the mushroom fea-tures (context) and a one of K encoding of the action. Theoutput of the network is a single scalar, representing the ex-pected reward of the given action in the given context. ForBayes by Backprop, we sampled the weights twice and av-eraged two of these outputs to obtain the expected reward

for action selection. We kept the last 4096 reward, contextand action tuples in a buffer, and trained the networks us-ing randomly drawn minibatches of size 64 for 64 trainingsteps (64⇥ 64 = 4096) per interaction with the Mushroombandit. A common heuristic for trading-off exploration vs.exploitation is to follow an "-greedy policy: with proba-bility " propose a uniformly random action, otherwise pickthe best action according to the neural network.

Figure 6 compares a Bayes by Backprop agent with three"-greedy agents, for values of " of 0% (pure greedy), 1%,and 5%. An " of 5% appears to over-explore, whereas apurely greedy agent does poorly at the beginning, greed-ily electing to eat nothing, but then does much better onceit has seen enough data. It seems that non-local functionapproximation updates allow the greedy agent to explore,as for the first 1, 000 steps, the agent eats nothing but afterapproximately 1, 000 the greedy agent suddenly decides toeat mushrooms. The Bayes by Backprop agent exploresfrom the beginning, both eating and ignoring mushroomsand quickly converges on eating and non-eating with an al-most perfect rate (hence the almost flat regret).

6. Discussion

We introduced a new algorithm for learning neural net-works with uncertainty on the weights called Bayes byBackprop. It optimises a well-defined objective functionto learn a distribution on the weights of a neural network.The algorithm achieves good results in several domains.When classifying MNIST digits, performance from Bayesby Backprop is comparable to that of dropout. We demon-strated on a simple non-linear regression problem that theuncertainty introduced allows the network to make morereasonable predictions about unseen data. Finally, for con-textual bandits, we showed how Bayes by Backprop canautomatically learn how to trade-off exploration and ex-ploitation. Since Bayes by Backprop simply uses gradientupdates, it can readily be scaled using multi-machine opti-misation schemes such as asynchronous SGD (Dean et al.,2012). Furthermore, all of the operations used are readilyimplemented on a GPU.

Acknowledgements The authors would like to thank IvoDanihelka, Danilo Rezende, Silvia Chiappa, Alex Graves,Remi Munos, Ben Coppin, Liam Clancy, James Kirk-patrick, Shakir Mohamed, David Pfau, and Theophane We-ber for useful discussions and comments.

References

Shipra Agrawal and Navin Goyal. Analysis of Thompsonsampling for the multi-armed bandit problem. In Pro-ceedings of the 25th Annual Conference On Learning

BayesbyBackprop StandardNN

BanditsonMushroomTaskUCIMushroomsdataset◦ Eachmushroom hasasetoffeaturesandislabelledasedibleorpoisonous.◦ Anediblemushroom :arewardof5◦ Apoisonous mushroom :arewardof−35or5(withprobability0.5)◦ Nottoeatamushroom :arewardof0

Model:◦ Input(contextandaction)→(ReLU)→100→(ReLU)→100→ theexpectedreward

Comparisonapproach:◦ 𝜀-greedypolicy:withprobability𝜀 proposeauniformly randomaction,otherwisepickthebestactionaccordingtotheneuralnetwork.

BanditsonMushroomTaskComparisonofcumulativeregretofvariousagents.◦ Regret:thedifferencebetweentherewardachievablebyanoracleandtherewardreceivedbyanagent.◦ Oracle:anediblemushroom→5, apoisonous mushroom→0

◦ Lowerisbetter.


x

xx

x

x

x

x

xx

x

x

x

x

xx

x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xxx

x

xx

x

xxxx

x

x

xx

x

xxxxx x

x

x

x

xxx

x

xx

x

x

x

x

x

x

x

x

x

x xx

x

xx

xx

x

x x

xx

x

x

x

xx

xx

x

x

x

x

x

x

xx

xx

x x

x

x

xx

x

x

x

xx

xxxx

x

x

x

x

xx

xx

xxx

x x

x

x

xx x

xx

x

xxx

xx

x

x xx

x

x

xx x

x

xx

x

xx x

x

xxxx x

xx

x

xx

xx

x

x

x

x

x

x

xx

x xx

x

x

x

x x

xxx

xxx

x

x

x

xx

x

x

xx

x

xx

xx

xxx

x

x

x

xxx

xx

x

xx

x

x

xx

x

x

x

x

x

x

x

x

xx

x

xxx

x

x

x

x

x

x

x

xx

x

xx x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xx

xx

x

xx

xxx

x

x

xx

x

x

xx x

x

xx

x

x

x x

x

xx

x

x

x

x

x

xx

xx

x

x

xx

x

x

xx

x

x

xxx

xx

xx x

x

x

xx

x

x

xx

x

x

xxx

x

xx

xxx xx

xxxx

xx

x

x

x

xx x

xx

xx

x

x x

x

x

xx

xx

x

xxxxx

xx

xxx

x xx

x

xx x

x x

x

x

xx

x

x

x

x

x

xxx

x

x

x

xx

xx

xx

x

x

x

x

xx

x

x

x

x x

x

x

x

x

x

xx

x

xx

x

xx

x

x

x

xx

x

x

xx

xxx

xx

x

x

xxx

x

xx

x

x

xxx

x

xx

x

x

x

xx

x

x

xx

x

xx

x

x

xx

xx

xxx

x

x

xx

xx

x

x

x x

x

xx

x

xx

x

xx

xx

xx

x

x

x

x

x

xx

x

xxx

x xx

xx

xxx

x

xx

x

xxxx

xx

x

xx

x

xxx

x

x x

x

xx

x

x

x

x

xx

x

x

xxx

xx

x

x

x

xx

x

xx

xx

xx

xx

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

−0.4

0.0

0.4

0.8

1.2

0.0 0.4 0.8 1.2

x

xx

x

x

x

x

xx

x

x

x

x

xx

x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xxx

x

xx

x

xxxx

x

x

xx

x

xxxxx x

x

x

x

xxx

x

xx

x

x

x

x

x

x

x

x

x

x xx

x

xx

xx

x

x x

xx

x

x

x

xx

xx

x

x

x

x

x

x

xx

xx

x x

x

x

xx

x

x

x

xx

xxxx

x

x

x

x

xx

xx

xxx

x x

x

x

xx x

xx

x

xxx

xx

x

x xx

x

x

xx x

x

xx

x

xx x

x

xxxx x

xx

x

xx

xx

x

x

x

x

x

x

xx

x xx

x

x

x

x x

xxx

xxx

x

x

x

xx

x

x

xx

x

xx

xx

xxx

x

x

x

xxx

xx

x

xx

x

x

xx

x

x

x

x

x

x

x

x

xx

x

xxx

x

x

x

x

x

x

x

xx

x

xx x

x

x

x

x

xx

xx

xx

x

x

x

x

xx

xx

xx

x

xx

xxx

x

x

xx

x

x

xx x

x

xx

x

x

x x

x

xx

x

x

x

x

x

xx

xx

x

x

xx

x

x

xx

x

x

xxx

xx

xx x

x

x

xx

x

x

xx

x

x

xxx

x

xx

xxx xx

xxxx

xx

x

x

x

xx x

xx

xx

x

x x

x

x

xx

xx

x

xxxxx

xx

xxx

x xx

x

xx x

x x

x

x

xx

x

x

x

x

x

xxx

x

x

x

xx

xx

xx

x

x

x

x

xx

x

x

x

x x

x

x

x

x

x

xx

x

xx

x

xx

x

x

x

xx

x

x

xx

xxx

xx

x

x

xxx

x

xx

x

x

xxx

x

xx

x

x

x

xx

x

x

xx

x

xx

x

x

xx

xx

xxx

x

x

xx

xx

x

x

x x

x

xx

x

xx

x

xx

xx

xx

x

x

x

x

x

xx

x

xxx

x xx

xx

xxx

x

xx

x

xxxx

xx

x

xx

x

xxx

x

x x

x

xx

x

x

x

x

xx

x

x

xxx

xx

x

x

x

xx

x

xx

xx

xx

xx

x

x

x

x

x

x

x

x

x

x

x

xx

x

x

x

x

x

x

x

x

−0.4

0.0

0.4

0.8

1.2

0.0 0.4 0.8 1.2

Figure 5. Regression of noisy data with interquatile ranges. Blackcrosses are training samples. Red lines are median predictions.Blue/purple region is interquartile range. Left: Bayes by Back-prop neural network, Right: standard neural network.

1000

10000

0 10000 20000 30000 40000 50000Step

Cum

ulat

ive R

egre

t

5% Greedy1% GreedyGreedyBayes by Backprop

Figure 6. Comparison of cumulative regret of various agents onthe mushroom bandit task, averaged over five runs. Lower is bet-ter.

Chapter 6). Each mushroom has a set of features, which wetreat as the context for the bandit, and is labelled as edibleor poisonous. An agent can either eat or not eat a mush-room. If an agent eats an edible mushroom, then it receivesa reward of 5. If an agent eats a poisonous mushroom, thenwith probability 1

2 it receives a reward of �35, otherwisea reward of 5. If an agent elects not to eat a mushroom,it receives a reward of 0. Thus an agent expects to receivea reward of 5 for eating an edible reward, but an expectedreward of �15 for eating a poisonous mushroom.

Regret measures the difference between the reward achiev-able by an oracle and the reward received by an agent. Inthis case, an oracle will always receive a reward of 5 for anedible mushroom, or 0 for a poisonous mushroom. We takethe cumulative sum of regret of several agents and showthem in Figure 6. Each agent uses a neural network withtwo hidden layers of 100 rectified linear units. The inputto the network is a vector consisting of the mushroom fea-tures (context) and a one of K encoding of the action. Theoutput of the network is a single scalar, representing the ex-pected reward of the given action in the given context. ForBayes by Backprop, we sampled the weights twice and av-eraged two of these outputs to obtain the expected reward

for action selection. We kept the last 4096 reward, contextand action tuples in a buffer, and trained the networks us-ing randomly drawn minibatches of size 64 for 64 trainingsteps (64⇥ 64 = 4096) per interaction with the Mushroombandit. A common heuristic for trading-off exploration vs.exploitation is to follow an "-greedy policy: with proba-bility " propose a uniformly random action, otherwise pickthe best action according to the neural network.

Figure 6 compares a Bayes by Backprop agent with three"-greedy agents, for values of " of 0% (pure greedy), 1%,and 5%. An " of 5% appears to over-explore, whereas apurely greedy agent does poorly at the beginning, greed-ily electing to eat nothing, but then does much better onceit has seen enough data. It seems that non-local functionapproximation updates allow the greedy agent to explore,as for the first 1, 000 steps, the agent eats nothing but afterapproximately 1, 000 the greedy agent suddenly decides toeat mushrooms. The Bayes by Backprop agent exploresfrom the beginning, both eating and ignoring mushroomsand quickly converges on eating and non-eating with an al-most perfect rate (hence the almost flat regret).

6. Discussion

We introduced a new algorithm for learning neural net-works with uncertainty on the weights called Bayes byBackprop. It optimises a well-defined objective functionto learn a distribution on the weights of a neural network.The algorithm achieves good results in several domains.When classifying MNIST digits, performance from Bayesby Backprop is comparable to that of dropout. We demon-strated on a simple non-linear regression problem that theuncertainty introduced allows the network to make morereasonable predictions about unseen data. Finally, for con-textual bandits, we showed how Bayes by Backprop canautomatically learn how to trade-off exploration and ex-ploitation. Since Bayes by Backprop simply uses gradientupdates, it can readily be scaled using multi-machine opti-misation schemes such as asynchronous SGD (Dean et al.,2012). Furthermore, all of the operations used are readilyimplemented on a GPU.

Acknowledgements The authors would like to thank IvoDanihelka, Danilo Rezende, Silvia Chiappa, Alex Graves,Remi Munos, Ben Coppin, Liam Clancy, James Kirk-patrick, Shakir Mohamed, David Pfau, and Theophane We-ber for useful discussions and comments.

References

Shipra Agrawal and Navin Goyal. Analysis of Thompsonsampling for the multi-armed bandit problem. In Pro-ceedings of the 25th Annual Conference On Learning

over-explore

quicklyconverges

ConclusionTheyintroducedanewalgorithmforlearningneuralnetworkswithuncertaintyontheweightscalledBayesbyBackprop.

Thealgorithmachievesgoodresultsinseveraldomains.◦ ClassifyingMNISTdigits:Performance fromBayesbyBackprop iscomparabletothatofdropout.

◦ Non-linearregressionproblem :BayesbyBackprop allowsthenetworktomakemorereasonablepredictionsaboutunseendata.

◦ Contextualbandits :BayesbyBackprop canautomaticallylearnhowtotrade-offexplorationandexploitation.

(研究会輪読) weight uncertainty in neural networks

Technology