Understanding Priors in Bayesian Neural Networks at the Unit Level

Mariia Vladimirova 1,2, Jakob Verbeek 1, Pablo Mesejo 3, Julyan Arbel 1

Abstract

We investigate deep Bayesian neural networks with Gaussian weight priors and a class of ReLU-like nonlinearities. Bayesian neural networks with Gaussian priors are well known to induce an $L_2$, "weight decay", regularization. Our results characterize a more intricate regularization effect at the level of the unit activations. Our main result establishes that the induced prior distribution on the units before and after activation becomes increasingly heavy-tailed with the depth of the layer. We show that first layer units are Gaussian, second layer units are sub-exponential, and units in deeper layers are characterized by sub-Weibull distributions. Our results provide new theoretical insight on deep Bayesian neural networks, which we corroborate with simulation experiments.

1. Introduction

Neural networks (NNs), and their deep counterparts (Goodfellow et al., 2016), have largely been used in many research areas such as image analysis (Krizhevsky et al., 2012), signal processing (Graves et al., 2013), or reinforcement learning (Silver et al., 2016), just to name a few. The impressive performance provided by such machine learning approaches has greatly motivated research that aims at a better understanding of the driving mechanisms behind their effectiveness. In particular, the study of the distributional properties of NNs through Bayesian analysis has recently gained much attention.

Bayesian approaches investigate models by assuming a prior distribution on their parameters. Bayesian machine learning refers to extending standard machine learning approaches with posterior inference, a line of research pioneered by works on Bayesian neural networks (Neal, 1992; MacKay, 1992).

1 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. 2 Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia. 3 Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, 18071 Granada, Spain. Correspondence to: Mariia Vladimirova <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, PMLR 97, 2019. Copyright 2019 by the author(s).

There is a large variety of applications, e.g. gene selection (Liang et al., 2018), and the range of models is now very broad, including e.g. Bayesian generative adversarial networks (Saatci & Wilson, 2017); see Polson & Sokolov (2017) for a review. The interest of the Bayesian approach to NNs is at least twofold. First, it offers a principled approach for modeling the uncertainty of the training procedure, whereas standard NNs only provide point estimates. A second main asset of Bayesian models is that they represent regularized versions of their classical counterparts. For instance, maximum a posteriori (MAP) estimation of a Bayesian regression model with a double exponential (Laplace) prior is equivalent to Lasso regression (Tibshirani, 1996), while a Gaussian prior leads to ridge regression. When it comes to NNs, the regularization mechanism is also well appreciated in the literature, since they traditionally suffer from overparameterization, resulting in overfitting.

Central to the field of regularization techniques is the weight decay penalty (Krogh & Hertz, 1991), which is equivalent to MAP estimation of a Bayesian neural network with independent Gaussian priors on the weights. Dropout has more recently been suggested as a regularization method in which neurons are randomly turned off (Srivastava et al., 2014), and Gal & Ghahramani (2016) proved that a neural network of arbitrary depth and with arbitrary non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation of the probabilistic deep Gaussian process (Damianou & Lawrence, 2013), leading to the consideration of such NNs as Bayesian models.

This paper is devoted to the investigation of the prior distributions of hidden units in Bayesian neural networks under the assumption of independent Gaussian weights. We first describe a fully connected neural network architecture, as illustrated in Figure 1. Given an input $x \in \mathbb{R}^N$, the $\ell$-th hidden layer unit activations are defined as

$$g^{(\ell)}(x) = W^{(\ell)} h^{(\ell-1)}(x), \qquad h^{(\ell)}(x) = \phi\big(g^{(\ell)}(x)\big), \qquad (1)$$

where $W^{(\ell)}$ is a weight matrix that includes the bias vector. A nonlinear activation function $\phi : \mathbb{R} \to \mathbb{R}$, called the nonlinearity, is applied element-wise; $g^{(\ell)} = g^{(\ell)}(x)$ is the vector of pre-nonlinearities and $h^{(\ell)} = h^{(\ell)}(x)$ is the vector of post-nonlinearities. When we refer to either pre- or post-nonlinearities, we use the notation $U^{(\ell)}$.


[Figure 1. Neural network architecture and characterization of the $\ell$-th layer units prior distribution as a sub-Weibull distribution with tail parameter $\ell/2$; see Definition 3.3. The input layer $x$ is followed by hidden layers $h^{(1)}, h^{(2)}, h^{(3)}, \dots, h^{(\ell)}$, annotated subW(1/2), subW(1), subW(3/2), ..., subW($\ell/2$).]

Contributions. In this paper, we extend the theoretical understanding of feedforward fully connected NNs by studying prior distributions at the unit level, under the assumption of independent and normally distributed weights. Our contributions are the following:

(i) As our main contribution, we prove in Theorem 3.1 that under some conditions on the activation function $\phi$, a Gaussian prior on the weights induces a sub-Weibull distribution on the units (both pre- and post-nonlinearities) with optimal tail parameter $\theta = \ell/2$, see Figure 1. The condition on $\phi$ essentially imposes that $\phi$ grows to $+\infty$ or $-\infty$ at a linear rate for large absolute values of its argument, as ReLU does. In the case of a bounded $\phi$, like sigmoid or tanh, the units are bounded, making them de facto sub-Gaussian.[1]

(ii) We offer an interpretation of the main result in terms of a more elaborate regularization scheme at the level of the units in Section 4.

In the remainder of the paper, we first discuss related work, and then present our main contributions, starting with the necessary statistical background and theoretical results (i), then moving to intuitions and interpretation (ii), and ending with the description of the experiments and the discussion of the results obtained. More specifically, Section 3 states our main contribution, Theorem 3.1, with a proof sketch, while additional technical results are deferred to the supplementary material. Section 4 illustrates penalization techniques, providing an interpretation of the theorem. Section 5 describes the experiments.

[1] A trivial version of our main result holds; see Remark 3.1.

Conclusions and directions for future work are presented in Section 6.

2. Related work

Studying the distributional behaviour of feedforward networks has been a fruitful avenue for understanding these models, as pioneered by the works of Radford Neal (Neal, 1992; 1996) and David MacKay (MacKay, 1992). The first results in the field addressed the limiting setting where the number of units per layer tends to infinity, also called the wide regime. Neal (1996) proved that a single hidden layer neural network with normally distributed weights tends in distribution, in the wide limit, either to a Gaussian process (Rasmussen & Williams, 2006) or to an α-stable process, depending on how the prior variance on the weights is rescaled. In recent works, Matthews et al. (2018b), its extended version Matthews et al. (2018a), and Lee et al. (2018) extend the result of Neal to neural networks with more than one layer: when the number of hidden units grows to infinity, deep neural networks (DNNs) also tend in distribution to a Gaussian process, under the assumption of Gaussian weights with properly rescaled prior variances. For the rectified linear unit (ReLU) activation function, the Gaussian process covariance function is obtained analytically (Cho & Saul, 2009). For other nonlinear activation functions, Lee et al. (2018) use a numerical approximation algorithm. This Gaussian process approximation is used for instance by Hayou et al. (2019) for improving neural network training strategies. Novak et al. (2019) extend these results by proving the Gaussian process limit for convolutional neural networks.

Various distributional properties are also studied in NN regularization methods. The dropout technique (Srivastava et al., 2014) was reinterpreted as a form of approximate Bayesian variational inference (Kingma et al., 2015; Gal & Ghahramani, 2016). While Gal & Ghahramani (2016) built a connection between dropout and the Gaussian process, Kingma et al. (2015) proposed a way to interpret Gaussian dropout. They suggested variational dropout, where each weight of a model has its individual dropout rate. Sparse variational dropout (Molchanov et al., 2017) extends variational dropout to all possible values of dropout rates, and leads to a sparse solution. The approximate posterior is chosen to factorize either over rows or over individual entries of the weight matrices. The prior usually factorizes in the same way, and the choice of the prior and its interaction with the approximating posterior family are studied by Hron et al. (2018). Performing dropout can be used as a Bayesian approximation but, as noted by Duvenaud et al. (2014), it has no regularization effect on infinitely-wide hidden layers.

Recent work by Bibi et al. (2018) provides the expression of the first two moments of the output units of a one-layer NN.


Obtaining the moments is a first step towards characterizing the full distribution. However, the methodology of Bibi et al. (2018) is limited to the first two moments and to single-layer NNs, while we address the problem in more generality for deep NNs.

3. Bayesian neural networks have heavy-tailed deep units

The deep learning approach uses stochastic gradient descent and error back-propagation in order to fit the network parameters $(W^{(\ell)})_{1 \le \ell \le L}$, where $\ell$ ranges over the network layers. In the Bayesian approach, the parameters are random variables described by probability distributions.

3.1. Assumptions on neural network

We assume a prior distribution on the model parameters, that is, on the weights $W$. In particular, let all weights (including biases) be independent with a zero-mean normal distribution,

$$W^{(\ell)}_{i,j} \sim \mathcal{N}(0, \sigma_w^2), \qquad (2)$$

for all $1 \le \ell \le L$, $1 \le i \le H_{\ell-1}$ and $1 \le j \le H_\ell$, with fixed variance $\sigma_w^2$. Given some input $x$, this prior distribution induces, by forward propagation (1), a prior distribution on the pre- and post-nonlinearities, whose tail properties are the focus of this section. To this aim, the nonlinearity $\phi$ is required to span at least half of the real line, in the following sense. We introduce an extended version of the nonlinearity assumption of Matthews et al. (2018a):
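To make the induced prior concrete, the following minimal Python sketch (an illustration under the assumptions above, not code from the paper; the architecture, the fixed $\sigma_w = 1$ and the function name sample_unit_priors are ours, and biases are omitted) draws weights from the Gaussian prior (2) and forward-propagates a fixed input through (1) with ReLU to obtain prior samples of the pre- and post-nonlinearities:

    import numpy as np

    def sample_unit_priors(x, widths, sigma_w=1.0, n_samples=1000, seed=0):
        """Draw prior samples of pre-/post-nonlinearities by sampling Gaussian
        weights (2) and forward-propagating the fixed input x through (1).
        Returns, per layer, arrays of shape (n_samples, H_ell). Biases omitted."""
        rng = np.random.default_rng(seed)
        pre, post = [], []
        h = np.tile(x, (n_samples, 1))                 # same input for every prior draw
        sizes = [x.size] + list(widths)
        for ell in range(1, len(sizes)):
            W = rng.normal(0.0, sigma_w, size=(n_samples, sizes[ell], sizes[ell - 1]))
            g = np.einsum('nij,nj->ni', W, h)          # pre-nonlinearities g^(ell)
            h = np.maximum(g, 0.0)                     # ReLU post-nonlinearities h^(ell)
            pre.append(g)
            post.append(h)
        return pre, post

    x = np.random.default_rng(1).normal(size=20)
    pre, post = sample_unit_priors(x, widths=[30, 30, 30])
    print([float(np.round(np.std(g[:, 0]), 1)) for g in pre])   # unit scale grows quickly with depth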

Definition 3.1 (Extended envelope property for nonlinearities). A nonlinearity $\phi : \mathbb{R} \to \mathbb{R}$ is said to obey the extended envelope property if there exist $c_1, c_2 \ge 0$ and $d_1, d_2 > 0$ such that the following inequalities hold:

$$|\phi(u)| \ge c_1 + d_1 |u| \quad \text{for all } u \in \mathbb{R}_+ \text{ or for all } u \in \mathbb{R}_-,$$
$$|\phi(u)| \le c_2 + d_2 |u| \quad \text{for all } u \in \mathbb{R}. \qquad (3)$$

The interpretation of this property is that $\phi$ must shoot to infinity in at least one direction ($\mathbb{R}_+$ or $\mathbb{R}_-$) at least linearly (first line of (3)), while growing at most linearly everywhere (second line of (3)). Bounded nonlinearities such as sigmoid and tanh do not satisfy the extended envelope property, but most other nonlinearities do, including ReLU, ELU, PReLU, and SELU.
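For instance, ReLU, $\phi(u) = \max(u, 0)$, satisfies the extended envelope property with $c_1 = c_2 = 0$ and $d_1 = d_2 = 1$: on $\mathbb{R}_+$ we have $|\phi(u)| = |u|$, which gives the lower bound of (3) on one half-line, while $|\phi(u)| = \max(u, 0) \le |u|$ for all $u \in \mathbb{R}$, which gives the upper bound.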

We also recall the definition of asymptotic equivalence between numerical sequences, which we use to describe characterization properties of distributions:

Definition 3.2 (Asymptotic equivalence for sequences). Two sequences $a_k$ and $b_k$ are called asymptotically equivalent, denoted $a_k \asymp b_k$, if there exist constants $d > 0$ and $D > 0$ such that

$$d \le \frac{a_k}{b_k} \le D, \quad \text{for all } k \in \mathbb{N}. \qquad (4)$$

The extended envelope property of a function yields the following asymptotic equivalence:

Lemma 3.1. Let a nonlinearity $\phi : \mathbb{R} \to \mathbb{R}$ obey the extended envelope property. Then for any symmetric random variable $X$ the following asymptotic equivalence holds:

$$\|\phi(X)\|_k \asymp \|X\|_k, \quad \text{for all } k \ge 1, \qquad (5)$$

where $\|X\|_k = \big(\mathbb{E}[|X|^k]\big)^{1/k}$ is the $k$-th norm of $X$.

The proof can be found in the supplementary material.

3.2. Main theorem

This section states the rigorous result with a proof sketch; proofs of intermediate lemmas can be found in the supplementary material.

Firstly, we define the notion of sub-Weibull random variables (Kuchibhotla & Chakrabortty, 2018; Vladimirova & Arbel, 2019).

Definition 3.3 (Sub-Weibull random variable). A random variable $X$ satisfying, for all $x > 0$ and for some constants $a > 0$ and $\theta > 0$,

$$\mathbb{P}(|X| \ge x) \le a \exp\big(-x^{1/\theta}\big), \qquad (6)$$

is called a sub-Weibull random variable with so-called tail parameter $\theta$, which is denoted by $X \sim \mathrm{subW}(\theta)$.

Sub-Weibull distributions are characterized by tails lighter than (or as light as) Weibull distributions, in the same way as sub-Gaussian or sub-exponential distributions correspond to distributions with tails lighter than Gaussian and exponential distributions, respectively. Sub-Weibull distributions are parameterized by a positive tail index $\theta$ and are equivalent to sub-Gaussian distributions for $\theta = 1/2$ and to sub-exponential distributions for $\theta = 1$. To also describe a lower bound on the tail through the sub-Weibull family, i.e. to express that the distribution of $X$ has a tail heavier than that of some Weibull distribution, we define the optimal tail parameter of a distribution as the positive parameter $\theta$ characterized by:

$$\|X\|_k \asymp k^{\theta}. \qquad (7)$$

Then $X$ is sub-Weibull distributed with optimal tail parameter $\theta$, in the sense that for any $\theta' < \theta$, $X$ is not sub-Weibull with tail parameter $\theta'$ (see Vladimirova & Arbel, 2019, for a proof).
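For a concrete example of the characterization (7): if $X$ follows an Exponential(1) distribution, then $\mathbb{P}(X \ge x) = e^{-x}$, so (6) holds with $\theta = 1$; moreover $\mathbb{E}[X^k] = k!$, hence by Stirling's formula $\|X\|_k = (k!)^{1/k} \asymp k$, recovering the optimal tail parameter $\theta = 1$, consistent with exponential distributions being subW(1).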

The following theorem states our main result.


Theorem 3.1 (Sub-Weibull units). Consider a feed-forward Bayesian neural network with Gaussian priors (2) and with a nonlinearity $\phi$ satisfying the extended envelope condition of Definition 3.1. Then, conditional on the input $x$, the marginal prior distribution[2] induced by forward propagation (1) on any unit (pre- or post-nonlinearity) of the $\ell$-th hidden layer is sub-Weibull with optimal tail parameter $\theta = \ell/2$. That is, for any $1 \le \ell \le L$ and any $1 \le m \le H_\ell$,

$$U^{(\ell)}_m \sim \mathrm{subW}(\ell/2),$$

where the subW distribution is defined in Definition 3.3, and $U^{(\ell)}_m$ is either a pre-nonlinearity $g^{(\ell)}_m$ or a post-nonlinearity $h^{(\ell)}_m$.

Proof. The idea is to prove by induction on the hidden layer depth $\ell$ that pre- and post-nonlinearities satisfy the asymptotic moment equivalences

$$\|g^{(\ell)}\|_k \asymp k^{\ell/2} \quad \text{and} \quad \|h^{(\ell)}\|_k \asymp k^{\ell/2}.$$

The statement of the theorem then follows from the moment characterization (7) of the optimal sub-Weibull tail parameter.

According to Lemma 1.1 from the supplementary material, centering does not affect tail properties, so for simplicity we consider zero-mean weights $W^{(\ell)}_{i,j} \sim \mathcal{N}(0, \sigma_w^2)$.

Base step. Consider the distribution of a first hidden layer pre-nonlinearity $g = g^{(1)}$. Since the weights $W_m$ are normally distributed and $x$ is a fixed feature vector, each pre-nonlinearity $W_m^\top x$ is also normally distributed,

$$g = W_m^\top x \sim \mathcal{N}\big(0, \sigma_w^2 \|x\|^2\big).$$

For a zero-mean normal variable $g$ with variance $\sigma^2 = \sigma_w^2 \|x\|^2$, the sub-Gaussian property holds with equality, with variance proxy equal to the normal variance, and Lemma 1.1 in the supplementary material gives

$$\|g\|_k \asymp \sqrt{k}.$$

Since the activation function $\phi$ obeys the extended envelope property, the moments of the post-nonlinearity are asymptotically equivalent to those of the symmetric variable $g$,

$$\|\phi(g)\|_k \asymp \|g\|_k \asymp \sqrt{k}.$$

This implies that the first hidden layer post-nonlinearities $h$ are sub-Gaussian, that is, sub-Weibull with tail parameter $\theta = 1/2$ (Definition 3.3).

[2] We define the marginal prior distribution of a unit as its distribution obtained after all other units' distributions are integrated out. Marginal is to be understood in opposition to joint, or conditional.

Inductive step. We show that if the statement holds for $\ell - 1$, then it also holds for $\ell$.

Suppose the post-nonlinearities of the $(\ell-1)$-th hidden layer satisfy the moment condition. Hidden units satisfy the non-negative covariance property of Theorem 3.2:

$$\mathrm{Cov}\Big[\big(h^{(\ell-1)}\big)^s, \big(h^{(\ell-1)}\big)^t\Big] \ge 0, \quad \text{for any } s, t \in \mathbb{N}.$$

Let the number of hidden units in the $(\ell-1)$-th layer be $H$. Then, according to Lemma 2.2 from the supplementary material, under the assumption of zero-mean Gaussian weights, the pre-nonlinearities of the $\ell$-th hidden layer, $g^{(\ell)} = \sum_{i=1}^{H} W^{(\ell)}_{m,i} h^{(\ell-1)}_i$, also satisfy the moment condition, but with $\theta = \ell/2$:

$$\|g^{(\ell)}\|_k \asymp k^{\ell/2}.$$

By the extended envelope property (Definition 3.1), the post-nonlinearities $h^{(\ell)}$ satisfy the same moment condition as the pre-nonlinearities $g^{(\ell)}$. This finishes the proof.

Remark 3.1. If the activation function $\phi$ is bounded, such as the sigmoid or tanh, then the units are bounded. As a result, by Hoeffding's Lemma, they have a sub-Gaussian distribution.

Remark 3.2. Normalization techniques, such as batch normalization (Ioffe & Szegedy, 2015) or layer normalization (Ba et al., 2016), significantly reduce the training time of feed-forward neural networks. Normalization operations can be decomposed into a set of elementary operations. According to Proposition 1.4 from the supplementary material, elementary operations do not alter the distribution tail parameter. Therefore, normalization methods do not influence the tail behavior.

3.3. Intermediate theorem

This section states, with a proof sketch, that the covariance between hidden units in the neural network is non-negative.

Theorem 3.2 (Non-negative covariance between hidden units). Consider the deep neural network described in, and with the assumptions of, Theorem 3.1. The covariance between hidden units of the same layer is non-negative. Moreover, for any two $\ell$-th hidden layer units $h^{(\ell)}$ and $\tilde h^{(\ell)}$, it holds that

$$\mathrm{Cov}\Big[\big(h^{(\ell)}\big)^s, \big(\tilde h^{(\ell)}\big)^t\Big] \ge 0, \quad \text{where } s, t \in \mathbb{N}.$$

For the first hidden layer, $\ell = 1$, there is equality for all $s$ and $t$.

Proof. A more detailed proof can be found in the supplementary material, Section 3.

Recall the definition of the covariance of random variables $X$ and $Y$:

$$\mathrm{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]. \qquad (8)$$


The proof is based on induction with respect to the hidden layer number.

In the proof we simplify notation: $w^{\ell}_m = W^{\ell}_m$ and $w^{\ell}_{mi} = W^{\ell}_{mi}$ for all $m \in \{1, \dots, H_\ell\}$. If the index $m$ is omitted, then $w^{\ell}$ denotes one of the vectors $w^{\ell}_m$, and $w^{\ell}_i$ is the $i$-th element of the vector $w^{\ell}_m$.

1. First hidden layer. Consider two first hidden layer units $h^{(1)}$ and $\tilde h^{(1)}$. The covariance between the units is equal to zero and the units are Gaussian, since the weight vectors $w^{(1)}$ and $\tilde w^{(1)}$ are drawn from $\mathcal{N}(0, \sigma_w^2)$ and are independent. Thus, the first hidden layer units are independent and their covariance (8) equals 0. Moreover, since $h^{(1)}$ and $\tilde h^{(1)}$ are independent, $\big(h^{(1)}\big)^s$ and $\big(\tilde h^{(1)}\big)^t$ are also independent.

2. Next hidden layers. Assume that the $(\ell-1)$-th hidden layer has $H_{\ell-1}$ hidden units, where $\ell > 1$. Then an $\ell$-th hidden layer pre-nonlinearity is equal to

$$g^{(\ell)} = \sum_{i=1}^{H_{\ell-1}} w^{(\ell)}_i h^{(\ell-1)}_i. \qquad (9)$$

We want to prove that the covariance (8) between the $\ell$-th hidden layer pre-nonlinearities is non-negative. We first show the idea of the proof in the case $H_{\ell-1} = 1$, and then briefly describe the proof for any finite $H_{\ell-1} > 1$.

2.1. One hidden unit. In the case $H_{\ell-1} = 1$, the covariance (8) has the same sign as the expression

$$\mathbb{E}\Big[\big(h^{(\ell-1)}\big)^{2(s_1+t_1)}\Big] - \mathbb{E}\Big[\big(h^{(\ell-1)}\big)^{2s_1}\Big]\, \mathbb{E}\Big[\big(h^{(\ell-1)}\big)^{2t_1}\Big],$$

since the weights are zero-mean distributed, so their odd-order moments are equal to zero. According to Jensen's inequality for a convex function $f$, we have $\mathbb{E}[f(x_1, x_2)] \ge f(\mathbb{E}[x_1], \mathbb{E}[x_2])$. Since the function $f(x_1, x_2) = x_1 x_2$ is convex for $x_1 \ge 0$ and $x_2 \ge 0$, taking $x_1 = \big(h^{(\ell-1)}\big)^{2s_1}$ and $x_2 = \big(h^{(\ell-1)}\big)^{2t_1}$ yields the required condition (10).

2.2. $H$ hidden units. Now let us consider the covariance between pre-nonlinearities (9) for $H_{\ell-1} = H > 1$. Raise the sum in the brackets to the power $s$,

$$\left(\sum_{i=1}^{H} w^{(\ell)}_i h^{(\ell-1)}_i\right)^s = \sum_{s_H=0}^{s} C_s^{s_H} \left(w^{(\ell)}_H h^{(\ell-1)}_H\right)^{s_H} \left(\sum_{i=1}^{H-1} w^{(\ell)}_i h^{(\ell-1)}_i\right)^{s-s_H},$$

and proceed in the same way for the second bracket, $\big(\sum_{i=1}^{H} w^{(\ell)}_i h^{(\ell-1)}_i\big)^t$. Notice that the binomial terms will be the same in the minuend and the subtrahend terms of (8). So, in our notation, the covariance can be written in the form

$$\mathrm{Cov}\left[\left(\sum_{i=1}^{H_{\ell-1}} w^{(\ell)}_i h^{(\ell-1)}_i\right)^s, \left(\sum_{i=1}^{H_{\ell-1}} w^{(\ell)}_i h^{(\ell-1)}_i\right)^t\right] = \sum\sum C \big(\mathbb{E}[AB] - \mathbb{E}[A]\,\mathbb{E}[B]\big),$$

where the $C$-terms contain binomial coefficients, the $A$-terms are all possible products of hidden units in $\big(g^{(\ell)}\big)^s$, and the $B$-terms are all possible products of hidden units in $\big(g^{(\ell)}\big)^t$. In order for the covariance to be non-negative, it is sufficient to show that the difference $\mathbb{E}[AB] - \mathbb{E}[A]\,\mathbb{E}[B]$ is non-negative. Since the weights are Gaussian and independent, we have the following equations, omitting the superscripts for simplicity:

$$\mathbb{E}[AB] = WW \cdot \mathbb{E}\left[\prod_{i=1}^{H} h_i^{s_i+t_i}\right], \qquad \mathbb{E}[A]\,\mathbb{E}[B] = WW \cdot \mathbb{E}\left[\prod_{i=1}^{H} h_i^{s_i}\right] \mathbb{E}\left[\prod_{i=1}^{H} h_i^{t_i}\right],$$

where $WW$ is the product of weight moments,

$$WW = \prod_{i=1}^{H} \mathbb{E}\big[w_i^{s_i}\big]\, \mathbb{E}\big[w_i^{t_i}\big].$$

For $WW$ not to be equal to zero, all the powers must be even. Now we need to prove

$$\mathbb{E}\left[\prod_{i=1}^{H/2} h_i^{2(s_i+t_i)}\right] \ge \mathbb{E}\left[\prod_{i=1}^{H/2} h_i^{2s_i}\right] \mathbb{E}\left[\prod_{i=1}^{H/2} h_i^{2t_i}\right]. \qquad (10)$$

According to Jensen's inequality for convex functions, since the function $f(x_1, x_2) = x_1 x_2$ is convex for $x_1 \ge 0$ and $x_2 \ge 0$, taking $x_1 = \prod_{i=1}^{H/2} h_i^{2s_i}$ and $x_2 = \prod_{i=1}^{H/2} h_i^{2t_i}$ shows that condition (10) is satisfied.

3. Post-nonlinearities. We show the proof for the ReLU nonlinearity.

The distribution of an $\ell$-th hidden layer pre-nonlinearity $g^{(\ell)}$ is a sum of symmetric random variables, namely products of Gaussian variables $w^{(\ell)}$ and non-negative ReLU outputs, i.e. the $(\ell-1)$-th hidden layer post-nonlinearities $h^{(\ell-1)}$. Therefore, $g^{(\ell)}$ follows a symmetric distribution, and the inequality

$$\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g g'\, p(g, g')\, \mathrm{d}g\, \mathrm{d}g' \ge \int_{-\infty}^{+\infty} g\, p(g)\, \mathrm{d}g \cdot \int_{-\infty}^{+\infty} g'\, p(g')\, \mathrm{d}g'$$


implies the same inequality for the positive part:

$$\int_{0}^{+\infty} \int_{0}^{+\infty} g g'\, p(g, g')\, \mathrm{d}g\, \mathrm{d}g' \ge \int_{0}^{+\infty} g\, p(g)\, \mathrm{d}g \cdot \int_{0}^{+\infty} g'\, p(g')\, \mathrm{d}g'.$$

Notice that the integrals restricted to the positive half-line correspond to the ReLU function output, and for a symmetric distribution we have

$$\int_{0}^{+\infty} x\, p(x)\, \mathrm{d}x = \frac{1}{2} \mathbb{E}[|X|]. \qquad (11)$$

This means that if the non-negative covariance is proven for pre-nonlinearities, it is also non-negative for post-nonlinearities. We omit the proof for the other nonlinearities satisfying the extended envelope property: instead of the exact equation (11), one uses the asymptotic moment equivalence for the positive part and, for the negative part, exact expectation expressions which depend on the particular nonlinearity.

3.4. Convolutional neural networks

Convolutional neural networks (Fukushima & Miyake, 1982; LeCun et al., 1998) are a particular kind of neural network for processing data that has a known grid-like topology, which allows certain properties to be encoded into the architecture. These properties make the forward function more efficient to implement and vastly reduce the number of parameters in the neural network. Neurons in such networks are arranged in three dimensions: width, height and depth. There are three main types of layers that can be concatenated in these architectures: convolutional, pooling, and fully-connected layers (exactly as in standard NNs). A convolutional layer computes dot products between a region of the inputs and its weights; therefore, each region can be considered as a particular case of a fully-connected layer. Pooling layers control overfitting and computation in deep architectures. They operate independently on every slice of the input and reduce it spatially. The functions most commonly used in pooling layers are max pooling and average pooling.

Proposition 3.1. The operations of 1. max pooling and 2. averaging do not modify the optimal tail parameter $\theta$ of the sub-Weibull family. Consequently, the result of Theorem 3.1 carries over to convolutional neural networks.

The proof can be found in the supplementary material.

Corollary 3.1. Consider a convolutional neural network containing convolutional, pooling and fully-connected layers, under the assumptions of Section 3.1. Then a unit of the $\ell$-th hidden layer has a sub-Weibull distribution with optimal tail parameter $\theta = \ell/2$, where $\ell$ is the number of convolutional and fully-connected layers.

Proof. Proposition 3.1 implies that pooling layers preserve the tail parameter. From the discussion at the beginning of the section, the result of Theorem 3.1 also applies to convolutional neural networks, where the depth is taken to be the number of convolutional and fully-connected layers.

4. Regularization scheme on the units

Our main theoretical contribution, Theorem 3.1, characterizes the marginal prior distribution of the network units as follows: when the depth increases, the distribution becomes more heavy-tailed. In this section, we provide an interpretation of the result in terms of regularization at the level of the units. To this end, we first briefly recall shrinkage and penalized estimation methods.

4.1. Short digest on penalized estimation

The notion of penalized estimation is probably best illustrated on the simple linear regression model, where the aim is to improve prediction accuracy by shrinking, or even setting exactly to zero, some coefficients of the regression. Under these circumstances, inference is also more interpretable since, by reducing the number of coefficients effectively used in the model, it is possible to grasp its salient features. Shrinking is performed by imposing a penalty on the size of the coefficients, which is equivalent to allowing for a given budget on their size. Denote the regression parameter by $\beta \in \mathbb{R}^p$, the regression sum-of-squares by $R(\beta)$, and the penalty by $\lambda L(\beta)$, where $L$ is some norm on $\mathbb{R}^p$ and $\lambda$ some positive tuning parameter. Then the two formulations of the regularized problem,

$$\min_{\beta \in \mathbb{R}^p} R(\beta) + \lambda L(\beta), \quad \text{and} \quad \min_{\beta \in \mathbb{R}^p} R(\beta) \ \text{subject to} \ L(\beta) \le t,$$

are equivalent, with a one-to-one correspondence between $\lambda$ and $t$, and are respectively termed the penalty and the constraint formulations. The latter formulation provides an interesting geometrical intuition for the shrinkage mechanism: the constraint $L(\beta) \le t$ reads as imposing a total budget of $t$ on the parameter size in terms of the norm $L$. If the ordinary least squares estimator $\hat\beta_{\mathrm{ols}}$ lies inside the $L$-ball with surface $L(\beta) = t$, then there is no effect on the estimation. In contrast, when $\hat\beta_{\mathrm{ols}}$ is outside the ball, the intersection of the lowest level curve of the sum-of-squares $R(\beta)$ with the $L$-ball defines the penalized estimator.
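As a concrete illustration of the penalty formulation with $L = \|\cdot\|_2^2$ (ridge regression), the following minimal Python sketch (illustrative only; the synthetic data and the function name ridge_estimator are ours) computes the penalized estimator in closed form and shows that increasing $\lambda$ shrinks the size of the coefficients:

    import numpy as np

    def ridge_estimator(X, y, lam):
        """Closed-form ridge solution: argmin_beta ||y - X beta||^2 + lam * ||beta||^2."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

    for lam in (0.0, 1.0, 10.0, 100.0):
        beta = ridge_estimator(X, y, lam)
        print(lam, round(float(np.linalg.norm(beta)), 3))   # the L2 budget shrinks as lam grows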

The choice of the norm $L$ has considerable effects on the problem, as can be sensed geometrically. Consider for instance $L_q$ norms (quasi-norms for $q < 1$), with $q \ge 0$. For any $q > 1$, the associated $L_q$ norm is differentiable and its contours have a round shape without sharp angles. In that case, the effect of the penalty is to shrink the $\beta$ coefficients towards 0. The most well-known estimator falling in this class is ridge regression, obtained with $q = 2$; see Figure 2, top-left panel. In contrast, for any $q \in (0, 1]$, the $L_q$ norm has non-differentiable points along the coordinate axes; see Figure 2, top-right and bottom panels. Such critical points are more likely to be hit by the level curves of the sum-of-squares $R(\beta)$, thus setting some of the parameters exactly to zero. A very successful approach in this class is the Lasso, obtained with $q = 1$. Note that the problem is computationally much easier in the convex situation, which occurs only for $q \ge 1$.

[Figure 2. $L_{2/\ell}$-norm unit balls (in dimension 2, axes $U_1$ and $U_2$) for layers $\ell = 1, 2, 3$ and $10$.]

4.2. MAP on weights W is weight decay

These penalized methods have a simple Bayesian counterpart in the form of the maximum a posteriori (MAP) estimator. In this context, the objective function $R$ is the negative log-likelihood, while the penalty $L$ is the negative log-prior. The objective function takes the form of the sum of squared errors for regression under Gaussian errors, and of the cross-entropy for classification.

For neural networks, it is well known that an independent Gaussian prior on the weights,

$$\pi(W) \propto \prod_{\ell=1}^{L} \prod_{i,j} e^{-\frac{1}{2} \big(W^{(\ell)}_{i,j}\big)^2}, \qquad (12)$$

is equivalent to the weight decay penalty, also known as ridge regression:

$$L(W) = \sum_{\ell=1}^{L} \sum_{i,j} \big(W^{(\ell)}_{i,j}\big)^2 = \|W\|_2^2, \qquad (13)$$

where the products in (12) and the sums in (13) involving $i$ and $j$ are over $1 \le i \le H_{\ell-1}$ and $1 \le j \le H_\ell$, with $H_0$ and $H_L$ representing respectively the input and output dimensions.

4.3. MAP on units U

Now, moving the point of view from weights to units leads to a radically different shrinkage effect. Let $U^{(\ell)}_m$ denote the $m$-th unit of the $\ell$-th layer (either pre- or post-nonlinearity). We prove in Theorem 3.1 that, conditional on the input $x$, a Gaussian prior on the weights translates into a prior on the units $U^{(\ell)}_m$ that is marginally sub-Weibull with optimal tail index $\theta = \ell/2$. This means that the tails of $U^{(\ell)}_m$ satisfy

$$\mathbb{P}\big(|U^{(\ell)}_m| \ge u\big) \le \exp\big(-u^{2/\ell}/K_1\big) \quad \text{for all } u \ge 0, \qquad (14)$$

for some positive constant $K_1$. The exponent of $u$ in the exponential term above is optimal, in the sense that Equation (14) is not satisfied with a parameter $\theta'$ smaller than $\ell/2$. Thus, the marginal density of $U^{(\ell)}_m$ on $\mathbb{R}$ is approximately proportional to

$$\pi^{(\ell)}_m(u) \approx e^{-|u|^{2/\ell}/K_1}. \qquad (15)$$

The joint prior distribution of all the units $U = (U^{(\ell)}_m)_{1 \le \ell \le L,\, 1 \le m \le H_\ell}$ can be expressed from the marginal distributions by Sklar's representation theorem (Sklar, 1959) as

$$\pi(U) = \prod_{\ell=1}^{L} \prod_{m=1}^{H_\ell} \pi^{(\ell)}_m\big(U^{(\ell)}_m\big)\, C(F(U)), \qquad (16)$$

where $C$ represents the copula of $U$ (which characterizes all the dependence between the units) while $F$ denotes its cumulative distribution function. The penalty incurred by such a prior distribution is obtained as the negative log-prior,

$$L(U) = -\sum_{\ell=1}^{L} \sum_{m=1}^{H_\ell} \log \pi^{(\ell)}_m\big(U^{(\ell)}_m\big) - \log C(F(U))$$
$$\overset{(a)}{\approx} \sum_{\ell=1}^{L} \sum_{m=1}^{H_\ell} \big|U^{(\ell)}_m\big|^{2/\ell} - \log C(F(U))$$
$$\approx \|U^{(1)}\|_2^2 + \|U^{(2)}\|_1 + \cdots + \|U^{(L)}\|_{2/L}^{2/L} - \log C(F(U)), \qquad (17)$$

where (a) comes from (15). The first $L$ terms in (17) indicate that some shrinkage operates at every layer of the network, with a penalty term that approximately takes the form of the $L_{2/\ell}$ norm at layer $\ell$. Thus, the deeper the layer, the stronger the regularization induced at the level of the units, as summarized in Table 1.


Layer  | Penalty on $W$              | Approximate penalty on $U$
1      | $\|W^{(1)}\|_2^2$, $L_2$    | $\|U^{(1)}\|_2^2$, $L_2$ (weight decay)
2      | $\|W^{(2)}\|_2^2$, $L_2$    | $\|U^{(2)}\|_1$, $L_1$ (Lasso)
$\ell$ | $\|W^{(\ell)}\|_2^2$, $L_2$ | $\|U^{(\ell)}\|_{2/\ell}^{2/\ell}$, $L_{2/\ell}$

Table 1. Comparison of Bayesian neural network penalties on weights $W$ and units $U$.
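To make the two columns of Table 1 concrete, the following minimal Python sketch (an illustration, not the paper's code; the small architecture is arbitrary and biases are omitted) evaluates the weight penalty $\|W\|_2^2$ of (13) and the approximate layer-wise unit penalties $\sum_m |U^{(\ell)}_m|^{2/\ell}$ of (17) for a single draw from the prior:

    import numpy as np

    rng = np.random.default_rng(0)
    widths = [50, 30, 20, 10]                    # input dimension followed by three hidden layers
    sigma_w = 1.0                                # fixed prior variance, as in (2)
    x = rng.normal(size=widths[0])

    weights, pre_units = [], []
    h = x
    for ell in range(1, len(widths)):
        W = rng.normal(0.0, sigma_w, size=(widths[ell], widths[ell - 1]))
        g = W @ h                                # pre-nonlinearities, Equation (1)
        h = np.maximum(g, 0.0)                   # ReLU post-nonlinearities
        weights.append(W)
        pre_units.append(g)

    weight_penalty = sum(np.sum(W ** 2) for W in weights)            # ||W||_2^2, Equation (13)
    unit_penalties = {ell: float(np.sum(np.abs(g) ** (2.0 / ell)))   # approximate L_{2/ell} penalty, (17)
                      for ell, g in enumerate(pre_units, start=1)}
    print(round(float(weight_penalty), 1), unit_penalties)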

5. Experiments

We illustrate the result of Theorem 3.1 on a 100-layer MLP. The hidden layers of the neural network have $H_1 = 1000$, $H_2 = 990$, $H_3 = 980$, ..., $H_\ell = 1000 - 10(\ell-1)$, ..., $H_{100} = 10$ hidden units, respectively. The input $x$ is a vector of features from $\mathbb{R}^{10^4}$. Figure 3 represents the tails of the marginal prior distributions of pre-nonlinearities in the first three, 10th and 100th hidden layers, on a logarithmic scale. Units of a given layer have the same sub-Weibull distribution since they share the same input and the same prior on the corresponding weights. The curves are obtained as histograms from a sample of size $10^5$ from the prior on the pre-nonlinearities, which is itself obtained by sampling $10^5$ sets of weights $W$ from the Gaussian prior (2) and forward-propagating via (1). The input vector $x$ is sampled with independent features from a standard normal distribution, once and for all at the start. The nonlinearity $\phi$ is the ReLU function. Being linear combinations involving symmetric weights $W$, the pre-nonlinearities $g$ also have a symmetric distribution, thus we visualize only their distribution on $\mathbb{R}_+$.
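A scaled-down sketch of this simulation is given below (our reconstruction, not the authors' released code; the widths, depth and sample size are reduced so that it runs in a few seconds). It draws weights from the Gaussian prior (2), forward-propagates the fixed input via (1) with ReLU, and reports a crude scale-free tail-heaviness diagnostic, the ratio of the 99% to the 50% quantile of $|g|$, for one pre-nonlinearity per tracked layer:

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, sigma_w, n_layers = 2000, 1.0, 20
    widths = [50] + [50 - 2 * (ell - 1) for ell in range(1, n_layers + 1)]  # decreasing widths
    x = rng.normal(size=widths[0])               # fixed input with standard normal features

    h = np.tile(x, (n_samples, 1))               # one hidden state per prior draw
    diag = {}
    for ell in range(1, n_layers + 1):
        W = rng.normal(0.0, sigma_w, size=(n_samples, widths[ell], widths[ell - 1]))  # prior (2)
        g = np.einsum('nij,nj->ni', W, h)        # pre-nonlinearities, Equation (1), biases omitted
        h = np.maximum(g, 0.0)                   # ReLU
        if ell in (1, 2, 3, 10, 20):
            a = np.abs(g[:, 0])                  # prior sample of one unit of layer ell
            diag[ell] = float(np.quantile(a, 0.99) / np.quantile(a, 0.5))

    print(diag)   # the ratio tends to grow with depth, reflecting increasingly heavy tails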

Figure 3 corroborates our main result. On the one hand, the prior distribution of the first hidden layer units is Gaussian (green curve), which corresponds to a subW(1/2) distribution. On the other hand, deeper layers are characterized by heavier-tailed distributions. The deepest considered layer (the 100th, violet curve) has an extremely flat distribution, which corresponds to a subW(50) distribution.

[Figure 3. Illustration of the marginal prior distributions of hidden units (pre-nonlinearities) for layers $\ell = 1, 2, 3, 10$ and $100$, plotted as $\log \mathbb{P}(X \ge x)$ against $x$. They correspond respectively to subW(1/2), subW(1), subW(3/2), subW(5) and subW(50).]

6. Conclusion and future work

Despite the ubiquity of deep learning throughout science, medicine and engineering, the underlying theory has not kept pace with its applications. In this paper, we have extended the state of knowledge on Bayesian neural networks by providing a characterization of the marginal prior distribution of the units. Matthews et al. (2018a) and Lee et al. (2018) proved that unit distributions have a Gaussian process limit in the wide regime, i.e. when the number of hidden units tends to infinity. We showed that units become heavier-tailed as depth increases, and discussed this result in terms of a regularizing mechanism at the level of the units. We anticipate that the Gaussian process limit of sub-Weibull distributions in a given layer for increasing width could also be recovered through a modification of the central limit theorem for heavy-tailed distributions; see Kuchibhotla & Chakrabortty (2018).

Since initialization and learning dynamics are key to properly tuning deep learning algorithms, good implementation practice requires a proper understanding of the prior distribution at play and of the regularization it incurs.

We hope that our results will open avenues for further research. Firstly, Theorem 3.1 regards the marginal prior distribution of the units, while a full characterization of the joint distribution of all units $U$ remains an open question. More specifically, a precise description of the copula defined in Equation (16) would provide valuable information about the dependence between the units, and also about the precise geometrical structure of the balls induced by that penalty. Secondly, the interpretation of our result (Section 4) is concerned with the maximum a posteriori of the units, which is a point estimator. One of the benefits of the Bayesian approach to neural networks lies in its ability to provide a principled approach to uncertainty quantification, so an interpretation of our result in terms of the full posterior distribution would be very appealing. Lastly, the practical potential of our results is considerable: a better understanding of the regularizing mechanisms in deep neural networks will contribute to designing and understanding strategies to avoid overfitting and improve generalization.


Acknowledgements

We would like to thank Stéphane Girard for fruitful discussions on Weibull-like distributions and Cédric Févotte for pointing out the potential relationship of our heavy-tail result with sparsity-inducing priors.

References

Ba, J., Kiros, J., and Hinton, G. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bibi, A., Alfadly, M., and Ghanem, B. Analytic expressions for probabilistic moments of PL-DNN with Gaussian input. In CVPR, 2018.

Cho, Y. and Saul, L. Kernel methods for deep learning. In NeurIPS, 2009.

Damianou, A. and Lawrence, N. Deep Gaussian processes. In AISTATS, 2013.

Duvenaud, D., Rippel, O., Adams, R., and Ghahramani, Z. Avoiding pathologies in very deep networks. In AISTATS, 2014.

Fukushima, K. and Miyake, S. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets. Springer, 1982.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.

Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press, 2016.

Graves, A., Mohamed, A., and Hinton, G. Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649, 2013.

Hayou, S., Doucet, A., and Rousseau, J. On the impact of the activation function on deep neural networks training. arXiv preprint arXiv:1902.06853, 2019.

Hron, J., Matthews, A., and Ghahramani, Z. Variational Bayesian dropout: pitfalls and fixes. In ICML, 2018.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Kingma, D., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In NeurIPS, 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

Krogh, A. and Hertz, J. A simple weight decay can improve generalization. In NeurIPS, 1991.

Kuchibhotla, A. K. and Chakrabortty, A. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605, 2018.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as Gaussian processes. In ICLR, 2018.

Liang, F., Li, Q., and Zhou, L. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.

MacKay, D. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018a.

Matthews, A., Rowland, M., Hron, J., Turner, R., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In ICLR, 2018b.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In ICML, 2017.

Neal, R. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical Report CRG-TR-92-1, University of Toronto, 1992.

Neal, R. Bayesian learning for neural networks. Springer, 1996.

Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Hron, J., Abolafia, D., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. In ICLR, 2019.

Polson, N. G. and Sokolov, V. Deep learning: A Bayesian perspective. Bayesian Analysis, 12(4):1275–1304, 2017.

Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Saatci, Y. and Wilson, A. Bayesian GAN. In NeurIPS, 2017.

Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

Vladimirova, M. and Arbel, J. Sub-Weibull distributions: generalizing sub-Gaussian and sub-exponential properties to heavier-tailed distributions. Preprint, 2019.

A. Additional technical results

Proof of Lemma 3.1.

Proof. According to the definition of asymptotic equivalence, there must exist positive constants $d$ and $D$ such that for all $k \in \mathbb{N}$ it holds that

$$d \le \|\phi(X)\|_k / \|X\|_k \le D. \qquad (18)$$

The upper bound of the extended envelope property and the triangle inequality for norms imply the right-hand side of (18), since

$$\|\phi(X)\|_k \le \|c_2 + d_2 |X|\|_k \le c_2 + d_2 \|X\|_k.$$

Assume that $|\phi(u)| \ge c_1 + d_1 |u|$ for $u \in \mathbb{R}_+$. Consider the lower bound of the nonlinearity moments,

$$\|\phi(X)\|_k \ge \|d_1 u_+\|_k + c_1 + \|\phi(u_-)\|_k,$$

where $u_-$ and $u_+$ denote the negative and positive parts of the argument. For negative $u$ there are constants $c_1 \ge 0$ and $d_1 > 0$ such that $c_1 - d_1 u > |\phi(u)|$, that is, $c_1 > |\phi(u)| + d_1 u$, so

$$\|\phi(X)\|_k > \|d_1 u_+\|_k + \||\phi(u_-)| + d_1 u_-\|_k \ge d_1 \|X\|_k.$$

This yields the asymptotic equivalence (5).

Proof of Proposition 3.1.

Proof. Let $X_i \sim \mathrm{subW}(\theta)$ for $1 \le i \le N$ be the units of one region to which the pooling operation is applied. Using Definition 3.3, for all $x \ge 0$ and some constant $K > 0$ we have

$$\mathbb{P}(|X_i| \ge x) \le \exp\big(-x^{1/\theta}/K\big) \quad \text{for all } i.$$

The max pooling operation takes the maximum element in the region. Since $X_i$, $1 \le i \le N$, are the elements of one region, we want to check that the tail of $\max_{1 \le i \le N} X_i$ obeys the sub-Weibull property with optimal tail parameter equal to $\theta$. Since the max pooling operation can be decomposed into linear and ReLU operations, which do not alter the distribution tail (Lemma 3.1), the first part of the proposition follows.

Summation and division by a constant do not influence the distribution tail, yielding the proposition result regarding the averaging operation.
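The decomposition of max pooling into linear and ReLU operations used above can be made explicit: for two inputs, $\max(a, b) = a + \mathrm{ReLU}(b - a)$, and a pool of $N$ elements follows by iterating this identity. A minimal Python check (illustrative only):

    import numpy as np

    def max_via_relu(a, b):
        # max(a, b) written as one linear combination followed by a ReLU,
        # i.e. the elementary operations referred to in the proof above
        return a + np.maximum(b - a, 0.0)

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=1000), rng.normal(size=1000)
    assert np.allclose(max_via_relu(a, b), np.maximum(a, b))

    # a pool of N elements is obtained by iterating the two-element identity
    pool = rng.normal(size=(1000, 4))
    m = pool[:, 0]
    for j in range(1, pool.shape[1]):
        m = max_via_relu(m, pool[:, j])
    assert np.allclose(m, pool.max(axis=1))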

Lemma A.1 (Gaussian moments). Let $X$ be a normal random variable, $X \sim \mathcal{N}(0, \sigma^2)$. Then the following asymptotic equivalence holds:

$$\|X\|_k \asymp \sqrt{k}.$$

Proof. The absolute moments of a centered normal random variable $X$ are equal to

$$\mathbb{E}[|X|^k] = \int_{\mathbb{R}} |x|^k p(x)\, \mathrm{d}x = 2 \int_0^{\infty} x^k p(x)\, \mathrm{d}x = \frac{1}{\sqrt{\pi}}\, \sigma^k 2^{k/2}\, \Gamma\!\left(\frac{k+1}{2}\right). \qquad (19)$$

By the Stirling approximation of the Gamma function,

$$\Gamma(z) = \sqrt{\frac{2\pi}{z}} \left(\frac{z}{e}\right)^z \left(1 + O\!\left(\frac{1}{z}\right)\right). \qquad (20)$$

Substituting (20) into the central normal absolute moment (19), we obtain

$$\mathbb{E}[|X|^k] = \frac{\sigma^k 2^{k/2}}{\sqrt{\pi}} \sqrt{\frac{4\pi}{k+1}} \left(\frac{k+1}{2e}\right)^{\frac{k+1}{2}} \left(1 + O\!\left(\frac{1}{k}\right)\right) = \frac{2\sigma^k}{\sqrt{2e}} \left(\frac{k+1}{e}\right)^{k/2} \left(1 + O\!\left(\frac{1}{k}\right)\right).$$

Then the $k$-th roots of the absolute moments can be written in the form

$$\|X\|_k = \frac{\sigma}{e^{1/(2k)}} \sqrt{\frac{k+1}{e}} \left(1 + O\!\left(\frac{1}{k}\right)\right)^{1/k} = \frac{\sigma}{\sqrt{e}} \frac{\sqrt{k+1}}{e^{1/(2k)}} \left(1 + O\!\left(\frac{1}{k^2}\right)\right) = \frac{\sigma}{\sqrt{e}}\, c_k \sqrt{k+1}.$$

Here the coefficient $c_k$ denotes

$$c_k = \frac{1}{e^{1/(2k)}} \left(1 + O\!\left(\frac{1}{k^2}\right)\right) \to 1$$

as $k \to \infty$. Thus, the asymptotic equivalence

$$\|X\|_k \asymp \sqrt{k+1} \asymp \sqrt{k}$$

holds.
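The closed-form moment (19) can also be checked numerically against the $\sqrt{k}$ rate; the short sketch below (illustrative only) evaluates $\|X\|_k$ for a standard normal variable and shows that the ratio $\|X\|_k / \sqrt{k}$ stabilizes around $1/\sqrt{e} \approx 0.61$, as the equivalence $\|X\|_k \asymp \sqrt{k}$ requires:

    import math

    def gaussian_abs_moment(k, sigma=1.0):
        # E|X|^k for X ~ N(0, sigma^2), the closed form in Equation (19)
        return (sigma ** k) * (2 ** (k / 2)) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)

    for k in (2, 8, 32, 128):
        norm_k = gaussian_abs_moment(k) ** (1.0 / k)                  # ||X||_k
        print(k, round(norm_k, 3), round(norm_k / math.sqrt(k), 3))   # ratio settles near 0.61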

Lemma A.2 (Multiplication moments). Let $W$ and $X$ be independent random variables such that $W \sim \mathcal{N}(0, \sigma^2)$ and, for some $p > 0$,

$$\|X\|_k \asymp k^p. \qquad (21)$$

Let $W_i$ be independent copies of $W$, and let $X_i$ be copies of $X$, $i = 1, \dots, H$, with non-negative covariance between moments of the copies,

$$\mathrm{Cov}\big[X_i^s, X_j^t\big] \ge 0, \quad \text{for } i \ne j,\ s, t \in \mathbb{N}. \qquad (22)$$

Then we have the following asymptotic equivalence:

$$\Big\|\sum_{i=1}^{H} W_i X_i\Big\|_k \asymp k^{p + 1/2}. \qquad (23)$$

Proof. We prove the statement by mathematical induction.

Base case: we show that the statement is true for $H = 1$. For independent variables $W$ and $X$, we have

$$\|WX\|_k = \big(\mathbb{E}[|WX|^k]\big)^{1/k} = \big(\mathbb{E}[|W|^k]\,\mathbb{E}[|X|^k]\big)^{1/k} = \|W\|_k \|X\|_k. \qquad (24)$$

Since the random variable $W$ follows a Gaussian distribution, Lemma A.1 implies

$$\|W\|_k \asymp \sqrt{k}. \qquad (25)$$

Substituting assumption (21) and the weight norm asymptotic equivalence (25) into (24) leads to the desired asymptotic equivalence (23) in the case $H = 1$.

Inductive step: we show that if the statement holds for $H = n-1$, then it also holds for $H = n$.

Suppose that for $H = n - 1$ we have

$$\Big\|\sum_{i=1}^{n-1} W_i X_i\Big\|_k \asymp k^{p+1/2}. \qquad (26)$$

Then, according to the covariance assumption (22), for $H = n$ we get

$$\Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k^k = \Big\|\sum_{i=1}^{n-1} W_i X_i + W_n X_n\Big\|_k^k \qquad (27)$$
$$\ge \sum_{j=0}^{k} C_k^j \Big\|\sum_{i=1}^{n-1} W_i X_i\Big\|_j^j\, \big\|W_n X_n\big\|_{k-j}^{k-j}. \qquad (28)$$

Using the equivalence definition (Definition 3.2), from the induction assumption (26), for all $j = 0, \dots, k$ there exists an absolute constant $d_1 > 0$ such that

$$\Big\|\sum_{i=1}^{n-1} W_i X_i\Big\|_j^j \ge \big(d_1\, j^{p+1/2}\big)^j. \qquad (29)$$

Recalling the equivalence results from the base case, there exists a constant $d_2 > 0$ such that

$$\big\|W_n X_n\big\|_{k-j}^{k-j} \ge \big(d_2\, (k-j)^{p+1/2}\big)^{k-j}. \qquad (30)$$

Substituting the obtained bounds (29) and (30) into equation (27), with $d = \min\{d_1, d_2\}$, we obtain

$$\Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k^k \ge d^k \sum_{j=0}^{k} C_k^j \big[j^j (k-j)^{k-j}\big]^{p+1/2} = d^k\, k^{k(p+1/2)} \sum_{j=0}^{k} C_k^j \left[\Big(\frac{j}{k}\Big)^j \Big(1 - \frac{j}{k}\Big)^{k-j}\right]^{p+1/2}. \qquad (31)$$

Notice the following lower bound:

$$\sum_{j=0}^{k} C_k^j \left[\Big(\frac{j}{k}\Big)^j \Big(1 - \frac{j}{k}\Big)^{k-j}\right]^{p+1/2} \ge \sum_{j=0}^{k} \left[\Big(\frac{j}{k}\Big)^j \Big(1 - \frac{j}{k}\Big)^{k-j}\right]^{p+1/2} \ge 2. \qquad (32)$$

Substituting the lower bound (32) into (31), we get

$$\Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k^k \ge 2\, d^k\, k^{k(p+1/2)} > d^k\, k^{k(p+1/2)}. \qquad (33)$$

Now we prove the upper bound. For random variables $Y$ and $Z$, Hölder's inequality gives

$$\|YZ\|_1 = \mathbb{E}[|YZ|] \le \big(\mathbb{E}[|Y|^2]\, \mathbb{E}[|Z|^2]\big)^{1/2} = \|Y\|_2 \|Z\|_2.$$

Hölder's inequality leads to the following inequality for the $L_k$ norm:

$$\|YZ\|_k^k \le \|Y\|_{2k}^k\, \|Z\|_{2k}^k. \qquad (34)$$

We obtain the upper bound of $\big\|\sum_{i=1}^{n} W_i X_i\big\|_k^k$ from the norm property (34), applied to the random variables $Y = \big(\sum_{i=1}^{n-1} W_i X_i\big)^{j}$ and $Z = (W_n X_n)^{k-j}$:

$$\Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k^k = \Big\|\sum_{i=1}^{n-1} W_i X_i + W_n X_n\Big\|_k^k \qquad (35)$$
$$\le \sum_{j=0}^{k} C_k^j \Big\|\sum_{i=1}^{n-1} W_i X_i\Big\|_{2j}^{j}\, \big\|W_n X_n\big\|_{2(k-j)}^{k-j}. \qquad (36)$$


From the induction assumption (26), for all $j = 0, \dots, k$ there exists an absolute constant $D_1 > 0$ such that

$$\Big\|\sum_{i=1}^{n-1} W_i X_i\Big\|_{2j}^{j} \le \big(D_1\, (2j)^{p+1/2}\big)^{j}. \qquad (37)$$

Recalling the equivalence results from the base case, there exists a constant $D_2 > 0$ such that

$$\big\|W_n X_n\big\|_{2(k-j)}^{k-j} \le \Big(D_2 \big(2(k-j)\big)^{p+1/2}\Big)^{k-j}. \qquad (38)$$

Substituting the obtained bounds (37) and (38) into equation (35), with $D = \max\{D_1, D_2\}$, we obtain

$$\Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k^k \le D^k \sum_{j=0}^{k} C_k^j \Big[(2j)^j \big(2(k-j)\big)^{k-j}\Big]^{p+1/2}.$$

We now find an upper bound for $\big[\big(1 - \frac{j}{k}\big)^{k-j} \big(\frac{j}{k}\big)^{j}\big]^{p+1/2}$. Since the expressions $\big(1 - \frac{j}{k}\big)$ and $\big(\frac{j}{k}\big)$ are at most 1, it holds that $\big[\big(1 - \frac{j}{k}\big)^{k-j} \big(\frac{j}{k}\big)^{j}\big]^{p+1/2} \le 1$ for all $p > 0$. For the sum of binomial coefficients we have $\sum_{j=0}^{k} C_k^j \le 2^k$. So the final upper bound is

$$\Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k^k \le 2^k D^k (2k)^{k(p+1/2)}. \qquad (39)$$

Hence, taking the $k$-th root of (33) and (39), we have upper and lower bounds which imply the equivalence for $H = n$, establishing the inductive step:

$$d'\, k^{p+1/2} \le \Big\|\sum_{i=1}^{n} W_i X_i\Big\|_k \le D'\, k^{p+1/2},$$

where $d' = d$ and $D' = 2^{p+3/2} D$. Since both the base case and the inductive step have been carried out, by mathematical induction the equivalence holds for all $H \in \mathbb{N}$:

$$\Big\|\sum_{i=1}^{H} W_i X_i\Big\|_k \asymp k^{p+1/2}.$$