
Unsupervised Models of Images by Spike-and-Slab RBMs

Aaron Courville [email protected] · James Bergstra [email protected] · Yoshua Bengio [email protected]

DIRO, Université de Montréal, Montréal, Québec, Canada

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

The spike and slab Restricted Boltzmann Machine (RBM) is defined by having both a real valued "slab" variable and a binary "spike" variable associated with each unit in the hidden layer. In this paper we generalize and extend the spike and slab RBM to include non-zero means of the conditional distribution over the observed variables given the binary spike variables. We also introduce a term, quadratic in the observed data, that we exploit to guarantee that all conditionals associated with the model are well defined, a guarantee that was absent in the original spike and slab RBM. The inclusion of these generalizations improves the performance of the spike and slab RBM as a feature learner and achieves competitive performance on the CIFAR-10 image classification task. The spike and slab model, when trained in a convolutional configuration, can generate sensible samples that demonstrate that the model has captured the broad statistical structure of natural images.

1. Introduction

Recently, there has been considerable interest in the problem of unsupervised learning of features for supervised tasks in natural image domains. Approaches based on unsupervised pretraining followed by either whole-model finetuning or simply the linear classification of features dominate in benchmark tasks such as CIFAR-10 (Krizhevsky, 2009). One of the most popular energy-based modelling paradigms for unsupervised feature learning is the Restricted Boltzmann Machine (RBM). An RBM is a Markov random field with a bipartite graph structure consisting of a visible layer and a hidden layer. The bipartite structure excludes connections between the variables within each layer, so that the latent variables are conditionally independent given the visible variables and vice versa. The factorial nature of these conditional distributions enables efficient Gibbs sampling, which forms the basis of the most widespread RBM learning algorithms such as contrastive divergence (Hinton, 2002) and stochastic maximum likelihood (Tieleman, 2008).

The spike and slab RBM (ssRBM) (Courville et al., 2011) departs from other similar RBM-based models in the way the hidden layer latent units are defined. They are modelled as the element-wise product of a real valued vector with a binary vector, i.e., each hidden unit is associated with a binary spike variable and a real-valued slab variable. The name spike and slab is inspired by terminology in the statistics literature (Mitchell & Beauchamp, 1988), where the term refers to a prior consisting of a mixture between two components: the spike, a discrete probability mass at zero; and the slab, a density (typically uniformly distributed) over a continuous domain.

In this paper, we introduce a generalization of the ssRBM model, which we refer to as the µ-ssRBM. Relative to the original ssRBM, the µ-ssRBM includes additional terms in the energy function which give extra modelling capacity. One of the additional terms allows the model to form a conditional distribution of the spike variables (by marginalizing out the slab variables, given an observation) that is similar to the corresponding conditional of the recently introduced mean covariance RBM (mcRBM) (Ranzato & Hinton, 2010) and mPoT model (Ranzato et al., 2010). Conditional on both the observed and spike variables, the µ-ssRBM slab variables and input are jointly Gaussian with diagonal covariance matrix; conditional on both the spike and slab variables, the observations are Gaussian with diagonal covariance. Thus, unlike the mcRBM or the more recent mPoT model, the µ-ssRBM is amenable to simple and efficient Gibbs sampling. This property of the ssRBM makes the model an excellent candidate as a building block for the development of more sophisticated models such as Deep Boltzmann Machines (Salakhutdinov & Hinton, 2009).

One potential drawback of the ssRBM is the lack of a guarantee that the resulting model constitutes a valid density over the whole real-valued data space. In this paper, we develop several strategies that guarantee all conditionals are well defined by adding energy terms to the µ-ssRBM. However, experimentally we find that loosening the constraint yields better models.

2. The µ-Spike-and-Slab RBM

The µ-ssRBM describes the interaction between three random vectors: the visible vector $v$ representing the observed data, the binary "spike" variables $h$, and the real-valued "slab" variables $s$. The $i$th hidden unit is associated both with an element $h_i$ of the binary vector and with an element $s_i$ of the real-valued variable. Suppose there are $N$ hidden units: $h \in \{0,1\}^N$, $s \in \mathbb{R}^N$, and a visible vector of dimension $D$: $v \in \mathbb{R}^D$. The µ-ssRBM model is defined via the energy function:

$$E(v,s,h) = -\sum_{i=1}^{N} v^T W_i s_i h_i + \frac{1}{2} v^T \Big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \Big) v + \frac{1}{2} \sum_{i=1}^{N} \alpha_i s_i^2 - \sum_{i=1}^{N} \alpha_i \mu_i s_i h_i - \sum_{i=1}^{N} b_i h_i + \sum_{i=1}^{N} \alpha_i \mu_i^2 h_i \qquad (1)$$

in which $W_i$ denotes the $i$th weight vector ($W_i \in \mathbb{R}^D$), each $b_i$ is a scalar bias associated with $h_i$, each $\alpha_i$ is a scalar that penalizes large values of $s_i^2$, and $\Lambda$ is a diagonal matrix that penalizes large values of $\|v\|_2^2$. In comparison with the original ssRBM (Courville et al., 2011), the µ-ssRBM energy function includes three additional terms. First, the $\frac{1}{2} v^T \big( \sum_{i=1}^{N} \Phi_i h_i \big) v$ term, with non-negative diagonal matrices $\Phi_i$, $i \in [1,N]$, establishes an $h$-dependent quadratic penalty on $v$. Second, associated with each slab variable is a mean parameter $\mu_i$, from which the µ-ssRBM takes its name. Finally, the $\sum_{i=1}^{N} \alpha_i \mu_i^2 h_i$ term acts as an additional bias term for the $h_i$, which we include to simplify the parametrization of the conditionals. In addition to offering additional flexibility to model the statistics of natural images, the inclusion of the parameters $\mu = [\mu_1, \dots, \mu_N]$ and $\Phi = [\Phi_1, \dots, \Phi_N]$ also allows us to derive constraints on the model that ensure that the model remains well-behaved over the entire data domain of $\mathbb{R}^D$.
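To make the roles of the parameters concrete, the following is a minimal NumPy sketch of Eqn. 1. The shapes, and the convention of storing the diagonal matrices $\Lambda$ and $\Phi_i$ as vectors, are our assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def energy(v, s, h, W, Lam, Phi, alpha, mu, b):
    """E(v, s, h) of Eqn. 1 for one configuration.

    v: (D,) visible vector          W: (D, N) weight vectors W_i as columns
    s: (N,) slab variables          Lam: (D,) diagonal of Lambda
    h: (N,) binary spike variables  Phi: (D, N) diagonals of the Phi_i
    alpha, mu, b: (N,) per-unit slab precision, slab mean, spike bias
    """
    vW = v @ W                                  # v^T W_i for all i
    interaction = -np.sum(vW * s * h)           # -sum_i v^T W_i s_i h_i
    quad_v = 0.5 * v @ ((Lam + Phi @ h) * v)    # 1/2 v^T (Lambda + sum_i Phi_i h_i) v
    slab = 0.5 * np.sum(alpha * s ** 2)         # 1/2 sum_i alpha_i s_i^2
    slab_mean = -np.sum(alpha * mu * s * h)     # -sum_i alpha_i mu_i s_i h_i
    bias = -np.sum(b * h) + np.sum(alpha * mu ** 2 * h)
    return interaction + quad_v + slab + slab_mean + bias
```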

The joint probability distribution over $v$, $s = [s_1, \dots, s_N]$ and $h = [h_1, \dots, h_N]$ is defined as:

$$p(v,s,h) = \frac{1}{Z} \exp\{-E(v,s,h)\} \qquad (2)$$

where $Z$ is the normalizing partition function. We can think of the distribution presented in Eqns. 1 and 2 as associated with the standard RBM bipartite graph structure, with the distinction that the hidden layer is composed of an element-wise product of the random vectors $s$ and $h$.

To gain insight into the µ-ssRBM model, we will look at the conditional distributions $p(v \mid s,h)$, $p(s \mid v,h)$, $P(h \mid v)$ and $p(v \mid h)$. First, we consider the conditional $p(v \mid s,h)$:

$$p(v \mid s,h) = \frac{1}{p(s,h)} \frac{1}{Z} \exp\{-E(v,s,h)\} = \mathcal{N}\Big( C_{v|s,h} \sum_{i=1}^{N} W_i s_i h_i \,,\; C_{v|s,h} \Big) \qquad (3)$$

where $C_{v|s,h} = \big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \big)^{-1}$. The conditional distribution of $v$ given both $s$ and $h$ is a Gaussian with mean $C_{v|s,h} \sum_{i=1}^{N} W_i s_i h_i$ and covariance $C_{v|s,h}$. Since $\Lambda$ and the $\Phi_i$ are diagonal ($\forall i \in [1,N]$), the covariance matrix of $p(v \mid s,h)$ is also diagonal. Eqn. 3 shows the role played by the $\Phi_i$ in augmenting the precision with the activation of $h_i$. Indeed, hidden unit $i$ contributes a component not only to the mean proportional to $W_i s_i$, but also to the global scaling of the conditional mean.

The conditional over the slab variables $s$ given the spike variables $h$ and the visible units $v$ is given by:

$$p(s \mid v,h) = \prod_{i=1}^{N} \mathcal{N}\big( (\alpha_i^{-1} v^T W_i + \mu_i) h_i \,,\; \alpha_i^{-1} \big). \qquad (4)$$

As was the case with the conditional $p(v \mid s,h)$, deriving the conditional $p(s \mid v,h)$ from the joint distribution in Eqn. 2 reveals a Gaussian distribution with diagonal covariance. Eqn. 4 also shows how the mean of the slab variable $s_i$, given $h_i = 1$, is linearly dependent on $v$, and as the precision $\alpha_i \to \infty$, $s_i$ converges in probability to $\mu_i$.

Marginalizing out the slab variables $s$ yields the traditional RBM conditionals $p(v \mid h)$ and $P(h \mid v)$. The conditional $p(v \mid h)$ is also Gaussian,

$$p(v \mid h) = \frac{1}{P(h)} \frac{1}{Z} \int \exp\{-E(v,s,h)\}\, ds = \mathcal{N}\Big( C_{v|h} \sum_{i=1}^{N} W_i \mu_i h_i \,,\; C_{v|h} \Big) \qquad (5)$$

where $C_{v|h} = \big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i - \sum_{i=1}^{N} \alpha_i^{-1} h_i W_i W_i^T \big)^{-1}$.


The last equality in Eqn. 5 holds only if the covariance matrix $C_{v|h}$ is positive definite. Note that the covariance is not obviously parametrized to guarantee that it is positive definite. In Section 3, we will discuss strategies to ensure that $C_{v|h}$ is positive definite via constraints on $\Lambda$ and $\Phi$.

In marginalizing over $s$, the visible vector $v$ remains Gaussian-distributed, but the parametrization has changed. While the distribution $p(v \mid s,h)$ uses $h$ with $s$ to parametrize the conditional mean with a corresponding diagonal covariance, the conditional $p(v \mid h)$ uses $h$ to parametrize a covariance matrix that is non-diagonal due to the $\sum_{i=1}^{N} \alpha_i^{-1} h_i W_i W_i^T$ term, and a conditional mean mediated by $\mu$.

A closer look at Eqn. 5 reveals an important aspect of the inductive bias of the model. The conditional mean of $v$ given $h$ and the principal axis of the conditional covariance are generally in a similar direction, and if $\big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \big)$ is a scalar matrix (a scalar times the identity), the two vectors are in exactly the same direction. Having the principal component of the conditional covariance in the same direction as the mean has the property of $p(v \mid h)$ being maximally invariant to changes in the norm $\|v\|_2$. This is a desirable property for a model of natural images, where the norm is particularly sensitive to illumination conditions and image contrast levels, factors that are often irrelevant to tasks of interest such as object classification.

The final conditional that we will consider is the distribution over the latent spike variables $h$ given the visible vector, $P(h \mid v) = \prod_{i=1}^{N} P(h_i \mid v)$, with

$$P(h_i = 1 \mid v) = \frac{1}{p(v)} \frac{1}{Z_i} \int \exp\{-E(v,s,h)\}\, ds = \sigma\Big( \frac{1}{2} \alpha_i^{-1} (v^T W_i)^2 + v^T W_i \mu_i - \frac{1}{2} v^T \Phi_i v + b_i \Big), \qquad (6)$$

where $\sigma$ represents a logistic sigmoid. As with the conditionals $p(v \mid s,h)$ and $p(s \mid v,h)$, the distribution of $h$ given $v$ factorizes over the elements of $h$. Eqn. 6 shows the interaction between three data-dependent terms. The first term, $\frac{1}{2} \alpha_i^{-1} (v^T W_i)^2$, is the contribution due to the variance in $s$ about its mean (note the scaling with $\alpha_i^{-1}$) and appears in the sigmoid as a result of marginalizing out $s$. This term is always non-negative, so it always acts to increase $P(h_i \mid v)$. Countering this tendency to activate $h_i$ is the other term quadratic in $v$, $-\frac{1}{2} v^T \Phi_i v$, which is always a non-positive contribution to the sigmoid argument. In addition to these two quadratic terms, there is the term $v^T W_i \mu_i$, whose behaviour mimics the data-dependent term in the analogous GRBM version of the conditional distribution over $h$: $P_G(h_i \mid v) = \sigma(v^T W_i + b_i)$.
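Since Eqn. 6 is also what the feature extractor of Section 6 evaluates (the conditional mean of the binary $h$), a small sketch of the spike conditional may be useful. Shapes follow the energy sketch above and are our assumptions.

```python
import numpy as np

def p_h_given_v(v, W, Phi, alpha, mu, b):
    """Return P(h_i = 1 | v) for all i (Eqn. 6)."""
    vW = v @ W                           # (N,) projections v^T W_i
    arg = (0.5 * vW ** 2 / alpha         # slab-variance term, always >= 0
           + vW * mu                     # GRBM-like mean term
           - 0.5 * (v ** 2) @ Phi        # -1/2 v^T Phi_i v, always <= 0
           + b)
    return 1.0 / (1.0 + np.exp(-arg))    # logistic sigmoid
```

Note that `(v ** 2) @ Phi` evaluates $v^T \Phi_i v$ for every unit at once because each $\Phi_i$ is diagonal.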

Another perspective on the behaviour of $P(h_i \mid v)$ as a function of $v$ is gained by considering an alternative arrangement of the terms:

$$P(h_i = 1 \mid v) = \sigma\Big( \hat{b}_i - \frac{1}{2} (v - \xi_{v|h_i})^T C_{v|h_i}^{-1} (v - \xi_{v|h_i}) \Big) \qquad (7)$$

in which $C_{v|h_i} = \big( \Phi_i - \alpha_i^{-1} W_i W_i^T \big)^{-1}$, $\xi_{v|h_i} = C_{v|h_i} W_i \mu_i$, and $\hat{b}_i$ is a re-parametrization of the bias that incorporates the remainder of the completion of the square. It is evident from this form of $P(h_i = 1 \mid v)$ that, in the event that the matrix $C_{v|h_i}$ is positive definite, $P(h_i \mid v)$ reaches its maximum when $v = C_{v|h_i} W_i \mu_i$. We can easily confirm that the tail behaviour, as $v$ departs from its maximum, is Gaussian:

$$\sigma(-x^2) = \frac{\exp(-x^2)}{1 + \exp(-x^2)} \xrightarrow[x \to \infty]{} \exp(-x^2)$$

We will look to take advantage of the µ-ssRBM relationship between $\Phi_i$ and $\alpha_i^{-1} W_i W_i^T$ when we consider ways to constrain the covariance of $p(v \mid h)$ to be positive definite in Section 3.

To complete the exposition of the basic µ-ssRBM model, we present the free energy $f(v)$ of the visible vector:

$$f(v) = -\log \sum_h \int \exp\{-E(v,s,h)\}\, ds = \frac{1}{2} v^T \Lambda v - \frac{1}{2} \sum_{i=1}^{N} \log\big( 2\pi \alpha_i^{-1} \big) - \sum_{i=1}^{N} \log\Big[ 1 + \exp\Big\{ -\frac{1}{2} v^T C_{v|h_i}^{-1} v + v^T W_i \mu_i + b_i \Big\} \Big]$$
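A sketch of $f(v)$ under the same assumed conventions as above. Since $C_{v|h_i}^{-1} = \Phi_i - \alpha_i^{-1} W_i W_i^T$, the quadratic form can be evaluated as $v^T \Phi_i v - \alpha_i^{-1}(v^T W_i)^2$ without ever forming a $D \times D$ matrix.

```python
import numpy as np

def free_energy(v, W, Lam, Phi, alpha, mu, b):
    """Free energy f(v) of the visible vector."""
    vW = v @ W
    # v^T C^{-1}_{v|h_i} v = v^T Phi_i v - alpha_i^{-1} (v^T W_i)^2 per unit
    vCv = (v ** 2) @ Phi - vW ** 2 / alpha
    softplus_args = -0.5 * vCv + vW * mu + b
    return (0.5 * v @ (Lam * v)
            - 0.5 * np.sum(np.log(2 * np.pi / alpha))
            - np.sum(np.logaddexp(0.0, softplus_args)))  # log(1 + exp(x)), stably
```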

3. Positive Definite Parameterizations of the µ-ssRBM

Equation 5 reveals an important property of the µ-ssRBM model. The conditional $p(v \mid h)$ is only a well-defined Gaussian distribution if the covariance matrix $C_{v|h}$ is positive definite (PD). However, the covariance matrix is not parametrized to guarantee that this condition is met. If there exists a vector $x$ such that $x^T C_{v|h} x \le 0$, then the covariance matrix is not positive definite. In the original presentation of the ssRBM in Courville et al. (2011), the possibility of the non-positive-definiteness of the conditional covariance of $v$ given $h$ was dealt with by limiting the support over the domain of $v$ (i.e., $\mathbb{R}^D$) to a large but finite ball that encompasses all the training data. Such an approach is feasible but unsatisfying, as the model would naturally tend to build up probability density close to this arbitrary boundary, a region of the input space that should have low density.

Here, in the context of the µ-ssRBM model, we turn to the question of how we can constrain the model parameters to guarantee that the model remains well-behaved (i.e., all conditionals are well-defined probability densities). The problem we face is ensuring that the covariance, or equivalently the precision matrix, of $p(v \mid h)$ is positive definite, i.e., we wish to satisfy the constraint:

$$x^T C_{v|h}^{-1} x > 0 \quad \forall x \ne 0$$

To satisfy this constraint, we need to ensure that $\Lambda + \sum_{i=1}^{N} \Phi_i h_i$ is large enough to offset $\sum_{i=1}^{N} \alpha_i^{-1} W_i W_i^T h_i$.

We consider two basic strategies: (1) define $\Lambda$ to be large enough to offset a worst-case setting of the $h$; and (2) define the $\Phi_i$ to ensure that the contribution of each active $h_i$ is itself PD.

3.1. Constraining Λ

One option to ensure that $C_{v|h}$ remains PD for all patterns of $h$ activation is to constrain $\Lambda$ to be large enough. In setting a constraint on $\Lambda$, we will ignore the contribution of the $\Phi_i$ terms (which leads to non-tightness of the constraint). Since the contribution of every $\alpha_i^{-1} W_i W_i^T h_i$ term is negative semi-definite, the worst case setting of the $h$ would be to have $h_i = 1$ for all $i \in [1,N]$. This implies that $\Lambda$ must be constrained such that:

$$x^T \Big( \Lambda - \sum_{i=1}^{N} \alpha_i^{-1} W_i W_i^T \Big) x > 0 \quad \forall x \ne 0. \qquad (8)$$

If we take $\Lambda$ to be a scalar matrix, $\Lambda = \lambda I$, then the problem of enforcing a PD precision matrix reduces to ensuring that $\lambda$ is greater than the maximum eigenvalue $\rho$ of $\sum_{i=1}^{N} \alpha_i^{-1} W_i W_i^T$. In practice we can use the power iteration method to quickly estimate an upper bound on the maximum eigenvalue, and then constrain $\lambda > \rho$ throughout training.
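A sketch of this procedure, assuming the same parameter layout as before. The matrix $M = \sum_i \alpha_i^{-1} W_i W_i^T$ is never formed explicitly; the iteration count and safety margin are arbitrary choices of ours.

```python
import numpy as np

def lambda_lower_bound(W, alpha, n_iters=50, margin=1.01):
    """Estimate the top eigenvalue rho of sum_i alpha_i^{-1} W_i W_i^T by
    power iteration and return a value to keep lambda above throughout training."""
    D = W.shape[0]
    x = np.random.randn(D)
    x /= np.linalg.norm(x)
    for _ in range(n_iters):
        x = W @ ((x @ W) / alpha)             # M x in O(DN), no D x D matrix
        x /= np.linalg.norm(x)
    rho = x @ (W @ ((x @ W) / alpha))         # Rayleigh quotient at convergence
    return margin * rho                       # constrain lambda > rho
```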

3.2. Constraining Φ

Another option to ensure that $C_{v|h}$ remains PD for all patterns of $h$ activation is to constrain $\Phi_i$ to be large enough to ensure that the contribution of each $h_i$ is PD. Let $W_{ij}$ be the $j$th element of the filter $W_i$ (or equivalently, the $ij$th element of the weight matrix $W$) and let $\Phi_{ij}$ denote the $jj$th element of the diagonal $\Phi_i$ matrix. We want to choose $\Phi_i$ such that:

$$J(x, \Phi_i) = \sum_j x_j^2 \Phi_{ij} - \Big( \sum_j W_{ij} x_j \Big)^2 > 0 \quad \forall x \in \mathbb{R}^D,\ x \ne 0.$$

This condition will be satisfied for all $x$ if we can satisfy it for the $x = u$ of norm 1 that minimizes $J(x, \Phi_i)$ ($u$ is the eigenvector of the smallest eigenvalue of the matrix $\Phi_i - \alpha_i^{-1} W_i W_i^T$). To find $u$, we define the Lagrangian

$$L(x, \Phi_i) = J(x, \Phi_i) + \eta \Big( 1 - \sum_j x_j^2 \Big)$$

to enforce the constraint $\sum_j x_j^2 = 1$. Setting $\frac{\partial L}{\partial x_j} = 0$, we recover the set of constraints:

$$\sum_j \frac{W_{ij}^2}{\Phi_{ij} - \eta} = 1 \qquad (9)$$

$$\sum_j \frac{W_{ij}^2 \Phi_{ij}}{(\Phi_{ij} - \eta)^2} > 1, \qquad \eta > 0$$

Consider the following general parametrization of $\Phi_i$:

$$\Phi_{ij} = \eta + q \alpha^{-1} \frac{W_{ij}^2}{\beta_{ij}}.$$

This particular form is chosen so that our constraint set gives us $q = \sum_j \beta_{ij}$ and $\beta_{ij} > 0$. So with $\Phi_{ij}$ parametrized as

$$\Phi_{ij} = \zeta_{ij} + \alpha^{-1} \frac{W_{ij}^2}{\beta_{ij} / \sum_{j'} \beta_{ij'}}$$

with $\zeta_{ij} > 0$, the covariance matrix of $p(v \mid h)$ is guaranteed to be PD. The parameter $\zeta_{ij}$ is an extra degree of freedom in $\Phi_{ij}$, to be estimated through maximum likelihood learning.

We are free to choose the parametrization of the $\beta_{ij}$ provided $\beta_{ij} > 0$. For example, with the choice $\beta_{ij} = W_{ij}^2$, $\Phi_i$ simplifies to

$$\Phi_i = \zeta_i + \alpha^{-1} \Big( \sum_j W_{ij}^2 \Big) I \qquad (10)$$

where $\Phi_i$ takes the form of a scalar matrix (up to the $\zeta_i$ contribution). Alternatively, we could choose $\beta_{ij} = 1$, with the result that

$$\Phi_{ij} = \zeta_{ij} + \alpha^{-1} D\, W_{ij}^2 \qquad (11)$$

where the $j$th element on the diagonal of $\Phi_i$ is scaled with $W_{ij}^2$.
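The two choices can be written compactly. The sketch below assumes a single shared scalar $\alpha$, as in Eqns. 10 and 11, treats $\zeta$ as a free positive parameter (scalar or array), and returns the $\Phi_i$ diagonals in the same (D, N) layout assumed throughout; the function names are ours.

```python
import numpy as np

def phi_scalar(W, alpha, zeta):
    """Eqn. 10: beta_ij = W_ij^2, so each Phi_i is a scalar matrix (up to zeta)."""
    D, N = W.shape
    col_norms = np.sum(W ** 2, axis=0) / alpha   # alpha^{-1} sum_j W_ij^2, shape (N,)
    return zeta + np.tile(col_norms, (D, 1))     # diagonals of Phi_i, shape (D, N)

def phi_steered(W, alpha, zeta):
    """Eqn. 11: beta_ij = 1, so Phi_i is steered along the shape of W_i."""
    D = W.shape[0]
    return zeta + (D / alpha) * W ** 2
```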

While we are free to choose $\beta_{ij} > 0$ as we would like, the decision affects the inductive bias of the model. In the case of the $\Phi_{ij}$ parametrization given in Eqn. 10, the presence of the $\sum_{i=1}^{N} \Phi_i h_i$ as a scaling on the mean of the conditional $p(v \mid s,h)$ (Eqn. 3) implies that the activation of any $h_i$ will have an effect on the scaling of the mean across the entire visible vector (or layer), irrespective of how localized the corresponding filter $W_i$ is. Unsurprisingly, use of this parametrization tends to encourage both sparsely active $h_i$ and $W_i$ having relatively large receptive fields.

The $\Phi_i$ parametrization via Eqn. 11 has the property that the $\Phi_i$ receptive fields are steered in the direction of $W_i$. Where $W_i$ is near zero, $\Phi_i$ has little effect (unless it is mediated by $\zeta_i$). This is an appealing property for modeling images or other data that give rise to sparse receptive fields $W_i$.

3.3. Comparing strategies

The two general strategies to guarantee that the covariance matrix of $p(v \mid h)$ is positive definite are in some sense complementary. In the sparse operating regime of the µ-ssRBM (most $h_i$ are inactive over most of the dataset), the $\Lambda$ worst-case assumption that $\forall i: h_i = 1$ becomes increasingly inaccurate and, as a result, the constraint on $\Lambda$ becomes increasingly conservative. Therefore, in the sparse regime, the $\Phi$ constraints would seem more appropriate. On the other hand, in a highly non-sparse regime, the individual contributions of the $\Phi$ to the global precision matrix can combine to form a more conservative PD precision matrix than would result from a constrained $\Lambda$.

It is also possible to distribute responsibility for ensuring the constraint is satisfied jointly to $\Lambda$ and $\Phi$. This would constitute a mixed strategy, apportioning responsibility for compensating for the negative definite $-\sum_{i=1}^{N} \alpha_i^{-1} W_i W_i^T h_i$ term to both $\Lambda$ and $\Phi$.

4. µ-ssRBM Learning and Inference

Learning and inference in the µ-ssRBM proceeds analogously to the original ssRBM and is rooted in the ability to efficiently draw samples from the model via Gibbs sampling. As with the original ssRBM, we seek a set of conditionals that will enable simple and efficient Gibbs sampling. Since sampling from the conditional $p(v \mid h)$ would involve the computationally prohibitive step of inverting a non-diagonal covariance matrix, we pursue a strategy of alternately sampling from the conditionals $P(h \mid v)$, $p(s \mid v,h)$ and $p(v \mid s,h)$. Each of these conditionals has the property that the distribution factors over the elements of the random vector, allowing us to sample efficiently.
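A minimal sketch of one such alternating sweep, drawing from Eqns. 6, 4, and 3 in turn. It reuses the `p_h_given_v` sketch from Section 2; `rng` is a `numpy.random.Generator`, and all shape conventions are our assumptions.

```python
import numpy as np

def gibbs_sweep(v, W, Lam, Phi, alpha, mu, b, rng):
    """One sweep h | v, then s | v,h, then v | s,h. Returns the new state."""
    # Spike variables: factorial Bernoulli, Eqn. 6.
    h = (rng.random(W.shape[1]) < p_h_given_v(v, W, Phi, alpha, mu, b)) * 1.0
    # Slab variables: Gaussian with mean (alpha_i^{-1} v^T W_i + mu_i) h_i
    # and variance alpha_i^{-1}, Eqn. 4.
    s = ((v @ W) / alpha + mu) * h + rng.standard_normal(W.shape[1]) / np.sqrt(alpha)
    # Visibles: diagonal-covariance Gaussian, Eqn. 3.
    prec = Lam + Phi @ h                       # diagonal precision of v | s, h
    v_mean = (W @ (s * h)) / prec              # C_{v|s,h} sum_i W_i s_i h_i
    v = v_mean + rng.standard_normal(W.shape[0]) / np.sqrt(prec)
    return v, s, h
```

Because every conditional is factorial, a full sweep costs only a handful of matrix-vector products.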

In training the µ-ssRBM, we use the stochastic maximum likelihood algorithm (SML, also known as persistent contrastive divergence) (Tieleman, 2008), where only one or a few Markov chain (Gibbs) simulations are performed between each parameter update. These samples are then used to approximate the expectations over the model distribution $p(v,s,h)$.

The data log likelihood gradient, $\frac{\partial}{\partial \theta_i} \big( \sum_{t=1}^{T} \log p(v_t) \big)$, is:

$$-\sum_{t=1}^{T} \left\langle \frac{\partial}{\partial \theta_i} E(v_t, s, h) \right\rangle_{p(s,h \mid v_t)} + T \left\langle \frac{\partial}{\partial \theta_i} E(v, s, h) \right\rangle_{p(v,s,h)}$$

The log likelihood gradient takes the form of a difference between two expectations, with the expectations over $p(s,h \mid v_t)$ in the "clamped" condition, and the expectation over $p(v,s,h)$ in the "unclamped" condition. As with the standard RBM, the expectations over $p(s,h \mid v_t)$ are amenable to analytic evaluation.
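For concreteness, here is a sketch of one SML update for the weight matrix $W$ alone, built on the Gibbs sweep above. Since $\partial E / \partial W_{ji} = -v_j s_i h_i$, the gradient is the usual difference of correlations between the clamped and free-running phases. We use a single sample of $(s,h)$ in the clamped phase, whereas the expectation over $p(s,h \mid v_t)$ can be evaluated analytically as noted above; that simplification, and all names, are ours.

```python
import numpy as np

def sml_update_W(v_data, v_neg, W, Lam, Phi, alpha, mu, b, lr, rng):
    """One SML/PCD step for W; v_neg is the persistent chain state."""
    # Positive phase: clamp v to data; sample (s, h) | v once.
    _, s_pos, h_pos = gibbs_sweep(v_data, W, Lam, Phi, alpha, mu, b, rng)
    # Negative phase: advance the persistent chain by one full sweep.
    v_neg, s_neg, h_neg = gibbs_sweep(v_neg, W, Lam, Phi, alpha, mu, b, rng)
    # Gradient ascent on the log likelihood: <v (s*h)^T>_data - <v (s*h)^T>_model.
    grad = np.outer(v_data, s_pos * h_pos) - np.outer(v_neg, s_neg * h_neg)
    return W + lr * grad, v_neg
```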

5. Comparison to Previous Work

There is now a significant body of work on modelling natural images with RBM-based models. The closest connection to this work is obviously to the original ssRBM (Courville et al., 2011), which we recover by setting the µ-ssRBM parameters $\mu_i = 0$ and $\Phi_i = 0$ for all $i \in [1,N]$. A slightly less obvious limiting case of the µ-ssRBM is the Gaussian RBM (GRBM). Setting $\Phi_i$ to be proportional to $\alpha_i^{-1}$ (as discussed in Section 3) and taking $\alpha_i = \alpha \to \infty$, we define a Dirac in $s$ about $\mu$. In this limit, the conditionals $p(v \mid h)$ and $P(h_i = 1 \mid v)$ (Eqns. 5 and 6 respectively) are given by:

$$\lim_{\alpha \to \infty} p(v \mid h) = \mathcal{N}\Big( \Lambda^{-1} \sum_{i=1}^{N} W_i \mu_i h_i \,,\; \Lambda^{-1} \Big)$$

$$\lim_{\alpha \to \infty} P(h_i = 1 \mid v) = \sigma\big( v^T W_i \mu_i + b_i \big)$$

If we fix $\mu = 1$ and $\Lambda = I$, we recover the Gaussian RBM conditionals. Note that the connection between the µ-ssRBM and the Gaussian RBM is mediated entirely by the $\mu$ parameter; the original ssRBM has no such connection with the GRBM.

Beyond the ssRBM, the closest models to the µ-ssRBM are the mean and covariance RBM (mcRBM) (Ranzato & Hinton, 2010) and the mean Product of t-distributions model (mPoT) (Ranzato et al., 2010). Like the µ-ssRBM, both of these are energy-based models where the conditional distribution over the visible units conditioned on the hidden variables is a multivariate Gaussian with nonzero mean and a non-diagonal covariance matrix. However, the µ-ssRBM differs from these models in the way the conditional means and covariance interact. In both the mcRBM and the mPoT model, the means are modelled by the introduction of additional GRBM hidden units, whereas in the µ-ssRBM, each hidden unit can potentially contribute to both the conditional mean and covariance. For the $i$th hidden unit, the contribution to each is controlled by the relative values of $\mu_i$ and $\alpha_i$. With large $|\mu_i|$ and small $\alpha_i$, the unit predominantly contributes to the conditional mean. Conversely, with large $\alpha_i$ and small $|\mu_i|$, the unit mostly contributes to the conditional covariance. The advantage of the µ-ssRBM approach is that we are able to save a hyper-parameter by letting maximum likelihood induction optimize the trade-off between mean and covariance modeling. Empirically, we find that while some hidden units learn to focus exclusively on the conditional covariance (with $\mu_i \approx 0$), most units take advantage of the flexibility offered in the µ-ssRBM framework and contribute to both the conditional mean and covariance.

The mPoT and mcRBM also differ from the µ-ssRBM in how they parametrize the visible covariance. While the µ-ssRBM uses $h$ activations to pinch the precision matrix along the direction specified by the corresponding weight vector, both the mcRBM and the mPoT models use their latent variable activations to maintain constraints, decreasing in value to allow variance in the direction of the corresponding weight vector.

Interestingly, the µ-ssRBM term involving $\Phi_i$ is very similar to the covariance term in the mcRBM energy function. The difference is that our restriction to a diagonal $\Phi_i$ significantly limits what we can model with it. However, this restricted structure allows us to sample efficiently from the model using Gibbs sampling, which is not available to either the mcRBM or the mPoT model. In addition, the roles of these covariance terms are also quite different. In the mcRBM, the covariance term is associated with the model's feature vectors. In the µ-ssRBM, we use the $\Phi_i$ both to help constrain the model's conditionals to be well-defined, and in conjunction with $W$, to help maximize the likelihood of the training data.

Finally, the covariance structure of the µ-ssRBM conditional $p(v \mid h)$ (Eqn. 5) is very similar to the product of probabilistic principal components analysis (PoPPCA) model (Williams, 2001), with components corresponding to the µ-ssRBM weight vectors associated with the active hidden units ($h_i = 1$).

6. Experiments

We demonstrate the utility of the µ-ssRBM on the CIFAR-10 dataset by classifying images and by sampling from the model. In particular, our experiments are directed toward exploring the properties of the different elements of the model, including the roles of $\mu$ and $\Phi$ and the effects of the various $\Lambda$ and $\Phi$ PD constraints.

Figure 1. (Top left) ZCA-whitened data used for patch-wise training. (Top right) Filters $W$ learnt when $\mu$ and $\Phi$ were fixed at zero. These filters produce edges similar to many other models, and neatly separate black-and-white edges from colour ones. (Bottom left) Filters $W$ and (bottom right) $\Phi$ learnt when $\mu$ and $\Phi$ are fit to the data. The combination of $W$ and $\Phi$ gives individual units more flexibility, and gives rise to a richer variety of features.

Our experiments are based on the CIFAR-10 image classification dataset, consisting of 40 000 training images, 10 000 validation images, and 10 000 test images. The images are 32-by-32 pixel RGB images. Each image is labelled with one of ten object categories (aeroplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) according to the most prominent object in the image.

Classification: We evaluate the µ-ssRBM as a feature-extraction algorithm by plugging it into the classification pipeline developed by Coates et al. (2011). In broad strokes, the µ-ssRBM is fit to (192-dimensional) 8x8 RGB image patches, and then applied convolutionally to the 32x32 images. The image patches (starting from pixels between 0 and 255) on which the µ-ssRBM was trained were centered, and then normalized by dividing by the square root of their variance plus a noise-cancelling constant (10). The normalized patches were whitened by ZCA (Hyvarinen & Oja, 2000) with a small positive constant (0.1) added to all eigenvalues. The resulting patches (Figure 1, top left) are mostly grey, with high spatial frequencies amplified and lower spatial frequencies attenuated. Our models were trained on the 16 non-overlapping 8x8 patches from each of the first 10 000 training set images in CIFAR-10 (for a total of 160 000 training examples).
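A sketch of this preprocessing under our reading of the constants (10 added to the per-patch variance, 0.1 added to the ZCA eigenvalues); the function name and shapes are ours.

```python
import numpy as np

def preprocess(patches, eps_norm=10.0, eps_zca=0.1):
    """patches: (n, 192) flattened 8x8 RGB patches, pixel values in [0, 255]."""
    X = patches - patches.mean(axis=1, keepdims=True)         # center each patch
    X = X / np.sqrt(X.var(axis=1, keepdims=True) + eps_norm)  # contrast normalize
    C = np.cov(X, rowvar=False)                               # (192, 192) covariance
    eigval, eigvec = np.linalg.eigh(C)
    Z = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps_zca)) @ eigvec.T
    return X @ Z                                              # ZCA-whitened patches
```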


Table 1. The performance of µ-ssRBM variants with 256 hidden units in CIFAR-10 image classification (± 95% confidence intervals). "no PD" = without PD constraint.

Model                          Accuracy (%)
no PD, µ free, Φ free          73.1 ± 0.9
no PD, µ free, Φ = 0           71.43 ± 0.9
no PD, µ = 0, Φ free           71.19 ± 0.9
no PD, µ = 0, Φ = 0            68.92 ± 0.9
PD by Diag. W (Eqn. 11)        69.1 ± 0.9
PD by Λ (Eqn. 8)               68.3 ± 0.9
PD by scal. mat. (Eqn. 10)     67.1 ± 0.9


Models were trained for one hundred thousand minibatches of 100 patches. On an NVIDIA GTX 285 GPU this training took on the order of 15 minutes for most models. We used SML training (Tieleman, 2008). Classification was done with an $\ell_2$-regularized SVM. The SVM was applied to the conditional mean value of the latent spike ($h$) variables, extracted from every 8x8 image patch in the 32x32 CIFAR-10 image. Prior to classification, our conditional $h$ values were spatially pooled into 9 regions, analogous to the 4 quadrants employed in Coates et al. (2011). For a model with $N$ hidden units, the classifier operated on a feature vector of $9N$ elements.
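A sketch of this feature-extraction and pooling stage. The 3x3 grid and sum-pooling are our interpretation of the "9 regions" (the paper does not spell out the pooling operator), and `p_h_given_v` is the sketch from Section 2.

```python
import numpy as np

def pooled_features(image_patches, W, Phi, alpha, mu, b):
    """image_patches: (25, 25, 192) whitened 8x8 patches from one 32x32 image,
    one per valid patch position. Returns a (9 * N,) feature vector."""
    R, C, _ = image_patches.shape
    N = W.shape[1]
    H = np.empty((R, C, N))
    for r in range(R):
        for c in range(C):
            # Conditional mean of the binary spikes at this location (Eqn. 6).
            H[r, c] = p_h_given_v(image_patches[r, c], W, Phi, alpha, mu, b)
    # Pool the conditional means over a 3x3 grid of spatial regions.
    rows = np.array_split(np.arange(R), 3)
    cols = np.array_split(np.arange(C), 3)
    feats = [H[np.ix_(rs, cs)].sum(axis=(0, 1)) for rs in rows for cs in cols]
    return np.concatenate(feats)
```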

Table 1 lists the performance of several variants of the µ-ssRBM model. For this comparison all variants were trained with a small amount of sparsity aimed at maintaining 15% activity, and were configured with 256 hidden units. The lines labelled PD correspond to models that were constrained to have positive definite covariance of $p(v \mid h)$, while the lines labelled no PD were not. If µ = 0 appears in a line, the corresponding model was trained with $\mu = 0$, or equivalently with the $\mu$ terms removed from the energy function. The nomenclature for $\Phi$ is analogous. This implies, for instance, that the original ssRBM model corresponds to the no PD, µ = 0, Φ = 0 condition.

Table 1 reveals that it is possible to constrain $\Phi$ to enforce that $C_{v|h}$ is PD and achieve classification results that match those of the original ssRBM. However, if we take the same µ-ssRBM form and loosen the PD constraint, the model can perform much better. Also of note is that both the $\mu$ and the $\Phi$ terms seem to contribute approximately equally to improving the classification accuracy.

Table 2 situates the performance of the µ-ssRBM in the literature of results on CIFAR-10. The µ-ssRBM performs better than the most closely-related models: the GRBM, cRBM, and mcRBM. Recent work by Coates et al. (2011) has shown that a feature extractor based on k-means actually outperforms these energy-based approaches to feature extraction on CIFAR-10, in the limit of very large hidden unit counts. Future work will look at more effective training strategies for energy models in this regime.

Table 2. The performance of the µ-ssRBM relative to other models in the literature for CIFAR-10 (± 95% confidence interval). The k-means results are taken from Coates et al. (2011); the "conv. trained DBN" result is the convolutionally trained two-layer Deep Belief Network (DBN) with rectified linear units reported in Krizhevsky (2010); and the GRBM, cRBM, and mcRBM results are taken from Ranzato & Hinton (2010).

Model                      Accuracy (%)
k-means (4000 units)       79.6 ± 0.9
conv. trained DBN          78.9 ± 0.9
µ-ssRBM (4096 units)       76.7 ± 0.9
k-means (1200 units)       76.2 ± 0.9
µ-ssRBM (1024 units)       76.2 ± 0.9
k-means (800 units)        75.3 ± 0.9
µ-ssRBM (512 units)        74.1 ± 0.9
µ-ssRBM (256 units)        73.1 ± 0.9
k-means (400 units)        72.7 ± 0.9
k-means (200 units)        70.1 ± 0.9
mcRBM (225 factors)        68.2 ± 0.9
cRBM (900 factors)         64.7 ± 0.9
cRBM (225 factors)         63.6 ± 0.9
GRBM                       59.7 ± 1.0

Model Samples: To draw samples from the model, we trained it convolutionally, similarly to Krizhevsky (2010). Our convolutional implementation of the µ-ssRBM included 1000 fully-connected units to capture global structure, and 64 hidden units for every position of a 9x9 RGB filter. The model was trained on the CIFAR dataset, centered and globally contrast normalized. Filters $W$ and $\Phi$ were shared across the image, though independent scalar parameters $\mu_i$, $\alpha_i$, and hidden unit bias $b_i$ were allocated for each individual hidden unit. Figure 2 illustrates some samples drawn from the model. The samples are taken from the negative phase near the end of training (with the learning rate annealed to near zero). These samples exhibit global coherence and sharp region boundaries. Qualitatively, these samples are more compelling than samples from similar energy-based models, such as those featured in Ranzato et al. (2010).


Figure 2. (Left) Samples from a convolutionally trained µ-ssRBM exhibit global coherence, sharp region boundaries, a range of colours, and natural-looking shading. (Right) The images in the CIFAR-10 training set closest (L2 distance with contrast-normalized training images) to the corresponding model samples. The model does not appear to be capturing the natural image statistical structure by overfitting particular examples from the dataset.

7. Discussion

In this paper we have introduced the µ-ssRBM, a generalization of the ssRBM that includes extra terms in the energy function. One of these terms permits the model to capture a non-zero mean in the Gaussian conditional $p(v \mid h)$, bringing the ssRBM framework in line with the recent work of Ranzato & Hinton (2010) and Ranzato et al. (2010), which also modelled the conditional of the observed data given the latent variable values as a general multivariate Gaussian with non-zero mean and full covariance. Unlike these other approaches, instantiating the slab vector $s$ renders the µ-ssRBM amenable to efficient block Gibbs sampling.

The other functional term included in the µ-ssRBM energy function adds a positive definite diagonal contribution to the covariances associated with the Gaussian conditionals over the observations. This term was used to define variants of the µ-ssRBM that were constrained to have well-defined conditionals.

Still, our techniques for constraining the µ-ssRBM to have PD conditionals are based on loose worst-case scenarios, and potentially leave room for improvement. Our classification experiments indicate that the µ-ssRBM was able to use the extra capacity offered by the addition of these elements to the energy function to improve classification accuracy. They also show that the addition of the PD constraint comes at the cost of classification performance.

Acknowledgements

We acknowledge NSERC, FQRNT, RQCHP and Compute Canada for their financial and computational support. We also thank Chris Williams for insightful comments.

References

Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS'2011, 2011.

Courville, A., Bergstra, J., and Bengio, Y. The spike and slab restricted Boltzmann machine. In AISTATS'2011, 2011.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

Hyvarinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krizhevsky, A. Convolutional deep belief networks on CIFAR-10. Technical report, University of Toronto, 2010.

Mitchell, T. J. and Beauchamp, J. J. Bayesian variable selection in linear regression. J. Amer. Statistical Assoc., 83(404):1023–1032, 1988.

Ranzato, M. and Hinton, G. E. Modeling pixel means and covariance using factorized third-order Boltzmann machines. In CVPR'10. IEEE Press, 2010.

Ranzato, M., Mnih, V., and Hinton, G. Generating more realistic images using gated MRF's. In NIPS'10, pp. 2002–2010. MIT Press, 2010.

Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. In AISTATS'2009, pp. 448–455, 2009.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML 2008, pp. 1064–1071, 2008.

Williams, C. K. I. On a connection between kernel PCA and metric multidimensional scaling. In NIPS'00, pp. 675–681. MIT Press, 2001.