0960-3174 © 1997 Chapman & Hall

Difficulties in the use of auxiliary variables
in Markov chain Monte Carlo methods
MERRILEE HURN
School of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
Received January 1996 and accepted July 1996
Markov chain Monte Carlo (MCMC) methods are now widely used in a diverse range of application areas to tackle previously intractable problems. Difficult questions remain, however, in designing MCMC samplers for problems exhibiting severe multimodality where standard methods may exhibit prohibitively slow movement around the state space. Auxiliary variable methods, sometimes together with multigrid ideas, have been proposed as one possible way forward. Initial disappointing experiments have led to data-driven modifications of the methods. In this paper, these suggestions are investigated for lattice data such as is found in imaging and some spatial applications. The results suggest that adapting the auxiliary variables to the specific application is beneficial. However, the form of adaptation needed and the extent of the resulting benefits are not always clear-cut.

Keywords: Adaptive methods, auxiliary variables, Bayesian image analysis, confocal fluorescence microscopy, Markov chain Monte Carlo, multigrid, multimodality
1. Introduction
Suppose that it is necessary to estimate some functional of a random variable X which has distribution π(x). When exact answers cannot be found, simulation from π(x) is often used instead. But what if π(x) is sufficiently complex that direct simulation is out of the question? Markov chain Monte Carlo (MCMC) methods developed in response to this problem (Metropolis et al., 1953; Hastings, 1970). A Markov chain of X values is generated on the same state space, where by design this iterative sequence converges weakly to the desired target distribution π(x). In the cases considered here, X will be defined on a discrete state space. Conditions on the transition function of the chain P(x → x′) for the required convergence are that it is irreducible, aperiodic, and has π(x) as its equilibrium distribution. The Gibbs sampler (Geman and Geman, 1984) and the Metropolis–Hastings algorithm are the two most commonly used variants of this idea. See the papers by Besag and Green (1993), Smith and Roberts (1993) and Gilks et al. (1993) for a thorough discussion of MCMC theory and application.
Two important questions which arise on a problem-by-problem basis are the rate of convergence to the target distribution, and the accuracy of ergodic average estimation from the chain of correlated samples. Green and Han (1991) detail how these relate to the transition function P(x → x′). Summaries of theoretical results in the assessment of convergence, and extensive reviews of the various output diagnostics used in practice, can be found in Cowles and Carlin (1996) and Brooks and Roberts (1996). Some distributions are intrinsically harder to sample than others. In particular, as might be expected, distributions exhibiting a high degree of multimodality or strong correlations between components of X are particularly awkward. It is for such situations that auxiliary variable samplers have been proposed. In Section 2, the auxiliary variable method is described, together with two modifications which adapt the algorithm to the particular application. These methods are applied to an archaeological data set exhibiting multimodality in Section 3, and to an image analysis problem in Section 4. Suggestions as to the type of problem where these adaptive approaches can be useful are given in Section 5.
Statistics and Computing (1997) 7, 35–44

2. Auxiliary variable methods

2.1. The general case

The use of auxiliary variables in MCMC methods originates in the statistical physics literature. Swendsen and Wang (1987) give a two-stage algorithm for simulating from a supercritical Ising model for which long-range correlations exist on the lattice. A model of this type presents great difficulty to samplers which update one site at a time, because the high interaction penalties lead to slow convergence and highly correlated samples. The Swendsen and Wang algorithm's good performance comes from its ability to 'decouple' some of the sites, leaving conditionally independent single-valued groups of sites for updating. Statistical interpretation of the generalized algorithm is given by Besag and Green (1993) in terms of auxiliary variables: a new variable U, the auxiliary variable, is introduced by defining its conditional distribution given X, π(u|x). A Markov chain is then constructed alternating between transitions on U, by drawing from π(u|x), and transitions on X. To ensure that this two-step procedure has the desired target distribution π(x) as its stationary distribution, the transition function P(x → x′ | u) satisfies detailed balance with respect to the conditional π(x|u). The simplest choice for this is P(x → x′ | u) = π(x′|u) (although see Møller, 1993, for one alternative).
To see how this construction might help, assume that the distribution of interest is strictly positive for X in the sample space, and that it can be written in the form

  π(x) ∝ ∏_{s=1}^{S} a_s(x_s) ∏_k b_k(x),      (1)

where s indexes the components of X, a_s(x_s) represents the part of the model not due to component interaction, and k indexes the interaction terms {b_k(x)}. It is the interaction terms {b_k(x)} which hinder simulation. Define one component of U for each interaction k, with these {U_k} conditionally independent given X, and each with uniform distribution

  U_k | x ~ U(0, b_k(x)).      (2)

Simulation from the distribution π(u|x) is straightforward. Then, via the joint distribution of X and U obtained from (1) and (2),

  π(x′|u) ∝ ∏_{s=1}^{S} a_s(x′_s) ∏_k I(b_k(x′) ≥ u_k).      (3)
Simulation from π(x′|u) would also be free of any interaction complications if it were not for the indicator terms, ∏_k I(b_k(x′) ≥ u_k). Notice though that when u_k ≤ min_x b_k(x) for any k, there is no restriction on the corresponding components of x′, with all possible values satisfying I(b_k(x′) ≥ u_k). Groups of x′ components connected by a path of interactions k for which each u_k > min_x b_k(x) are known as 'clusters'. Within clusters, the indicator conditions may be non-trivial to satisfy. However, between clusters any x′ will satisfy the indicator conditions, and thus clusters are conditionally independent given U and can be updated separately. The net result is the ability to update groups of components, the clusters, simultaneously, having killed the between-cluster interactions.
Finally, irreducibility and aperiodicity of the transition function P(x → x′) can be noted in the following way. The strict positivity of π(x) means that it is possible to move from any x to any other x′ via a realization u in which each interaction component satisfies u_k ≤ min_x b_k(x) (in effect, when all the clusters each consist of a single component).
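The alternating u- and x-updates can be sketched on a toy discrete target. The following is an illustrative Python sketch (not code from the paper) with a single awkward factor b playing the role of the interaction term that the auxiliary variable 'kills': draw u ~ U(0, b(x)), then draw x′ from a(·) restricted to the states with b(x′) ≥ u.

```python
import random
from collections import Counter

# Toy discrete target pi(x) proportional to a(x) * b(x); the factor b
# stands in for the interaction term handled by the auxiliary variable.
a = {0: 1.0, 1: 2.0, 2: 1.0}
b = {0: 0.5, 1: 0.1, 2: 1.0}

def auxiliary_step(x, rng):
    """One auxiliary-variable transition: draw u ~ U(0, b(x)), then draw
    x' from pi(x'|u), which is proportional to a(x') * I(b(x') >= u)."""
    u = rng.uniform(0.0, b[x])
    allowed = [s for s in a if b[s] >= u]
    weights = [a[s] for s in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]

def run(n, rng):
    x, counts = 0, Counter()
    for _ in range(n):
        x = auxiliary_step(x, rng)
        counts[x] += 1
    return counts

rng = random.Random(1)
counts = run(200000, rng)
# Empirical frequencies approach pi proportional to a*b = (0.5, 0.2, 1.0)
```

The joint density is π(x, u) ∝ a(x) I(u ≤ b(x)), so the x-update is exactly a draw from π(x′|u) in (3) and π(x) is preserved.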
2.2. The noisy Ising model
One situation in which the auxiliary variable construction is easily implemented, and which is of considerable interest in classification problems such as arise in some imaging applications, is a noisy version of the Ising model. Here X is an underlying classification: for example, in Section 3, X is the presence or absence of previous human activity at 10 m spaced archaeological sites. Data {Y_s} are available on the same lattice, and are assumed to be the typical response of the classification at site s to the measurement process, μ_{X_s}, corrupted by pixel-wise independent additive Gaussian noise of variance σ². In the usual Bayesian framework, the prior distribution on X is taken to be a form of the Ising model with nearest neighbour spatial interactions ⟨s,t⟩ and parameter β (see Besag et al., 1991), and the distribution of interest is the posterior of the classification given the data,

  π(x|y) ∝ ∏_{s=1}^{S} exp(−(2σ²)^{−1}(y_s − μ_{x_s})²) ∏_{⟨s,t⟩} exp(−β I(x_s ≠ x_t)).      (4)
The Ising prior contributes the ⟨s,t⟩ interaction terms, and the Gaussian likelihood the pixel-wise 'preference' terms. In this situation, the auxiliary variable U_{⟨s,t⟩} follows one of two uniform distributions depending on the current x:

  U_{⟨s,t⟩} | x ~ U(0, 1)            if x_s = x_t
  U_{⟨s,t⟩} | x ~ U(0, exp(−β))      if x_s ≠ x_t      (5)

and then

  π(x′|u) ∝ ∏_{s=1}^{S} exp(−(y_s − μ_{x′_s})²/(2σ²)),

provided

  exp(−β I(x′_s ≠ x′_t)) ≥ u_{⟨s,t⟩}  for all ⟨s,t⟩.      (6)
When are the conditions on x′ in (6) satisfied? Here b_{⟨s,t⟩}(x) is dichotomous, with min_x b_{⟨s,t⟩}(x) = exp(−β). If u_{⟨s,t⟩} ≤ exp(−β), then any x′_s and x′_t will satisfy the ⟨s,t⟩ constraint. However, if u_{⟨s,t⟩} > exp(−β), the requirement is that x′_s equals x′_t. Values of u_{⟨s,t⟩} ≤ exp(−β) could have occurred either when x_s = x_t or when x_s ≠ x_t, but obtaining u_{⟨s,t⟩} > exp(−β) is only possible when x_s = x_t. Expressed in words, the algorithm 'bonds' neighbouring like-valued X pixels with probability 1 − exp(−β), and clusters consist of these bonded pixels. In generating new X′ values from (6), the clusters are required to stay single-valued; however, they are conditionally independent of one another, and exact simulation from (6) is possible. This capacity to update groups of pixels simultaneously is particularly interesting when the value of β is high, since strong interaction terms lead to slow changes in the classification when updating a single site at a time.
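One sweep of this bond-and-update scheme for model (4) might be sketched as follows. This is an illustrative Python sketch, not the paper's code; the list-of-lists grid, the union-find clustering and the two-label likelihood draw are implementation choices.

```python
import math
import random

def sw_sweep(x, y, beta, mu, sigma2, rng):
    """One Swendsen-Wang sweep for the noisy Ising posterior (4):
    bond like-valued first-order neighbours with probability 1 - exp(-beta),
    then redraw one common label per cluster from the likelihood terms."""
    n, m = len(x), len(x[0])
    parent = list(range(n * m))

    def find(i):                        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    p_bond = 1.0 - math.exp(-beta)
    for i in range(n):
        for j in range(m):
            for i2, j2 in ((i, j + 1), (i + 1, j)):
                if i2 < n and j2 < m and x[i][j] == x[i2][j2] \
                        and rng.random() < p_bond:
                    parent[find(i * m + j)] = find(i2 * m + j2)

    clusters = {}
    for i in range(n):
        for j in range(m):
            clusters.setdefault(find(i * m + j), []).append((i, j))

    for sites in clusters.values():     # clusters are conditionally independent
        logw = [sum(-(y[i][j] - mu[label]) ** 2 / (2.0 * sigma2)
                    for i, j in sites) for label in (0, 1)]
        d = max(-50.0, min(50.0, logw[0] - logw[1]))   # clamp before exp
        new = 1 if rng.random() < 1.0 / (1.0 + math.exp(d)) else 0
        for i, j in sites:
            x[i][j] = new
    return x
```

Each cluster's new label is drawn from the product of its pixel-wise preference terms only, which is exactly the behaviour criticized below: the clustering itself ignores the data.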
Unfortunately, studies of auxiliary variable samplers for model (4) by Gray (1994) and Hurn and Jennison (1993) conclude that the algorithm's performance is disappointing, and that approaches in which a single pixel is updated at a time, for example the usual application of the Gibbs sampler, can be more effective, particularly for high β values. One explanation, corroborated by monitoring successive realizations, is that at the U transition stage, pixels are clustered according to the interaction terms: when β is high, clusters will tend to be large. These large clusters are then updated according to the remaining non-interaction model terms, the terms which express a pixel's preference, when seen in isolation, for a particular value. No account is taken in the clustering of whether a simultaneous update of these particular pixels to a common value may be advantageous. This combination leads to slow convergence and highly correlated samples. Two adaptive strategies are now considered which attempt to overcome this problem.
2.3. Partial decoupling
Higdon (1993) suggests a modification to the algorithm for the noisy Ising model which is intended to discourage pixels with large differences in data values from bonding. Making the straightforward generalization of Higdon's suggestion, a positive constant δ_k is introduced for each interaction, and

  U_k | x ~ U(0, (b_k(x))^{1/δ_k}),      (7)

so that

  π(x′|u) ∝ ∏_{s=1}^{S} a_s(x′_s) ∏_k (b_k(x′))^{1−1/δ_k} I((b_k(x′))^{1/δ_k} ≥ u_k).      (8)
For the noisy Ising model, this modification means that neighbouring like-valued pixels s and t are bonded with probability 1 − exp(−β/δ_{⟨s,t⟩}); Higdon suggests using the values δ_{⟨s,t⟩} = 1 + |y_s − y_t|. The resulting clusters are still required to be single-valued but are no longer conditionally independent, retaining an interaction exp(−β(1 − 1/δ_{⟨s,t⟩}) I(x′_s ≠ x′_t)) in (8) for each ⟨s,t⟩ between clusters. This means that it is not possible to sample directly from (8) as it was from (6) previously; however, an MCMC step such as a Gibbs sampler can be used to sample indirectly, generating a new x′ value for each cluster in turn.
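The two ingredients of partial decoupling can be made concrete in a short sketch (an assumed illustration, using δ for Higdon's tempering constants as above):

```python
import math

def bond_prob(beta, ys, yt):
    """Partial-decoupling bond probability for like-valued neighbours s, t:
    1 - exp(-beta / delta) with Higdon's choice delta = 1 + |y_s - y_t|,
    so pixels with very different data values rarely bond."""
    delta = 1.0 + abs(ys - yt)
    return 1.0 - math.exp(-beta / delta)

def between_cluster_factor(beta, ys, yt, xs_new, xt_new):
    """Interaction exp(-beta (1 - 1/delta) I(x'_s != x'_t)) retained in (8)
    between clusters; it equals 1 unless the proposed labels differ."""
    delta = 1.0 + abs(ys - yt)
    return math.exp(-beta * (1.0 - 1.0 / delta)) if xs_new != xt_new else 1.0
```

When y_s = y_t this reduces to the full Swendsen–Wang bond probability 1 − exp(−β), and the retained between-cluster factor disappears (δ = 1).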
2.4. Multigrid implementation
The combination of multigrid methods with auxiliary variables has also been suggested (Besag and Green, 1993; Han, 1993; Kandel et al., 1989), although there is more than one way to achieve this. If, as suspected, one of the reasons for poor performance is the uncontrolled growth of clusters, then one way to limit cluster size would be to divide the lattice into four smaller grids, with clusters not permitted to grow beyond these spatial boundaries. These four grids can then themselves be subdivided, and so on, the variation in grid dimensions being the spatial multigrid component. For example, on a 16 × 16 grid, five levels of blocking are possible; if the ith level of block size, where the grids are 2^i × 2^i pixels in dimension, is denoted as level i, then a 'W'-cycle involves an update of X at each level in the cycle (0, 1, 2, 3, 4, 3, 4, 3, 2, 3, 4, 3, 4, 3, 2, 1) before repeating. This is the approach taken by Han (1993).
The spatial blocking scheme does not adapt to the data, in that the same boundary pattern would be used whatever the actual data values recorded on the lattice. In the somewhat different context of optimization using simulated annealing, Hurn and Jennison (1995) showed that adaptive behaviour, generating block updating schemes according to the data, improved the performance. Hurn (1994) shows that the spatial blocking scheme is just one of the possibilities for limiting cluster size by only selectively 'killing' interactions. The interactions {k} can be partitioned into two disjoint sets, K and R: K is the set of interactions which may be killed, and R is the set of interactions which will be retained. By only introducing components of the auxiliary variable U for those k ∈ K, the set R in effect defines boundaries across which bonds cannot form. Define

  U_k | x ~ U(0, b_k(x))  for k ∈ K      (9)

as before. Interaction terms b_k(x′) must now be retained in π(x′|u) across the remaining k, that is for those k in the set R, so that

  π(x′|u) ∝ ∏_{s=1}^{S} a_s(x′_s) ∏_{k∈R} b_k(x′) ∏_{k∈K} I(b_k(x′) ≥ u_k).      (10)
Clusters will not be conditionally independent if they are linked by any k ∈ R interaction, and the same comments then apply to sampling from (10) as did to sampling from (8). The modified algorithm satisfies the necessary conditions for convergence provided the Markov nature is not destroyed by the choice of which auxiliary variables to include. This partition of {k} into K and R is important as it determines how the clustering will be controlled. One possibility is to choose R spatially as already described, dividing the grid into successively smaller grids. Another option is to retain interactions across which there is a large difference in pixel preference; for the noisy Ising model, R could be {⟨s,t⟩} for which |y_s − y_t| exceeds some specified threshold (the 'multigrid' interpretation here would come from varying the threshold). The former of these allows maximum cluster size to be strictly controlled, but does not adapt to the data. The latter should discourage pixels with widely different preferences from bonding directly, although since closed boundaries are not necessarily produced, such pixels could still find themselves in the same large cluster.
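A data-driven K/R partition of this kind might be constructed as follows (an illustrative sketch; the list-of-lists grid representation and the edge encoding are assumptions):

```python
def partition_interactions(y, threshold):
    """Split first-order interactions <s,t> into K (may be killed, i.e.
    given an auxiliary variable) and R (always retained), retaining <s,t>
    whenever the data difference |y_s - y_t| exceeds the threshold."""
    n, m = len(y), len(y[0])
    K, R = [], []
    for i in range(n):
        for j in range(m):
            for i2, j2 in ((i, j + 1), (i + 1, j)):
                if i2 < n and j2 < m:
                    edge = ((i, j), (i2, j2))
                    if abs(y[i][j] - y[i2][j2]) > threshold:
                        R.append(edge)   # large data difference: retain
                    else:
                        K.append(edge)   # similar data: may be killed
    return K, R
```

Lowering the threshold moves interactions from K to R and so shrinks the maximum possible cluster, which is the 'multigrid' knob mentioned above.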
3. An archaeological application exhibiting multimodality
3.1. The estimation problem
Higdon (1993) suggests the partial decoupling approach for a modification of the small archaeological data set used by Besag et al. (1991) and Gray (1994). The original data comprise 16 × 16 measurements of log phosphate levels taken from equally spaced soil samples. The objective of the analysis was to estimate whether there had been previous human activity, indicated by raised phosphate levels, at each of the sites. A Bayesian image analysis approach was used, specifying an eight-neighbour Ising-type prior for the presence/absence classification X and assuming pixel-wise independent additive Gaussian noise. The classification at each site was then taken to be the most likely under the posterior given everything else, an estimation procedure which requires MCMC sampling. To create a particularly difficult sampling problem, Higdon constructed an artificial, bimodal, posterior surface based on this example. The posterior to be sampled is

  π(x|y) ∝ ∏_{s=1}^{S} exp(−(2σ²)^{−1}(y_s − μ_{x_s})²) ∏_{⟨s,t⟩} exp(−β_{⟨s,t⟩} I(x_s ≠ x_t)),      (11)

where the parameters are taken to be σ² = 1, μ_presence = 2, μ_absence = 1, β_{⟨s,t⟩} = 0.768 for first order neighbours, and β_{⟨s,t⟩} = 0.476 for second order neighbours.
3.2. Results
The sampling problem is sufficiently hard that a non-standard measure of performance is used, the mean mode swapping time (the number of updates of the grid taken to move from one mode to the other and then back, averaged over 2000 of these two-way swaps). Comparisons are made between the Gibbs sampler, the auxiliary variable method, Higdon's partially decoupled sampler, and two multigrids. The first of these is the spatial multigrid, dividing the grid into four, sixteen, and so on. The second multigrid is an adaptive version, allowing bonding only over those ⟨s,t⟩ with the lowest values of |y_s − y_t|; in order to have a comparable multigrid structure, the level denoted i retains a proportion 2^{−i} of the interactions in the partition set R. The notation '04', for example, then denotes a 'W'-cycle beginning at level 0 and with lowest level 4.
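The swap-time measure can be computed from a per-sweep record of which mode the chain occupies. The following helper is one plausible implementation (the mode-labelling of sweeps and the overlapping-trip convention are assumptions, not details given in the paper):

```python
def mean_mode_swap_time(modes):
    """Mean number of sweeps to move from one mode to the other and back,
    given the mode label (0 or 1) of the chain after each sweep."""
    switches = [t for t in range(1, len(modes)) if modes[t] != modes[t - 1]]
    # having entered a mode at switches[k], the chain re-enters the same
    # mode at switches[k + 2]; average these two-way swap durations
    trips = [switches[k + 2] - switches[k] for k in range(len(switches) - 2)]
    return sum(trips) / len(trips) if trips else float('inf')
```

A chain that never completes a round trip is assigned an infinite swap time, which matches the intuition that it is failing to mix between the modes.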
Results are given in Table 1. Clearly the Gibbs sampler and the auxiliary variable method both find it difficult to move between modes, and Higdon's partially decoupled method is a great improvement. In terms of the two multigrids, it seems that any limiting of the cluster size encourages mode-swapping, although the extent varies considerably. In particular, the level 33 spatial multigrid achieves the lowest mean mode swap time. It might be thought surprising that the non-adaptive multigrid should outperform an adaptive version for certain cycle patterns. However, the first mode of (11), containing about 70% of the probability, is a nearly all-absence classification. The second mode differs from this in that it contains one main region of pixels classified as presence. This group of pixels lies largely in one corner of the grid, and so a restriction of the clusters to within four quarter-grid sectors facilitates the addition and deletion of the feature. Further investigations revealed that it is important to find an appropriate form of cluster limiting, with the improved performance coming from this rather than from the multigrid variation.
4. An image analysis example with strong interactions
4.1. Confocal fluorescence microscopy
Confocal fluorescence microscopy is a technique used to image three-dimensional volumes without the need for physical sectioning. Images are built up pixel by pixel by scanning the focal point of a laser beam systematically through a specimen stained with fluorescent dye. The improvements over standard microscopy are achieved by arranging the lenses and pinholes to capture only that fluorescence emanating from the focal point of the laser. In
Table 1. Mean mode swapping times for the amended archaeological problem

                 Spatial      |y_s − y_t|    Gibbs      Higdon's    Auxiliary
                 multigrid    multigrid      sampler    sampler     method
                                             2265       256         2160
  "W"-cycle
  01             2046         2110
  02             1643         1113
  03              584          709
  04              475          839
  11             1993         2044
  12             1479          880
  13              508          653
  22             1280          552
  23              374          594
  24              395          768
  33              233          571
  34              398          892
theory, out-of-focus information is eliminated. In practice, as the focal point of the microscope moves deeper into the specimen, the images become more degraded. Difficult estimation problems arise because the scattering and attenuation degradation are due in part to the object being imaged. See Wilson (1990), for example, for more background and details.
4.2. A classification model
Figure 1(a) shows the fifth of 32 progressively deeper optical sections of the stomatal guard cells of a plant leaf (see Fricker and White, 1992, for technical details). These data form part of a sequence of images observing the opening and closing responses of the cells. Questions of biological interest are, for example, the volume of the cells, their surface area, and generally how their shape changes during the response sequence. To address these questions, the images are currently segmented into a binary cell/background classification slice by slice using simple thresholding followed by manual editing. Figures 1(b)–(d) show three thresholded classifications of the image, using thresholds of 65, 75, and 85 respectively (pixels are classified as cell if they have a recorded fluorescence value higher than the threshold value, otherwise they are classified as background).
As an alternative to thresholding, a low-level imaging approach is adopted with the pixel-based classification {X_s} being either cell (1) or background (0). The simplifying model to be assumed is that cell and background pixels each have a typical fluorescence level, μ_1 and μ_0 respectively. Again these typical levels are assumed to be recorded subject to additive, pixel-wise independent Gaussian noise with variance σ², giving rise to the data Y:

  π(y | x, μ_0, μ_1, σ²) = ∏_{s=1}^{S} exp(−(2σ²)^{−1}(y_s − μ_{x_s})²) / √(2πσ²).      (12)
As is common in binary image classification problems, the prior distribution on X is taken to be another Ising-type model with parameter β,

  π(x|β) = Z(β)^{−1} exp(−β Σ_{⟨s,t⟩} I(x_s ≠ x_t)),      (13)

where ⟨s,t⟩ here indicates first order nearest neighbours, and the β dependence of the distribution's normalizing constant Z(β) is explicitly noted.
Given a provisional estimate of the classification, the parameters β, μ_0, μ_1 and σ² could be estimated via maximum likelihood or pseudo-likelihood methods. However, the quality of such estimates will depend upon the accuracy of the provisional classification, and in turn this will affect the final estimated classification. Alternatively, it is possible to specify vague priors for the model parameters, allowing them to vary along with the realizations of X. See Heikkinen and Högmander (1994) for arguments in favour of this approach. The following priors have been chosen (note that the fluorescence is measured in the range [0, 255], and that the chosen range of β covers a wide range of behaviour of the Ising model):

  β ~ U(0, 2)      (14)

  1/σ² ~ Γ(1/2, 1/2)      (15)

  (μ_0, μ_1) ~ U([0, 255] × [0, 255])      (16)
Using (12)–(16), the posterior distribution of X and the remaining parameters conditioned on the observed data can be written up to proportionality as

  π(x, β, σ², μ_0, μ_1 | y) ∝ π(y | x, σ², μ_0, μ_1) π(x|β) π(μ_0, μ_1) π(σ²).      (17)
4.3. Simulation from the model
The aim is to draw (x, β, σ², μ_0, μ_1) realizations from (17) as efficiently as possible. Complexity usually means that the sampling is done one parameter or one pixel at a time, requiring the full conditional posterior distributions:

  1/σ² | ... ~ Γ((S + 1)/2, (Σ_{s=1}^{S} (y_s − μ_{x_s})² + 1)/2),      (18)
Fig. 1. (a) Confocal microscopy data, slice 5; (b) Classification thresholding at 65; (c) Classification thresholding at 75; (d) Classification thresholding at 85
Fig. 2. Realizations of five monitored quantities for the first 250 sweeps overlaid by the ergodic averages formed from sweeps 1000 to 5000. GS = Gibbs sampler (solid line), S&W = auxiliary variable method (broken line)
  μ_i | ... ∝ N((1/n_i) Σ_{s: x_s = i} y_s, σ²/n_i), μ_i ∈ [0, 255], i = 0, 1,      (19)

  π(β | ...) ∝ Z(β)^{−1} exp(−β Σ_{⟨s,t⟩} I(x_s ≠ x_t)), β ∈ [0, 2],      (20)

  π(x_s | ...) ∝ exp(−β Σ_{t∈∂s} I(x_s ≠ x_t) − (2σ²)^{−1}(y_s − μ_{x_s})²),      (21)
where the notation ... denotes 'all other variables', n_i in (19) is the number of pixels such that x_s = i, and ∂s in (21) indicates those pixels which are nearest neighbours of s. Realizations from (18) and (19) can be drawn efficiently using the Gibbs sampler with standard distributional methods (Ripley, 1987). Sampling the Ising parameter β from (20) requires knowledge of the analytically intractable normalizing constant Z(β). For this reason, work in image analysis has generally treated β as a constant, estimating it once initially, or sometimes periodically as restoration progresses. However, Geyer (1994) proposes the method of reverse logistic regression for off-line estimation of normalizing constants, at least up to proportionality. This method is used here; the distribution (20) is then known up to proportionality, and a Metropolis–Hastings algorithm can be used.
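The conjugate draws (18) and (19) are straightforward. A sketch (an assumed implementation using Python's standard library, with rejection sampling for the [0, 255] truncation in (19); it assumes both classes are present in x):

```python
import random

def draw_sigma2(x, y, mu, rng):
    """Draw sigma^2 via (18): 1/sigma^2 ~ Gamma(shape (S+1)/2,
    rate (sum_s (y_s - mu_{x_s})^2 + 1)/2); gammavariate takes a scale."""
    ss = sum((ys - mu[xs]) ** 2 for xs, ys in zip(x, y))
    S = len(x)
    precision = rng.gammavariate((S + 1) / 2.0, 2.0 / (ss + 1.0))
    return 1.0 / precision

def draw_mu(x, y, i, sigma2, rng):
    """Draw mu_i via (19): N(mean of y over {s: x_s = i}, sigma^2/n_i),
    truncated to [0, 255] by simple rejection."""
    data = [ys for xs, ys in zip(x, y) if xs == i]
    n_i = len(data)
    mean = sum(data) / n_i
    while True:
        draw = rng.gauss(mean, (sigma2 / n_i) ** 0.5)
        if 0.0 <= draw <= 255.0:
            return draw
```

Rejection is adequate here because σ²/n_i is small relative to the [0, 255] range in this application, so truncation rarely bites.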
The main aim of this section is to investigate the sampling of X. It is clear from looking at the data that the interaction parameter β will be high: the scene is comprised of large areas of the same classification. Although the data will tend to counteract this effect if the noise levels are low enough, a general consequence is that single-site updating algorithms find it hard to alter boundary positions and make other such changes. The question is whether some form of the auxiliary variable method would be an improvement by way of permitting groups of pixels to change simultaneously.
4.4. Results
Two issues in MCMC performance are how quickly the realizations settle to behave as though they were from the target distribution, and the accuracy of the ergodic average estimators. As a simple diagnostic for convergence, or rather lack of it, it is common practice to plot realizations of several scalar-valued functionals against iteration number. To compare the Gibbs sampler and the auxiliary variable method, Fig. 2 gives plots of this type for the observed values of β, μ_0, μ_1, σ² and the number of cell pixels; the last of these is closely related to the biological questions of interest, and is known to be a good indicator of sampling difficulties (Gray, 1994). Each plot shows the first 250 realizations from a total of 5000, overlaid by the ergodic averages calculated from the final 4000 sweeps. Stability of μ_0, μ_1, σ² and the number of cell pixels appears to be attained relatively rapidly, within perhaps the first 25 iterations. The convergence of β is slightly less impressive, requiring closer to 150 iterations. Judging by all five monitored quantities, the auxiliary variable sampler appears to provide slower convergence.
The efficiency of estimation is usually measured by the integrated autocorrelation time τ, defined as follows: as the number of iterations at equilibrium tends to infinity, the asymptotic variance of the ergodic average of some functional f(·) tends to τ times the variance of f(·) which would be obtained under independent sampling from the stationary distribution. Sokal's method (Sokal, 1989) is used to estimate τ, taking the number of cell pixels as the functional f(·). Using the Gibbs sampler, the median τ̂ over 20 repetitions of the simulation is 7.7. However, using the auxiliary variable method the median τ̂ is 73.0. In other words, roughly 10 times more sweeps of the auxiliary variable method than of the Gibbs sampler are needed to achieve the same accuracy.
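For reference, Sokal's adaptive-window estimator of τ can be sketched as follows (an assumed implementation with window constant c = 6; not the paper's code):

```python
import numpy as np

def integrated_autocorr_time(f, c=6.0):
    """Estimate tau = 1 + 2 * sum_t rho(t) using Sokal's adaptive window:
    truncate the sum at the smallest M with M >= c * tau_hat(M)."""
    f = np.asarray(f, dtype=float) - np.mean(f)
    n = len(f)
    # empirical autocorrelations rho(0), rho(1), ..., rho(n-1)
    acf = np.correlate(f, f, mode='full')[n - 1:]
    acf = acf / (np.arange(n, 0, -1) * f.var())
    tau = 1.0
    for M in range(1, n // 2):
        tau = 1.0 + 2.0 * acf[1:M + 1].sum()
        if M >= c * tau:
            break
    return tau
```

For independent samples τ ≈ 1, and the ratio of two samplers' τ̂ values gives the relative number of sweeps needed for the same accuracy, as in the comparison above.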
Can the performance of the Gibbs sampler be improved
upon by modifying the auxiliary variable clustering? A
spatial multigrid may not be the most appropriate form
Fig. 3. Non-bonding boundaries: (a) Level 1; (b) Level 2; (c) Level 3; (d) Level 4; (e) Level 5; (f) Level 6
to use here; unrestricted cluster growth would still seem reasonable in the regions which are clearly comprised of just a single classification, for example most of the background. One way to accommodate this would be to prevent bonding across the class boundaries formed when the data are thresholded; Fig. 3(f) shows these boundaries when the threshold level is set at 85. It seems unlikely, however, that a single fairly arbitrary thresholding will be able to influence the clustering sufficiently. Figures 3(a)–(e) show the boundaries which result from considering two threshold levels simultaneously. Interactions ⟨s,t⟩ will be retained in the set R, and thus the corresponding exp(−β I(x′_s ≠ x′_t)) will be in π(x′ | u, ...), either if they lie across one of the threshold boundaries, or if they involve a pixel assigned different classifications under the different threshold levels. In this way, 'undecided' pixels should be excluded from large clusters. The number of such pixels depends on the separation of the two threshold levels: Fig. 3(e) uses levels 80 and 90, (d) levels 75 and 95, (c) levels 70 and 100, (b) levels 65 and 105, and (a) levels 60 and 110. Box-plots of the corresponding τ̂ are given in Fig. 4(a). Notice that even the single threshold set of restrictions considerably reduces τ from the previous median value of 73. However, no single one of these R ∪ K partitions of the {⟨s,t⟩} displays an outstanding performance.
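The two-threshold rule for choosing the retained set R can be made concrete (a sketch under assumed conventions: 'cell' means fluorescence above the threshold, and the grid is a list of lists):

```python
def retained_edges_two_thresholds(y, lo, hi):
    """Retain <s,t> (no bonding allowed) if it crosses the class boundary
    at either threshold, or if either endpoint is 'undecided', i.e. is
    classified differently by the two thresholds lo < hi."""
    n, m = len(y), len(y[0])

    def cls(v, t):                 # 1 = cell, 0 = background
        return 1 if v > t else 0

    R = []
    for i in range(n):
        for j in range(m):
            for i2, j2 in ((i, j + 1), (i + 1, j)):
                if not (i2 < n and j2 < m):
                    continue
                a, b = y[i][j], y[i2][j2]
                crosses = cls(a, lo) != cls(b, lo) or cls(a, hi) != cls(b, hi)
                undecided = cls(a, lo) != cls(a, hi) or cls(b, lo) != cls(b, hi)
                if crosses or undecided:
                    R.append(((i, j), (i2, j2)))
    return R
```

Widening the gap between lo and hi marks more pixels as undecided and so retains more interactions, which is the multigrid-style knob described in the text.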
What about a multigrid effect? Figure 4(b) gives a second set of τ̂ box-plots, this time comparing the Gibbs sampler, Higdon's method, and the threshold-based adaptive samplers now incorporated into a multigrid pattern. Again the multigrid variation appears more to ameliorate the effect of a bad choice of cluster boundary than actually to improve performance. The fact that the performances of the Gibbs sampler and the modified auxiliary variable method are similar is maybe not surprising. Figure 5 shows four widely spaced realizations of X from the model, together with the final marginal posterior mode estimate (at each pixel s, the classification is chosen to maximize the conditional distribution of X_s given everything else). Relatively few pixels alter between realizations, and most of those that do are in the set identified to be updated separately by the thresholding approach. In the multigrid, MCMC updates of these single-pixel clusters in effect attempt to draw from the conditional distribution (21). If the Gibbs sampler's problems come from updating these pixels one by one, then unfortunately so too will those of the auxiliary variable method. Modifying the auxiliary variable sampler may have helped identify where and how sampling problems are occurring in this application,
Fig. 4. Box-plots of the estimated integrated autocorrelation times: (a) Samplers not including any multigrid pattern; (b) Higdon's method, the Gibbs sampler and multigrid samplers. (GS, Gibbs sampler; H, Higdon's partially decoupled method)
Fig. 5. (a) Realization 1000; (b) Realization 2000; (c) Realization 3000; (d) Realization 4000; (e) MPM estimate
but it has not addressed the question of overcoming
them.
5. Discussion
Conceptually, updating more than one site at a time in MCMC techniques should lead to considerable benefits in terms of improved movement around the sample space. The good performance of the Swendsen and Wang algorithm for multiple-site sampling from the Ising model has led to much interest in more general auxiliary variable methods for tackling other awkward distributions. In this paper, the use of an auxiliary variable sampler has been considered in two types of problem which MCMC methods are known to find difficult. In the first application, the distribution of interest exhibits severe multimodality, whereas in the second application, there are strong interactions between the variable components. In both cases, the auxiliary variable method encountered quite serious problems linked to the unrestricted growth of the clusters to be updated. To improve sampling it is not sufficient to update any group of pixels simultaneously; it seems that the choice of the group must lead to a constructive step around the state space.
There are at least two ways in which cluster growth can be restricted in the auxiliary variable sampler. The first is to use Higdon's partial decoupling approach; the second is to group the sites using some spatial or data-driven criterion and then use the results to guide the cluster boundaries. Both of these methods have been investigated, and both have been shown to lead to more effective samplers. However, whether either of these modified algorithms then becomes more effective than the standard single-site methods rather depends on the application. Although only two applications have been considered, certain observations seem reasonable. The first is that an adaptive auxiliary variable method is mainly useful in problems where modes of the target distribution correspond to configurations containing different `objects' or large-scale features, as was the case in the first example considered. Single-site samplers have difficulty moving between these modes because the intermediate configurations may have very low probability. Multiple-site updating, on the other hand, does allow objects to be created or removed quickly, but only when roughly the right sites are considered simultaneously. In such cases, any knowledge of approximately where these changes take place can best be used in an adaptive segmentation; otherwise the partially decoupled approach is very effective without requiring such information. Results were rather more disappointing in the second example, which exhibited strong component interactions. Here the auxiliary variable samplers were able to isolate where the sampling problems were occurring, but they were not then able to improve performance, at least in the modified forms considered.
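The first of these restriction strategies, partial decoupling, can be sketched for the Ising case. This is a hedged reconstruction of the idea, not Higdon's code: only a fraction delta of the coupling beta is used when opening bonds, so clusters stay smaller, and the residual coupling (1 - delta)*beta is accounted for by a Metropolis accept/reject step on each proposed cluster flip. The helper `partial_decoupling_step` and its details are illustrative assumptions.

```python
import numpy as np

def partial_decoupling_step(x, beta, delta, rng):
    """One partially decoupled cluster update for Ising spins in {-1, +1}.
    Bonds open with the reduced probability 1 - exp(-2*delta*beta); the
    residual coupling (1 - delta)*beta enters a Metropolis acceptance
    ratio for each proposed cluster flip. delta = 1 recovers the
    ordinary Swendsen-Wang move."""
    n, m = x.shape
    parent = np.arange(n * m)          # union-find over lattice sites

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    p_bond = 1.0 - np.exp(-2.0 * delta * beta)
    for i in range(n):
        for j in range(m):
            s = i * m + j
            if j + 1 < m and x[i, j] == x[i, j + 1] and rng.random() < p_bond:
                parent[find(s)] = find(s + 1)
            if i + 1 < n and x[i, j] == x[i + 1, j] and rng.random() < p_bond:
                parent[find(s)] = find(s + m)

    roots = np.array([find(s) for s in range(n * m)]).reshape(n, m)
    x = x.copy()
    for r in np.unique(roots):
        if rng.random() >= 0.5:        # propose flipping this cluster w.p. 1/2
            continue
        mask = roots == r
        # sum of residual interactions across the cluster boundary
        boundary = 0.0
        for i, j in np.argwhere(mask):
            for di, dj in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < m and not mask[ii, jj]:
                    boundary += x[i, j] * x[ii, jj]
        # flipping the cluster changes the residual log-density by
        # -2 * (1 - delta) * beta * boundary
        log_alpha = min(0.0, -2.0 * (1.0 - delta) * beta * boundary)
        if rng.random() < np.exp(log_alpha):
            x[mask] = -x[mask]
    return x
```

Interior bonds are unaffected by a cluster flip, so only the boundary terms enter the acceptance ratio; smaller delta gives smaller clusters but more rejections, which is the trade-off discussed above.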
The conclusions from this study are not clear-cut. Certainly there are situations in which some form of auxiliary variable sampler can significantly outperform single-site samplers. However, in order to do this, the auxiliary variable sampler may require problem-by-problem modifications. Insight into what makes the problem hard to tackle using a single-site sampler will be important in adapting an auxiliary variable alternative.
Acknowledgements
Thanks to the Plant Sciences Department at Oxford University for explaining the microscopy problem and providing data. Thanks also to Peter Green and Chris Jennison for helpful discussions. A Nuffield Foundation award is gratefully acknowledged.
References
Besag, J. and Green, P. (1993) Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, B55, 25–38.
Besag, J., York, J. and Mollié, A. (1991) Bayesian image analysis, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–59.
Brooks, S. and Roberts, G. (1996) Diagnosing convergence of Markov chain Monte Carlo algorithms. Technical Report, Statistical Laboratory, University of Cambridge.
Cowles, M. and Carlin, B. (1996) Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91, 883–904.
Fricker, M. and White, N. (1992) Application of confocal microscopy and three-dimensional image analysis to plant and microbial cells. Binary, 4, 44–9.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–41.
Geyer, C. (1994) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report, University of Minnesota.
Gilks, W., Clayton, D., Spiegelhalter, D., Best, N., McNeil, A., Sharples, L. and Kirby, A. (1993) Modelling complexity: applications of Gibbs sampling in medicine. Journal of the Royal Statistical Society, B55, 39–52.
Gray, A. (1994) Simulating posterior Gibbs distributions: a comparison of the Swendsen–Wang and Gibbs sampler methods. Statistics and Computing, 4, 189–201.
Green, P. and Han, X. (1991) Metropolis methods, Gaussian proposals and antithetic variables. In: Stochastic Models, Statistical Methods and Algorithms in Image Analysis (P. Barone, A. Frigessi and M. Piccioni, eds). Springer-Verlag, Berlin.
Han, X. (1993) Markov chain Monte Carlo and sampling efficiency. PhD Thesis, University of Bristol.
Hastings, W. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Heikkinen, J. and Högmander, H. (1994) Fully Bayesian approach to image restoration with an application in biogeography. Applied Statistics, 43, 569–82.
Higdon, D. (1993) Contribution to Discussion of Besag and Green. Journal of the Royal Statistical Society, B55, 78.
Hurn, M. (1994) An adaptive Swendsen and Wang type algorithm. Technical Report 94:03, University of Bath.
Hurn, M. and Jennison, C. (1993) Multiple-site updates in maximum a posteriori and marginal posterior modes image estimation. In Statistics and Images, Volume 1, 155–86 (K. Mardia and G. Kanji, eds). Carfax, Oxford.
Hurn, M. and Jennison, C. (1995) A study of simulated annealing and a revised cascade algorithm for image reconstruction. Statistics and Computing, 5, 175–90.
Kandel, D., Domany, E. and Brandt, A. (1989) Simulations without critical slowing down: Ising and three-state Potts models. Physical Review, B40, 330–44.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953) Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–92.
Møller, J. (1993) Contribution to discussion of Besag and Green. Journal of the Royal Statistical Society, B55, 84.
Ripley, B. (1987) Stochastic Simulation. Wiley, New York.
Smith, A. and Roberts, G. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, B55, 3–24.
Sokal, A. (1989) Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms. Cours de Troisième Cycle de la Physique en Suisse Romande, Lausanne.
Swendsen, R. and Wang, J. (1987) Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58, 86–8.
Wilson, T. (1990) Confocal Microscopy. Academic Press, London.