
0960-3174 © 1997 Chapman & Hall

Difficulties in the use of auxiliary variables in Markov chain Monte Carlo methods

MERRILEE HURN

School of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK

Received January 1996 and accepted July 1996

Markov chain Monte Carlo (MCMC) methods are now widely used in a diverse range of application areas to tackle previously intractable problems. Difficult questions remain, however, in designing MCMC samplers for problems exhibiting severe multimodality where standard methods may exhibit prohibitively slow movement around the state space. Auxiliary variable methods, sometimes together with multigrid ideas, have been proposed as one possible way forward. Initial disappointing experiments have led to data-driven modifications of the methods. In this paper, these suggestions are investigated for lattice data such as is found in imaging and some spatial applications. The results suggest that adapting the auxiliary variables to the specific application is beneficial. However, the form of adaptation needed and the extent of the resulting benefits are not always clear-cut.

Keywords: Adaptive methods, auxiliary variables, Bayesian image analysis, confocal fluorescence microscopy, Markov chain Monte Carlo, multigrid, multimodality

1. Introduction

Suppose that it is necessary to estimate some functional of a random variable X which has distribution π(x). When exact answers cannot be found, simulation from π(x) is often used instead. But what if π(x) is sufficiently complex that direct simulation is out of the question? Markov chain Monte Carlo (MCMC) methods developed in response to this problem (Metropolis et al., 1953; Hastings, 1970). A Markov chain of X values is generated on the same state space, where by design this iterative sequence converges weakly to the desired target distribution π(x). In the cases considered here, X will be defined for a discrete state space. Conditions on the transition function of the chain P(x → x′) for the required convergence are that it is irreducible, aperiodic, and has π(x) as its equilibrium distribution. The Gibbs sampler (Geman and Geman, 1984) and the Metropolis–Hastings algorithm are the two most commonly used variants of this idea. See the papers by Besag and Green (1993), Smith and Roberts (1993) and Gilks et al. (1993) for a thorough discussion of MCMC theory and application.

Two important questions which arise on a problem-by-problem basis are the rate of convergence to the target distribution, and the accuracy of ergodic average estimation from the chain of correlated samples. Green and Han (1991) detail how these relate to the transition function P(x → x′). Summaries of theoretical results in the assessment of convergence, and extensive reviews of the various output diagnostics used in practice, can be found in Cowles and Carlin (1996) and Brooks and Roberts (1996). Some distributions are intrinsically harder to sample than others. In particular, as might be expected, distributions exhibiting a high degree of multimodality or strong correlations between components of X are particularly awkward. It is for such situations that auxiliary variable samplers have been proposed. In Section 2, the auxiliary variable method is described, together with two modifications which adapt the algorithm to the particular application. These methods are applied to an archaeological data set exhibiting multimodality in Section 3, and to an image analysis problem in Section 4. Suggestions as to the type of problem where these adaptive approaches can be useful are given in Section 5.

2. Auxiliary variable methods

2.1. The general case

Statistics and Computing (1997) 7, 35–44

The use of auxiliary variables in MCMC methods originates in the statistical physics literature. Swendsen and Wang (1987) give a two-stage algorithm for simulating from a supercritical Ising model for which long-range correlations exist on the lattice. A model of this type presents great difficulty to samplers which update one site at a time because the high interaction penalties lead to slow convergence and highly correlated samples. The Swendsen and Wang algorithm's good performance comes from its ability to 'decouple' some of the sites, leaving conditionally independent single-valued groups of sites for updating. Statistical interpretation of the generalized algorithm is given by Besag and Green (1993) in terms of auxiliary variables: a new variable U, the auxiliary variable, is introduced by defining its conditional distribution given X, π(u|x). A Markov chain is then constructed alternating between transitions on U, by drawing from π(u|x), and transitions on X. To ensure that this two-step procedure has the desired target distribution π(x) as its stationary distribution, the transition function P(x → x′ | u) satisfies detailed balance with respect to the conditional π(x|u). The simplest choice for this is P(x → x′ | u) = π(x′|u) (although see Møller, 1993, for one alternative).

To see how this construction might help, assume that the distribution of interest is strictly positive for X in the sample space, and that it can be written in the form

  π(x) ∝ ∏_{s=1}^S a_s(x_s) ∏_k b_k(x),   (1)

where s indexes the components of X, a_s(x_s) represents the part of the model not due to component interaction, and k indexes the interaction terms {b_k(x)}. It is the interaction terms {b_k(x)} which hinder simulation. Define one component of U for each interaction k, with these {U_k} conditionally independent given X, and each with uniform distribution

  U_k | x ~ U(0, b_k(x)).   (2)

Simulation from the distribution π(u|x) is straightforward. Then, via the joint distribution of X and U obtained from (1) and (2),

  π(x′|u) ∝ ∏_{s=1}^S a_s(x′_s) ∏_k I(b_k(x′) ≥ u_k).   (3)

Simulation from π(x′|u) would also be free of any interaction complications if it were not for the indicator terms, ∏_k I(b_k(x′) ≥ u_k). Notice though that when u_k ≤ min_x b_k(x) for any k, there is no restriction on the corresponding components of x′, with all possible values satisfying I(b_k(x′) ≥ u_k). Groups of x′ components connected by a path of interactions k for which each u_k > min_x b_k(x) are known as 'clusters'. Within clusters, the indicator conditions may be non-trivial to satisfy. However, between clusters any x′ will satisfy the indicator conditions, and thus clusters are conditionally independent given U and can be updated separately. The net result is to be able to update groups of components, the clusters, simultaneously, having killed the between-cluster interactions.

Finally, irreducibility and aperiodicity of the transition function P(x → x′) can be noted in the following way. The strict positivity of π(x) means that it is possible to move from any x to any other x′ via a realization u in which each interaction component satisfies u_k ≤ min_x b_k(x) (in effect when all the clusters each consist of a single component).

2.2. The noisy Ising model

One situation in which the auxiliary variable construction is easily implemented, and which is of considerable interest in classification problems such as arise in some imaging applications, is a noisy version of the Ising model. Here X is an underlying classification: for example, in Section 3, X is the presence or absence of previous human activity at 10 m spaced archaeological sites. Data {Y_s} are available on the same lattice, and are assumed to be the typical response of the classification at site s to the measurement process, μ_{X_s}, corrupted by pixel-wise independent additive Gaussian noise of variance σ². In the usual Bayesian framework, the prior distribution on X is taken to be a form of the Ising model with nearest neighbour spatial interactions ⟨s,t⟩ and parameter β (see Besag et al., 1991), and the distribution of interest is the posterior of the classification given the data,

  π(x|y) ∝ ∏_{s=1}^S exp(−(2σ²)⁻¹ (y_s − μ_{x_s})²) ∏_{⟨s,t⟩} exp(−β I(x_s ≠ x_t)).   (4)

The Ising prior contributes the ⟨s,t⟩ interaction terms, and the Gaussian likelihood the pixel-wise 'preference' terms. In this situation, the auxiliary variable U_{⟨s,t⟩} follows one of two uniform distributions depending on the current x:

  U_{⟨s,t⟩} | x ~ U(0, 1)            if x_s = x_t,
               ~ U(0, exp(−β))       if x_s ≠ x_t,   (5)

and then

  π(x′|u) ∝ ∏_{s=1}^S exp(−(y_s − μ_{x′_s})² / (2σ²)),
  provided exp(−β I(x′_s ≠ x′_t)) ≥ u_{⟨s,t⟩} for all ⟨s,t⟩.   (6)

When are the conditions on x′ in (6) satisfied? Here b_{⟨s,t⟩}(x) is dichotomous, with min_x b_{⟨s,t⟩}(x) = exp(−β). If u_{⟨s,t⟩} ≤ exp(−β), then any x′_s and x′_t will satisfy the ⟨s,t⟩ constraint. However, if u_{⟨s,t⟩} > exp(−β), the requirement is that x′_s equals x′_t. Values of u_{⟨s,t⟩} ≤ exp(−β) could have occurred either when x_s = x_t or when x_s ≠ x_t, but obtaining u_{⟨s,t⟩} > exp(−β) is only possible when x_s = x_t. Expressed in words, the algorithm 'bonds' neighbouring like-valued X pixels with probability 1 − exp(−β), and clusters consist of these bonded pixels. In generating new X′ values from (6), the clusters are required to stay single-valued; however, they are conditionally independent of one another, and exact simulation from (6) is possible. This capacity to update groups of pixels simultaneously is particularly interesting when the value of β is high, since strong interaction terms lead to slow changes in the classification when updating a single site at a time.
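The bond-and-update mechanism just described can be sketched in code. The following is a minimal illustrative implementation of one Swendsen–Wang sweep for the noisy Ising model (4), assuming binary labels, first-order neighbours, and known μ and σ²; the function name `sw_sweep` and its interface are assumptions of this sketch, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sw_sweep(x, y, mu, sigma2, beta, rng):
    """One Swendsen-Wang sweep for the noisy Ising model (4).

    x      : (n, m) array of current labels in {0, 1}
    y      : (n, m) data array
    mu     : (mu_0, mu_1) typical responses of the two classes
    sigma2 : noise variance
    beta   : Ising interaction parameter

    Stage 1 bonds like-valued first-order neighbours with probability
    1 - exp(-beta); stage 2 redraws each resulting cluster as a single
    value using only the pixel-wise likelihood terms, as in (6).
    """
    n, m = x.shape
    p_bond = 1.0 - np.exp(-beta)

    # Stage 1: build clusters with a union-find over the sampled bonds.
    parent = np.arange(n * m)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(m):
            for di, dj in ((1, 0), (0, 1)):
                ii, jj = i + di, j + dj
                if ii < n and jj < m and x[i, j] == x[ii, jj]:
                    if rng.random() < p_bond:
                        parent[find(i * m + j)] = find(ii * m + jj)

    # Stage 2: clusters are conditionally independent given the bonds,
    # so each can be redrawn exactly from its likelihood terms.
    roots = np.array([find(k) for k in range(n * m)])
    yf, xf = y.ravel(), x.ravel().copy()
    for r in np.unique(roots):
        members = roots == r
        loglik = [-np.sum((yf[members] - mu[c]) ** 2) / (2.0 * sigma2)
                  for c in (0, 1)]
        p1 = 1.0 / (1.0 + np.exp(np.clip(loglik[0] - loglik[1], -700, 700)))
        xf[members] = int(rng.random() < p1)
    return xf.reshape(n, m)
```

Because the clusters are conditionally independent given the bonds, the second stage is an exact draw from (6), cluster by cluster.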

Unfortunately, studies of auxiliary variable samplers for model (4) by Gray (1994) and Hurn and Jennison (1993) conclude that the algorithm's performance is disappointing and that approaches in which a single pixel is updated at a time, for example the usual application of the Gibbs sampler, can be more effective, particularly for high β values. One explanation, corroborated by monitoring successive realizations, is that at the U transition stage, pixels are clustered according to the interaction terms; when β is high, clusters will tend to be large. These large clusters are then updated according to the remaining non-interaction model terms, the terms which express a pixel's preference, when seen in isolation, for a particular value. No account is taken in the clustering of whether a simultaneous update of these particular pixels to a common value may be advantageous. This combination leads to slow convergence and highly correlated samples. Two adaptive strategies are now considered which attempt to overcome this problem.

2.3. Partial decoupling

Higdon (1993) suggests a modification to the algorithm for the noisy Ising model which is intended to discourage pixels with large differences in data values from bonding. Making the straightforward generalization of Higdon's suggestion, a positive constant δ_k is introduced for each interaction and used to temper the auxiliary variable distribution,

  U_k | x ~ U(0, (b_k(x))^{1/δ_k}),   (7)

so that

  π(x′|u) ∝ ∏_{s=1}^S a_s(x′_s) ∏_k (b_k(x′))^{1 − 1/δ_k} I((b_k(x′))^{1/δ_k} ≥ u_k).   (8)

For the noisy Ising model, this modification means that neighbouring like-valued pixels s and t are bonded with probability 1 − exp(−β/δ_{⟨s,t⟩}); Higdon suggests using the values δ_{⟨s,t⟩} = 1 + |y_s − y_t|. The resulting clusters are still required to be single-valued but are no longer conditionally independent, retaining an interaction exp(−β(1 − 1/δ_{⟨s,t⟩}) I(x′_s ≠ x′_t)) in (8) for each ⟨s,t⟩ between clusters. This means that it is not possible to sample directly from (8) as it was from (6) previously; however, an MCMC step such as a Gibbs sampler can be used to sample indirectly, generating a new x′ value for each cluster in turn.
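Under partial decoupling, the bond probability for a like-valued pair is tempered by its data discrepancy. A small sketch using Higdon's suggested δ_{⟨s,t⟩} = 1 + |y_s − y_t| follows; the helper name `bond_probability` is illustrative.

```python
import numpy as np

def bond_probability(beta, ys, yt):
    """Partial-decoupling bond probability for a like-valued pair <s,t>,
    using Higdon's tempering constant delta = 1 + |ys - yt|.  The residual
    between-cluster interaction exp(-beta (1 - 1/delta) I(xs' != xt'))
    must then be handled by an extra MCMC step, e.g. a Gibbs scan over
    the clusters."""
    delta = 1.0 + abs(ys - yt)
    return 1.0 - np.exp(-beta / delta)
```

When y_s = y_t the Swendsen–Wang probability 1 − exp(−β) is recovered; large data differences push the probability towards zero, so such pairs rarely end up bonded into the same cluster.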

2.4. Multigrid implementation

The combination of multigrid methods with auxiliary variables has also been suggested (Besag and Green, 1993; Han, 1993; Kandel et al., 1989), although there is more than one way to achieve this. If, as suspected, one of the reasons for poor performance is the uncontrolled growth of clusters, then one way to limit cluster size would be to divide the lattice into four smaller grids, with clusters not permitted to grow beyond these spatial boundaries. These four grids can then themselves be subdivided, and so on, the variation in grid dimensions being the spatial multigrid component. For example, on a 16 × 16 grid, five levels of blocking are possible; if the ith level of block size, where the grids are 2^i × 2^i pixels in dimension, is denoted as level i, then a 'W'-cycle involves an update of X at each level in the cycle (0, 1, 2, 3, 4, 3, 4, 3, 2, 3, 4, 3, 4, 3, 2, 1) before repeating. This is the approach taken by Han (1993).
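One way to generate such a level sequence is the recursive sketch below. The split parameter `double_from`, below which sub-cycles are traversed twice, is an assumption of this sketch, chosen so that the defaults reproduce the 16-entry cycle quoted above.

```python
def w_cycle(levels=5, double_from=2):
    """Level-visit sequence for one 'W'-cycle over block levels 0..levels-1.

    Levels coarser than `double_from` are visited once on the way down and
    once on the way back up; from `double_from` to the finest level each
    sub-cycle is traversed twice, giving the characteristic 'W' shape.
    The defaults reproduce the 16-entry cycle quoted for a 16 x 16 grid.
    """
    finest = levels - 1

    def sub(level):
        if level == finest:
            return [level]
        inner = sub(level + 1)
        return [level] + inner + [level] + inner + [level]

    return (list(range(double_from))                 # descend through coarse levels
            + sub(double_from)                       # doubled sub-cycles below
            + list(range(double_from - 1, 0, -1)))   # ascend, omitting level 0
```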

The spatial blocking scheme does not adapt to the data, in that the same boundary pattern would be used whatever the actual data values recorded on the lattice. In the somewhat different context of optimization using simulated annealing, Hurn and Jennison (1995) showed that adaptive behaviour, generating block updating schemes according to the data, improved the performance. Hurn (1994) shows that the spatial blocking scheme is just one of the possibilities for limiting cluster size by only selectively 'killing' interactions. The interactions {k} can be partitioned into two disjoint sets, K and R: K is the set of interactions which may be killed, and R is the set of interactions which will be retained. By only introducing components of the auxiliary variable U for those k ∈ K, the set R in effect defines boundaries across which bonds cannot form. Define

  U_k | x ~ U(0, b_k(x))   for k ∈ K,   (9)

as before. Interaction terms b_k(x′) must now be retained in π(x′|u) across the remaining k, that is for those k in the set R, so that

  π(x′|u) ∝ ∏_{s=1}^S a_s(x′_s) ∏_{k∈R} b_k(x′) ∏_{k∈K} I(b_k(x′) ≥ u_k).   (10)

Clusters will not be conditionally independent if they are linked by any k ∈ R interaction, and the same comments then apply to sampling from (10) as did to sampling from (8). The modified algorithm satisfies the necessary conditions for convergence provided the Markov nature is not destroyed by the choice of which auxiliary variables to include. This partition of {k} into K and R is important as it determines how the clustering will be controlled. One possibility is to choose R spatially, as already described, dividing the grid into successively smaller grids. Another option is to retain interactions across which there is a large difference in pixel preference; for the noisy Ising model, R could be {⟨s,t⟩} for which |y_s − y_t| exceeds some specified threshold (the 'multigrid' interpretation here would come from varying the threshold). The former of these allows maximum cluster size to be strictly controlled, but does not adapt to the data. The latter should discourage pixels with widely different preferences from bonding directly, although since closed boundaries are not necessarily produced, such pixels could still find themselves in the same large cluster.
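The data-driven partition just described can be sketched directly: retain any first-order interaction whose data difference exceeds a threshold, and introduce auxiliary variables only for the rest. The helper below is an illustrative assumption, not code from the paper.

```python
import numpy as np

def partition_interactions(y, threshold):
    """Partition the first-order interactions <s,t> of a lattice into
    K (auxiliary variable introduced, bond may be killed) and R (retained,
    no bond may form), retaining pairs whose data values differ by more
    than `threshold`.  Returns two lists of ((i, j), (i2, j2)) pairs."""
    n, m = y.shape
    K, R = [], []
    for i in range(n):
        for j in range(m):
            for di, dj in ((1, 0), (0, 1)):
                ii, jj = i + di, j + dj
                if ii < n and jj < m:
                    pair = ((i, j), (ii, jj))
                    if abs(y[i, j] - y[ii, jj]) > threshold:
                        R.append(pair)  # retained: acts as a cluster boundary
                    else:
                        K.append(pair)  # eligible for bonding
    return K, R
```

Varying the threshold varies the proportion of interactions in R, giving the 'multigrid' interpretation mentioned above.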

3. An archaeological application exhibiting multimodality

3.1. The estimation problem

Higdon (1993) suggests the partial decoupling approach for a modification of the small archaeological data set used by Besag et al. (1991) and Gray (1994). The original data comprise 16 × 16 measurements of log phosphate levels taken from equally spaced soil samples. The objective of the analysis was to estimate whether there had been previous human activity, indicated by raised phosphate levels, at each of the sites. A Bayesian image analysis approach was used, specifying an eight-neighbour Ising-type prior for the presence/absence classification X and assuming pixel-wise independent additive Gaussian noise. The classification at each site was then taken to be the most likely under the posterior given everything else, an estimation procedure which requires MCMC sampling. To create a particularly difficult sampling problem, Higdon constructed an artificial, bimodal, posterior surface based on this example. The posterior to be sampled is

  π(x|y) ∝ ∏_{s=1}^S exp(−(2σ²)⁻¹ (y_s − μ_{x_s})²) ∏_{⟨s,t⟩} exp(−β_{⟨s,t⟩} I(x_s ≠ x_t)),   (11)

where the parameters are taken to be σ² = 1, μ_presence = 2, μ_absence = 1, β_{⟨s,t⟩} = 0.768 for first order neighbours, and β_{⟨s,t⟩} = 0.476 for second order neighbours.

3.2. Results

The sampling problem is sufficiently hard that a non-standard measure of performance is used, the mean mode swapping time (the number of updates of the grid taken to move from one mode to the other and then back, averaged over 2000 of these two-way swaps). Comparisons are made between the Gibbs sampler, the auxiliary variable method, Higdon's partially decoupled sampler, and two multigrids. The first of these is the spatial multigrid, dividing the grid into four, sixteen, and so on. The second multigrid is an adaptive version, allowing bonding only over those ⟨s,t⟩ with the lowest values of |y_s − y_t|; in order to have a comparable multigrid structure, the level denoted i retains a proportion 2^{−i} of the interactions in the partition set R. The notation '04', for example, then denotes a 'W'-cycle beginning at level 0 and with lowest level 4.

Results are given in Table 1. Clearly the Gibbs sampler and the auxiliary variable method both find it difficult to move between modes, and Higdon's partially decoupled method is a great improvement. In terms of the two multigrids, it seems that any limiting of the cluster size encourages mode-swapping, although the extent varies considerably. In particular, the level 33 spatial multigrid achieves the lowest mean mode swap time. It might be thought surprising that the non-adaptive multigrid should outperform an adaptive version for certain cycle patterns. However, the first mode of (11), containing about 70% of the probability, is a nearly all-absence classification. The second mode differs from this in that it contains one main region of pixels classified as presence. This group of pixels lies largely in one corner of the grid, and so a restriction of the clusters to within four quarter-grid sectors facilitates the addition and deletion of the feature. Further investigations revealed that it is important to find an appropriate form of cluster limiting, with the improved performance coming from this rather than from the multigrid variation.

4. An image analysis example with strong interactions

4.1. Confocal fluorescence microscopy

Confocal fluorescence microscopy is a technique used to image three-dimensional volumes without the need for physical sectioning. Images are built up pixel by pixel by scanning the focal point of a laser beam systematically through a specimen stained with fluorescent dye. The improvements over standard microscopy are achieved by arranging the lenses and pinholes to capture only that fluorescence emanating from the focal point of the laser. In theory, out-of-focus information is eliminated. In practice, as the focal point of the microscope moves deeper into the specimen, the images become more degraded. Difficult estimation problems arise because the scattering and attenuation degradation are due in part to the object being imaged. See Wilson (1990), for example, for more background and details.

Table 1. Mean mode swapping times for the amended archaeological problem

              Spatial     |ys − yt|   Gibbs     Higdon's   Auxiliary
              multigrid   multigrid   sampler   sampler    method
                                      2265      256        2160
'W'-cycle
  01          2046        2110
  02          1643        1113
  03           584         709
  04           475         839
  11          1993        2044
  12          1479         880
  13           508         653
  22          1280         552
  23           374         594
  24           395         768
  33           233         571
  34           398         892

4.2. A classification model

Figure 1(a) shows the fifth of 32 progressively deeper optical sections of the stomatal guard cells of a plant leaf (see Fricker and White, 1992, for technical details). These data form part of a sequence of images observing the opening and closing responses of the cells. Questions of biological interest are, for example, the volume of the cells, their surface area, and generally how their shape changes during the response sequence. To address these questions, the images are currently segmented into a binary cell/background classification slice by slice using simple thresholding followed by manual editing. Figures 1(b)–(d) show three thresholded classifications of the image, using thresholds of 65, 75, and 85 respectively (pixels are classified as cell if they have a recorded fluorescence value higher than the threshold value, otherwise they are classified as background).

As an alternative to thresholding, a low-level imaging approach is adopted with the pixel-based classification {X_s} being either cell (1) or background (0). The simplifying model to be assumed is that cell and background pixels each have a typical fluorescence level, μ_1 and μ_0 respectively. Again these typical levels are assumed to be recorded subject to additive, pixel-wise independent Gaussian noise with variance σ², giving rise to the data Y:

  π(y|x, μ_0, μ_1, σ²) = ∏_{s=1}^S exp(−(2σ²)⁻¹ (y_s − μ_{x_s})²) / √(2πσ²).   (12)

As is common in binary image classification problems, the prior distribution on X is taken to be another Ising-type model with parameter β,

  π(x|β) = Z(β)⁻¹ exp(−β ∑_{⟨s,t⟩} I(x_s ≠ x_t)),   (13)

where ⟨s,t⟩ here indicates first order nearest neighbours, and the β dependence of the distribution's normalizing constant Z(β) is explicitly noted.

Given a provisional estimate of the classification, the parameters β, μ_0, μ_1 and σ² could be estimated via maximum likelihood or pseudo-likelihood methods. However, the quality of such estimates will depend upon the accuracy of the provisional classification, and in turn this will affect the final estimated classification. Alternatively, it is possible to specify vague priors for the model parameters, allowing them to vary along with the realizations of X. See Heikkinen and Högmander (1994) for arguments in favour of this approach. The following priors have been chosen (note that the fluorescence is measured in the range [0, 255], and that the chosen range of β covers a wide range of behaviour of the Ising model):

  β ~ U(0, 2),   (14)
  1/σ² ~ Γ(1/2, 1/2),   (15)
  (μ_0, μ_1) ~ U([0, 255] × [0, 255]).   (16)

Using (12)–(16), the posterior distribution of X and the remaining parameters conditioned on the observed data can be written up to proportionality as

  π(x, β, σ², μ_0, μ_1 | y) ∝ π(y|x, σ², μ_0, μ_1) π(x|β) π(β) π(μ_0, μ_1) π(σ²).   (17)

4.3. Simulation from the model

The aim is to draw (x, β, σ², μ_0, μ_1) realizations from (17) as efficiently as possible. Complexity usually means that the sampling is done one parameter or one pixel at a time, requiring the full conditional posterior distributions:

  1/σ² | ... ~ Γ((S + 1)/2, (∑_{s=1}^S (y_s − μ_{x_s})² + 1)/2),   (18)

  μ_i | ... ∝ N((1/n_i) ∑_{s: x_s = i} y_s, σ²/n_i),   μ_i ∈ [0, 255], i = 0, 1,   (19)

  π(β | ...) ∝ Z(β)⁻¹ exp(−β ∑_{⟨s,t⟩} I(x_s ≠ x_t)),   β ∈ [0, 2],   (20)

  π(x_s | ...) ∝ exp(−β ∑_{t∈∂s} I(x_s ≠ x_t) − (2σ²)⁻¹ (y_s − μ_{x_s})²),   (21)

where the notation ... denotes 'all other variables', n_i in (19) is the number of pixels such that x_s = i, and ∂s in (21) indicates those pixels which are nearest neighbours of s.

Fig. 1. (a) Confocal microscopy data, slice 5; (b) Classification thresholding at 65; (c) Classification thresholding at 75; (d) Classification thresholding at 85

Fig. 2. Realizations of five monitored quantities for the first 250 sweeps overlaid by the ergodic averages formed from sweeps 1000 to 5000. GS = Gibbs sampler (solid line), S&W = auxiliary variable method (broken line)

Realizations from (18) and (19) can be drawn efficiently using the Gibbs sampler with standard distributional methods (Ripley, 1987). Sampling the Ising parameter β from (20) requires knowledge of the analytically intractable normalizing constant Z(β). For this reason, work in image analysis has generally treated β as a constant, estimating it once initially, or sometimes periodically as restoration progresses. However, Geyer (1994) proposes the method of reverse logistic regression for off-line estimation of normalizing constants, at least up to proportionality. This method is used here; the distribution (20) is then known up to proportionality, and a Metropolis–Hastings algorithm can be used.
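As an illustration of the standard distributional steps, the draws from (18) and (19) can be sketched as follows. The helper names and the simple rejection step for the truncation to [0, 255] are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_sigma2(y, x, mu, rng):
    """Gibbs draw of sigma^2 via (18): the precision 1/sigma^2 has a
    Gamma((S + 1)/2, (sum of squared residuals + 1)/2) full conditional."""
    resid = y - np.where(x == 1, mu[1], mu[0])
    shape = (y.size + 1) / 2.0
    rate = (np.sum(resid ** 2) + 1.0) / 2.0
    precision = rng.gamma(shape, 1.0 / rate)  # numpy parameterizes by scale = 1/rate
    return 1.0 / precision

def draw_mu(y, x, i, sigma2, rng, lo=0.0, hi=255.0):
    """Gibbs draw of mu_i via (19): a normal with mean equal to the average
    datum over pixels currently labelled i and variance sigma^2 / n_i,
    truncated here to [0, 255] by simple rejection."""
    ys = y[x == i]
    mean, sd = ys.mean(), np.sqrt(sigma2 / ys.size)
    while True:
        draw = rng.normal(mean, sd)
        if lo <= draw <= hi:
            return draw
```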

The main aim of this section is to investigate the sampling of X. It is clear from looking at the data that the interaction parameter β will be high; the scene is comprised of large areas of the same classification. Although the data will tend to counteract this effect if the noise levels are low enough, a general consequence is that single-site updating algorithms find it hard to alter boundary positions and make other such changes. The question is whether some form of the auxiliary variable method would be an improvement by way of permitting groups of pixels to change simultaneously.

4.4. Results

Two issues in MCMC performance are how quickly the realizations settle to behave as though they were from the target distribution, and the accuracy of the ergodic average estimators. As a simple diagnostic for convergence, or rather lack of it, it is common practice to plot realizations of several scalar-valued functionals against iteration number. To compare the Gibbs sampler and the auxiliary variable method, Fig. 2 gives plots of this type for the observed values of β, μ_0, μ_1, σ² and the number of cell pixels; the last of these is closely related to the biological questions of interest, and is known to be a good indicator of sampling difficulties (Gray, 1994). Each plot shows the first 250 realizations from a total of 5000, overlaid by the ergodic averages calculated from the final 4000 sweeps. Stability of μ_0, μ_1, σ² and the number of cell pixels appears to be attained relatively rapidly, within perhaps the first 25 iterations. The convergence of β is slightly less impressive, requiring closer to 150 iterations. Judging by all five monitored quantities, the auxiliary variable sampler appears to provide slower convergence.

The efficiency of estimation is usually measured by the integrated autocorrelation time τ, defined as follows: as the number of iterations at equilibrium tends to infinity, the asymptotic variance of the ergodic average of some functional f(·) tends to τ times the variance of f(·) which would be obtained under independent sampling from the stationary distribution. Sokal's method (Sokal, 1989) is used to estimate τ, taking the number of cell pixels as the functional f(·). Using the Gibbs sampler, the median τ̂ over 20 repetitions of the simulation is 7.7. However, using the auxiliary variable method the median τ̂ is 73.0. In other words, roughly 10 times more sweeps of the auxiliary variable method than of the Gibbs sampler are needed to achieve the same accuracy.
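Sokal's windowed estimator of τ can be sketched as follows, for a scalar chain such as the number of cell pixels; the window constant c = 6 and the maximum lag are conventional choices of this sketch rather than values taken from the paper.

```python
import numpy as np

def integrated_autocorr_time(f, c=6.0, max_lag=1000):
    """Estimate the integrated autocorrelation time tau of a scalar chain.

    tau = 1 + 2 * sum_t rho(t), with the sum truncated at the smallest
    window M satisfying M >= c * tau_hat(M) (Sokal's self-consistent
    window).  The asymptotic variance of an ergodic average is then tau
    times the variance under independent sampling."""
    f = np.asarray(f, dtype=float)
    f = f - f.mean()
    n = f.size
    lags = min(n // 2, max_lag)
    # Empirical autocorrelations rho(t) up to the maximum lag considered.
    acf = np.array([np.dot(f[: n - t], f[t:]) / (n - t) for t in range(lags)])
    rho = acf / acf[0]
    tau = 1.0
    for m in range(1, lags):
        tau = 1.0 + 2.0 * np.sum(rho[1 : m + 1])
        if m >= c * tau:
            break
    return tau
```

For a chain of independent draws the estimate is close to 1; a median τ̂ of 73 for the auxiliary variable sampler thus translates directly into the roughly tenfold increase in sweeps needed relative to the Gibbs sampler's 7.7.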

Can the performance of the Gibbs sampler be improved upon by modifying the auxiliary variable clustering? A spatial multigrid may not be the most appropriate form to use here; unrestricted cluster growth would still seem reasonable in the regions which are clearly comprised of just a single classification, for example most of the background. One way to accommodate this would be to prevent bonding across the class boundaries formed when the data are thresholded; Fig. 3(f) shows these boundaries when the threshold level is set at 85. It seems unlikely, however, that a single fairly arbitrary thresholding will be able to influence the clustering sufficiently. Figures 3(a)–(e) show the boundaries which result from considering two threshold levels simultaneously. Interactions ⟨s,t⟩ will be retained in the set R, and thus the corresponding exp(−β I(x′_s ≠ x′_t)) will be in π(x′|u, ...), either if they lie across one of the threshold boundaries, or if they involve a pixel assigned different classifications under the different threshold levels. In this way, 'undecided' pixels should be excluded from large clusters. The number of such pixels depends on the separation of the two threshold levels: Fig. 3(e) uses levels 80 and 90, (d) levels 75 and 95, (c) levels 70 and 100, (b) levels 65 and 105, and (a) levels 60 and 110. Box-plots of the corresponding τ̂ are given in Fig. 4(a). Notice that even the single-threshold set of restrictions considerably reduces τ from the previous median value of 73. However, no single one of these R ∪ K partitions of the {⟨s,t⟩} displays an outstanding performance.

Fig. 3. Non-bonding boundaries: (a) Level 1; (b) Level 2; (c) Level 3; (d) Level 4; (e) Level 5; (f) Level 6
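The two-threshold construction of the retained set R can be sketched as follows; `retained_interactions` is an illustrative helper operating on first-order pairs, with the two thresholds playing the roles of the level pairs used for Fig. 3(a)–(e).

```python
import numpy as np

def retained_interactions(y, t_low, t_high):
    """Retained set R built from a pair of thresholds: a first-order
    interaction <s,t> is retained if it crosses either threshold boundary,
    or if it touches a pixel classified differently by the two thresholds
    (an 'undecided' pixel).  Returns a set of ((i, j), (i2, j2)) pairs."""
    lo = y > t_low     # classification at the lower threshold
    hi = y > t_high    # classification at the higher threshold
    undecided = lo != hi
    n, m = y.shape
    R = set()
    for i in range(n):
        for j in range(m):
            for di, dj in ((1, 0), (0, 1)):
                ii, jj = i + di, j + dj
                if ii < n and jj < m:
                    crosses = (lo[i, j] != lo[ii, jj]) or (hi[i, j] != hi[ii, jj])
                    touches = undecided[i, j] or undecided[ii, jj]
                    if crosses or touches:
                        R.add(((i, j), (ii, jj)))
    return R
```

Widening the gap between the two thresholds enlarges the undecided set and hence R, which is how the sequence of boundary patterns in Fig. 3(a)–(e) varies.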

What about a multigrid effect? Figure 4(b) gives a second set of τ̂ box-plots, this time comparing the Gibbs sampler, Higdon's method, and the threshold-based adaptive samplers now incorporated into a multigrid pattern. Again the multigrid variation appears more to ameliorate the effect of a bad choice of cluster boundary than actually to improve performance. The fact that the performances of the Gibbs sampler and the modified auxiliary variable method are similar is maybe not surprising. Figure 5 shows four widely spaced realizations of X from the model, together with the final marginal posterior mode estimate (at each pixel s, the classification is chosen to maximize the conditional distribution of X_s given everything else). Relatively few pixels alter between realizations, and most of those that do are in the set identified to be updated separately by the thresholding approach. In the multigrid, MCMC updates of these single-pixel clusters in effect attempt to draw from the conditional distribution (21). If the Gibbs sampler's problems come from updating these pixels one by one, then unfortunately so too will those of the auxiliary variable method. Modifying the auxiliary variable sampler may have helped identify where and how sampling problems are occurring in this application, but it has not addressed the question of overcoming them.

Fig. 4. Box-plots of the estimated integrated autocorrelation times: (a) Samplers not including any multigrid pattern; (b) Higdon's method, the Gibbs sampler and multigrid samplers. (GS, Gibbs sampler; H, Higdon's partially decoupled method)

Fig. 5. (a) Realization 1000; (b) Realization 2000; (c) Realization 3000; (d) Realization 4000; (e) MPM estimate

5. Discussion

Conceptually, updating more than one site at a time in MCMC techniques should lead to considerable benefits in terms of improved movement around the sample space. The good performance of the Swendsen and Wang algorithm for multiple-site sampling from the Ising model has led to much interest in more general auxiliary variable methods for tackling other awkward distributions. In this paper, the use of an auxiliary variable sampler has been considered in two types of problem which MCMC methods are known to find difficult. In the first application, the distribution of interest exhibits severe multimodality, whereas in the second application, there are strong interactions between the variable components. In both cases, the auxiliary variable method encountered quite serious problems linked to the unrestricted growth of the clusters to be updated. To improve sampling it is not sufficient to update any group of pixels simultaneously; it seems that the choice of the group must lead to a constructive step around the state space.

There are at least two ways in which cluster growth can be restricted in the auxiliary variable sampler. The first is to use Higdon's partial decoupling approach; the second is to group the sites using some spatial or data-driven criterion and then use the result to guide the cluster boundaries. Both of these methods have been investigated, and both have been shown to lead to more effective samplers. However, whether either of these modified algorithms then becomes more effective than the standard single-site methods rather depends on the application. Although only two applications have been considered, certain observations seem reasonable. The first is that an adaptive auxiliary variable method is likely to be most useful in problems where modes of the target distribution correspond to configurations containing different `objects' or large-scale features, as was the case in the first example considered. Single-site samplers have difficulty moving between these modes because the intermediate configurations may have very low probability. Multiple-site updating, on the other hand, does allow objects to be created or removed quickly, but only when roughly the right sites are considered simultaneously. In such cases, any knowledge of approximately where these changes take place is best used in an adaptive segmentation; otherwise the partially decoupled approach is very effective without requiring such information. Results were rather more disappointing in the second example, which exhibited strong component interactions. Here the auxiliary variable samplers were able to isolate where the sampling problems were occurring, but they were not then able to improve performance, at least in the modified forms considered.

The conclusions from this study are not clear-cut. Certainly there are situations in which some form of auxiliary variable sampler can significantly outperform single-site samplers. However, in order to do this, the auxiliary variable sampler may require problem-by-problem modifications. Insight into what makes the problem hard to tackle using a single-site sampler will be important in adapting an auxiliary variable alternative.

Acknowledgements

Thanks to the Plant Sciences Department at Oxford University for explaining the microscopy problem and providing data. Thanks also to Peter Green and Chris Jennison for helpful discussions. A Nuffield Foundation award is gratefully acknowledged.

References

Besag, J. and Green, P. (1993) Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, B55, 25–38.

Besag, J., York, J. and Mollié, A. (1991) Bayesian image analysis, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43, 1–59.

Brooks, S. and Roberts, G. (1996) Diagnosing convergence of Markov chain Monte Carlo algorithms. Technical Report, Statistical Laboratory, University of Cambridge.

Cowles, M. and Carlin, B. (1996) Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91, 883–904.

Fricker, M. and White, N. (1992) Application of confocal microscopy and three-dimensional image analysis to plant and microbial cells. Binary, 4, 44–9.

Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–41.

Geyer, C. (1994) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report, University of Minnesota.

Gilks, W., Clayton, D., Spiegelhalter, D., Best, N., McNeil, A., Sharples, L. and Kirby, A. (1993) Modelling complexity: applications of Gibbs sampling in medicine. Journal of the Royal Statistical Society, B55, 39–52.

Gray, A. (1994) Simulating posterior Gibbs distributions: a comparison of the Swendsen–Wang and Gibbs sampler methods. Statistics and Computing, 4, 189–201.

Green, P. and Han, X. (1991) Metropolis methods, Gaussian proposals and antithetic variables. In: Stochastic Models, Statistical Methods and Algorithms in Image Analysis (P. Barone, A. Frigessi and M. Piccioni, eds). Springer-Verlag, Berlin.

Han, X. (1993) Markov chain Monte Carlo and sampling efficiency. PhD Thesis, University of Bristol.

Hastings, W. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.


Heikkinen, J. and Högmander, H. (1994) Fully Bayesian approach to image restoration with an application in biogeography. Applied Statistics, 43, 569–82.

Higdon, D. (1993) Contribution to the discussion of Besag and Green. Journal of the Royal Statistical Society, B55, 78.

Hurn, M. (1994) An adaptive Swendsen and Wang type algorithm. Technical Report 94:03, University of Bath.

Hurn, M. and Jennison, C. (1993) Multiple-site updates in maximum a posteriori and marginal posterior modes image estimation. In Statistics and Images, Volume 1, 155–86 (K. Mardia and G. Kanji, eds). Carfax, Oxford.

Hurn, M. and Jennison, C. (1995) A study of simulated annealing and a revised cascade algorithm for image reconstruction. Statistics and Computing, 5, 175–90.

Kandel, D., Domany, E. and Brandt, A. (1989) Simulations without critical slowing down: Ising and three-state Potts models. Physical Review, B40, 330–44.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953) Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–92.

Møller, J. (1993) Contribution to the discussion of Besag and Green. Journal of the Royal Statistical Society, B55, 84.

Ripley, B. (1987) Stochastic Simulation. Wiley, New York.

Smith, A. and Roberts, G. (1993) Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, B55, 3–24.

Sokal, A. (1989) Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms. Cours de Troisième Cycle de la Physique en Suisse Romande, Lausanne.

Swendsen, R. and Wang, J. (1987) Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58, 86–8.

Wilson, T. (1990) Confocal Microscopy. Academic Press, London.
