
Sequential Monte Carlo tracking by fusing multiple cues in video sequences

Paul Brasnett a, Lyudmila Mihaylova b,*, David Bull a, Nishan Canagarajah a

a Department of Electrical and Electronic Engineering, University of Bristol, Bristol BS8 1UB, UK
b Department of Communication Systems, Lancaster University, Lancaster LA1 4WA, UK

Received 16 February 2006; received in revised form 3 July 2006; accepted 12 July 2006

Abstract

This paper presents visual cues for object tracking in video sequences using particle filtering. A consistent histogram-based framework is developed for the analysis of colour, edge and texture cues. The visual models for the cues are learnt from the first frame and the tracking can be carried out using one or more of the cues. A method for online estimation of the noise parameters of the visual models is presented along with a method for adaptively weighting the cues when multiple models are used. A particle filter (PF) is designed for object tracking based on multiple cues with adaptive parameters. Its performance is investigated and evaluated with synthetic and natural sequences and compared with the mean-shift tracker. We show that tracking with multiple weighted cues provides more reliable performance than single cue tracking.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Particle filtering; Tracking in video sequences; Colour; Texture; Edges; Multiple cues; Bhattacharyya distance

1. Introduction

Object tracking is required in many vision applications such as human–computer interfaces, video communication/compression, road traffic control, and security and surveillance systems. Often the goal is to obtain a record of the trajectory of one or more targets over time and space. Object tracking in video sequences is a challenging task because of the large amount of data used and the common requirement for real-time computation. Moreover, most of the models encountered in visual tracking are nonlinear, non-Gaussian, multi-modal or any combination of these.

In this paper, we focus on Monte Carlo methods (particle filtering) for tracking in video sequences. Particle filtering, also known as the Condensation algorithm [1] and bootstrap filter [2], has recently been proven to be a powerful and reliable tool for nonlinear systems [2–4]. Particle filtering is a promising technique because of its inherent ability to fuse different sensor data, to account for different uncertainties, to cope with data association problems when multiple targets are tracked with multiple sensors, and to incorporate constraints. Particle filters keep track of the state through a sample-based representation of probability density functions. Here, we develop a particle filtering technique for object tracking in video sequences by visual cues. Further, methods are presented for combining the cues if they are assumed to be independent. By comparing results from single-cue tracking with multiple cues we show that multiple complementary cues can improve the accuracy of tracking. The features and their parameters are adaptively chosen based on an appropriately defined distance function. The developed particle filter, together with a mixed dynamic model, enables recovery after a partial or full loss.

Different algorithms have been proposed for visual tracking and their particularities are mainly application dependent. Many of them rely on a single cue, which can be chosen according to the application context; e.g., in [5] a colour-based particle filter is developed. The colour-based particle filter has been shown [5] to outperform the mean-shift tracker proposed in [6,7] in terms of reliability, at the price of increased computational time. However, both the particle filtering and mean-shift tracking methods have real-time capabilities. Colour cues form a significant part of many tracking algorithms [8–12,5]. The advantage of colour is that it is a weak model and is therefore unrestrictive about the type of objects being tracked. The main problem for tracking with colour alone occurs when the region around the target object contains objects with similar colour. When the region is cluttered in this way a single cue does not provide reliable performance because it fails to fully model the target. Stronger models have been used but they rely on off-line learning and modelling of foreground and background models [1,13].

Multiple-cue tracking provides more information about the object and hence there is less opportunity for clutter to influence the result. In [9], colour cues are combined with motion and sound cues to provide better results. Motion and sound are both intermittent cues and therefore cannot always be relied upon. Colour and shape cues are used in [8], where shape is described using a parameterised rectangle or ellipse. The cues are combined by weighting each one based upon its performance in previous frames. A cue-selection approach to optimise the use of the cues is proposed in [14], which is embedded in a hierarchical vision-based tracking algorithm. When the target is lost, layers cooperate to perform a rapid search for the target and continue tracking. Another approach, called democratic integration [15], implements cues concurrently: all vision cues are complementary and contribute simultaneously to the overall result. The integration is performed through saliency maps and the result is a weighted average of saliency maps. Robustness and generality are major features of this approach, resulting from the combination of the cues. Adaptation schemes are sometimes used to handle changes to the appearance of the object being tracked [16]. Careful design must trade off adapting to a rapidly changing appearance against adapting too quickly to incorrect regions. The work here does not involve an appearance adaptation scheme, but an existing scheme (e.g., [16]) could be adopted within the framework.

In this paper, we also show that tracking with multiple weighted cues provides more reliable and accurate results. A framework is suggested for combining colour, texture and edge cues to provide robust and accurate tracking without the need for extensive off-line modelling. It is an extension and generalisation of the results reported in [17], with colour and texture only. A comparison is presented in [17] of a particle filter with a Gaussian sum particle filter, working separately with colour, with texture, and with both colour and texture features. Adaptive colour and texture segmentation for tracking moving objects is proposed in [11]. Texture is modelled by an autobinomial Gibbs Markov random field, whilst colour is modelled by a two-dimensional Gaussian distribution. In our paper, texture is represented using a steerable pyramid decomposition, which is a different approach to [11] but is related to the work in [16]. Additionally, in [11] segmentation-based tracking with a Kalman filter is considered instead of the feature-based tracking proposed here.

The paper is organised as follows. Section 2 states the problem of visual tracking within a sequential Monte Carlo framework and presents the particle filter (PF) based on single or multiple information cues. Section 3 introduces the model of the region of interest. Section 4 describes the cues and their likelihoods used for the tracking process. Methods for adaptively changing the cues' noise parameters and adaptively weighting the cues are presented. The overall likelihood function of the particle filter represents a product of the separate cues. Section 5 investigates the particle filter performance and validates it over different scenarios. We show the advantages of fusing multiple cues compared to single-cue tracking using synthetic and natural video sequences. A comparison with the mean-shift algorithm is performed over natural video sequences and their computational time is characterised. Results are also presented for partial and full occlusions. Finally, Section 6 discusses the results and open issues for future research.

2. Sequential Monte Carlo framework

The aim of sequential Monte Carlo estimation is to evaluate the posterior probability density function (pdf) $p(\mathbf{x}_k|Z_k)$ of the state vector $\mathbf{x}_k \in \mathbb{R}^{n_x}$, with dimension $n_x$, given a set $Z_k = \{\mathbf{z}_1, \ldots, \mathbf{z}_k\}$ of sensor measurements up to time $k$. The Monte Carlo approach relies on a sample-based construction to represent the state pdf. Multiple particles (samples) of the state are generated, each one associated with a weight $W^{(\ell)}_k$ which characterises the quality of a specific particle $\ell$, $\ell = 1, 2, \ldots, N$.

An estimate of the variable of interest is obtained by the weighted sum of particles. Two major stages can be distinguished: prediction and update. During prediction, each particle is modified according to the state model of the region of interest in the video frame, including the addition of random noise in order to simulate the effect of the noise on the state. In the update stage, each particle's weight is re-evaluated based on the new data.

An inherent problem with particle filters is degeneracy, the case when only one particle has a significant weight. An estimate of the measure of degeneracy [18] at time $k$ is given as

$$N_{\mathrm{eff}} = \frac{1}{\sum_{\ell=1}^{N} \left(W^{(\ell)}_k\right)^2}. \quad (1)$$

If the value of $N_{\mathrm{eff}}$ is below a user-defined threshold $N_{\mathrm{thres}}$, a resampling procedure can help to avoid degeneracy by eliminating particles with small weights and replicating particles with larger weights.
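To make the degeneracy check concrete, here is a minimal Python/NumPy sketch (an illustration, not the authors' code) of Eq. (1):

```python
import numpy as np

def effective_sample_size(weights):
    """Effective number of particles, Eq. (1): 1 / sum of squared weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()          # ensure the weights are normalised
    return 1.0 / np.sum(w ** 2)

# Uniform weights give N_eff = N; a near-degenerate set gives N_eff close to 1.
print(effective_sample_size(np.ones(500)))               # 500.0
print(effective_sample_size([0.97, 0.01, 0.01, 0.01]))   # ~1.06
```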


2.1. A particle filter for object tracking using multiple cues

Within the Bayesian framework, the conditional pdf $p(\mathbf{x}_{k+1}|Z_k)$ is recursively updated according to the prediction step

$$p(\mathbf{x}_{k+1}|Z_k) = \int_{\mathbb{R}^{n_x}} p(\mathbf{x}_{k+1}|\mathbf{x}_k)\, p(\mathbf{x}_k|Z_k)\, d\mathbf{x}_k \quad (2)$$

and the update step

$$p(\mathbf{x}_{k+1}|Z_{k+1}) = \frac{p(\mathbf{z}_{k+1}|\mathbf{x}_{k+1})\, p(\mathbf{x}_{k+1}|Z_k)}{p(\mathbf{z}_{k+1}|Z_k)}, \quad (3)$$

where $p(\mathbf{z}_{k+1}|Z_k)$ is a normalising constant. The recursive update of $p(\mathbf{x}_{k+1}|Z_{k+1})$ is proportional to

$$p(\mathbf{x}_{k+1}|Z_{k+1}) \propto p(\mathbf{z}_{k+1}|\mathbf{x}_{k+1})\, p(\mathbf{x}_{k+1}|Z_k). \quad (4)$$

Usually, there is no simple analytical expression for propagating $p(\mathbf{x}_{k+1}|Z_{k+1})$ through (4), so numerical methods are used.

In the particle filter approach, a set of $N$ weighted particles, drawn from the posterior conditional pdf, is used to map integrals to discrete sums. The posterior $p(\mathbf{x}_{k+1}|Z_{k+1})$ is approximated by

$$p(\mathbf{x}_{k+1}|Z_{k+1}) \approx \sum_{\ell=1}^{N} \hat{W}^{(\ell)}_{k+1}\, \delta(\mathbf{x}_{k+1} - \mathbf{x}^{(\ell)}_{k+1}), \quad (5)$$

where $\hat{W}^{(\ell)}_k$ are the normalised importance weights. New weights are calculated, putting more weight on particles that are important according to the posterior pdf (5).

It is often impossible to sample directly from the posterior density function $p(\mathbf{x}_{k+1}|Z_{k+1})$. This difficulty is circumvented by making use of importance sampling from a known proposal distribution $p(\mathbf{x}_{k+1}|\mathbf{x}_k)$. The particle filter is given in Table 1. The residual resampling algorithm described in [19,20] is applied at step (7). This is a two-step process making use of the sampling-importance-resampling scheme.

Table 1
The particle filter with multiple cues

Initialisation
(1) For $\ell = 1, \ldots, N$, generate samples $\{\mathbf{x}^{(\ell)}_0\}$ from the prior distribution $p(\mathbf{x}_0)$. Set initial weights $W^{(\ell)}_0 = 1/N$.
For $k = 0, 1, 2, \ldots$

Prediction step
(2) For $\ell = 1, \ldots, N$, sample $\mathbf{x}^{(\ell)}_{k+1} \sim p(\mathbf{x}_{k+1}|\mathbf{x}^{(\ell)}_k)$ from the dynamic model presented in Section 3.

Measurement update: evaluate the importance weights
(3) Compute the weights $W^{(\ell)}_{k+1} \propto W^{(\ell)}_k L(\mathbf{z}_{k+1}|\mathbf{x}^{(\ell)}_{k+1})$ based on the likelihood $L(\mathbf{z}_{k+1}|\mathbf{x}^{(\ell)}_{k+1})$ given in Section 4.
(4) Normalise the weights, $\hat{W}^{(\ell)}_{k+1} = W^{(\ell)}_{k+1} / \sum_{\ell=1}^{N} W^{(\ell)}_{k+1}$.

Output
(5) The state estimate $\hat{\mathbf{x}}_{k+1}$ is the probabilistically averaged sum $\hat{\mathbf{x}}_{k+1} = \sum_{\ell=1}^{N} \hat{W}^{(\ell)}_{k+1} \mathbf{x}^{(\ell)}_{k+1}$.
(6) Estimate the effective number of particles $N_{\mathrm{eff}} = 1 / \sum_{\ell=1}^{N} (\hat{W}^{(\ell)}_{k+1})^2$. If $N_{\mathrm{eff}} \le N_{\mathrm{thres}}$ then perform resampling.

Resampling step
(7) Multiply/suppress samples $\mathbf{x}^{(\ell)}_{k+1}$ with high/low importance weights $\hat{W}^{(\ell)}_{k+1}$.
(8) For $\ell = 1, \ldots, N$, set $W^{(\ell)}_{k+1} = \hat{W}^{(\ell)}_{k+1} = 1/N$.
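The following is a minimal NumPy sketch of one cycle of Table 1; it is an illustration under stated assumptions, not the authors' implementation. The `propagate` and `likelihood` callables are hypothetical stand-ins for the dynamic models of Section 3 and the cue likelihoods of Section 4, and particles are stored as an N-by-d array. Residual resampling follows the two-step scheme of [19,20].

```python
import numpy as np

def residual_resample(weights, rng):
    """Residual resampling: deterministic copies from the integer part of
    N*w, then multinomial sampling on the normalised remainders."""
    N = len(weights)
    counts = np.floor(N * weights).astype(int)
    residual = N * weights - counts
    n_rest = N - counts.sum()
    if n_rest > 0:
        residual /= residual.sum()
        counts += rng.multinomial(n_rest, residual)
    return np.repeat(np.arange(N), counts)   # particle indices to keep

def particle_filter_step(particles, weights, propagate, likelihood, z,
                         n_thres, rng):
    """One cycle of the PF in Table 1 (steps 2-8)."""
    particles = propagate(particles, rng)            # step (2): prediction
    weights = weights * likelihood(z, particles)     # step (3): reweight
    weights /= weights.sum()                         # step (4): normalise
    estimate = weights @ particles                   # step (5): weighted mean
    n_eff = 1.0 / np.sum(weights ** 2)               # step (6): degeneracy
    if n_eff <= n_thres:                             # steps (7)-(8): resample
        idx = residual_resample(weights, rng)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights, estimate
```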


3. Dynamic models

The considered model for the moving object provides invariance to different motions, such as translations and rotations, and to changes in the object size. This covers the different types of motion of the object, including the case when the object size varies considerably (the object gets closer to the camera or moves far away from it), and hence ensures reliable performance of the PF. In our particular implementation two generic models are used. We adopted a constant velocity model for the translational motion and a random walk model for the rotation and scaling. A mixed dynamic motion model is also presented, which allows more than one model to be used for dealing with occlusions.

For the purpose of tracking an object in video sequences we initially choose a region which defines the object. The shape of this region is fixed a priori and here is a rectangular box.

Denote by $(x, y)$ the coordinates of the centre of the rectangular region, by $\theta$ the angle through which the region is rotated, and by $s$ the scale; $(\dot{x}, \dot{y})$ are the respective velocity components.

3.1. Constant velocity model for translational motion

The translational motion of the region of interest in the $x$ direction can be described by a constant velocity model [21]

$$\mathbf{x}_{k+1} = F\mathbf{x}_k + \mathbf{w}_k, \quad \mathbf{w}_k \sim \mathcal{N}(0, Q), \quad (6)$$

where the state vector is $\mathbf{x} = (x, \dot{x})^T$. The matrix

$$F = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix}$$

describes the dynamics of the state over time and $T$ is the sampling interval. The system noise $\mathbf{w}_k$ is assumed to be a zero-mean white Gaussian sequence, $\mathbf{w}_k \sim \mathcal{N}(0, Q)$, with the covariance matrix

$$Q = \begin{pmatrix} \frac{1}{4}T^4 & \frac{1}{2}T^3 \\ \frac{1}{2}T^3 & T^2 \end{pmatrix} \sigma^2 \quad (7)$$

and $\sigma$ is the noise standard deviation.
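As an illustration, a small Python/NumPy helper (not from the paper) that builds $F$ and $Q$ for a given sampling interval $T$ and noise standard deviation $\sigma$:

```python
import numpy as np

def cv_model(T, sigma):
    """Constant velocity model for one coordinate: F of Eq. (6) and
    Q of Eq. (7) for sampling interval T and noise std sigma."""
    F = np.array([[1.0, T],
                  [0.0, 1.0]])
    Q = np.array([[T**4 / 4.0, T**3 / 2.0],
                  [T**3 / 2.0, T**2]]) * sigma**2
    return F, Q
```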

3.2. Random walk model for rotational motion and for the scale

A random walk model propagates the state $\mathbf{x} = (\theta, s)^T$ by

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{w}_k, \quad (8)$$

where $\mathbf{w}_k \sim \mathcal{N}(0, Q)$ is zero-mean Gaussian noise with covariance matrix $Q = \mathrm{diag}\{\sigma_\theta^2, \sigma_s^2\}$, describing the uncertainty in the state vector.

3.3. Multi-component state

The motion of the object being tracked is described using the translation $(x, y)$, rotation $(\theta)$ and scaling $(s)$ components. The translation components are modelled by the constant velocity model (6) and the rotation and scaling components are modelled using the random walk model (8). The full augmented state of the region is then given as

$$\mathbf{x} = (x, \dot{x}, y, \dot{y}, \theta, s)^T. \quad (9)$$

The dynamics of the full state can then be modelled as

$$\mathbf{x}_{k+1} = G\mathbf{x}_k + \mathbf{w}_k, \quad \mathbf{w}_k \sim \mathcal{N}(0, Q), \quad (10)$$

where the matrix $G$ has the form

$$G = \begin{pmatrix} F & 0_{2\times 2} & 0_{2\times 1} & 0_{2\times 1} \\ 0_{2\times 2} & F & 0_{2\times 1} & 0_{2\times 1} \\ 0_{1\times 2} & 0_{1\times 2} & 1 & 0 \\ 0_{1\times 2} & 0_{1\times 2} & 0 & 1 \end{pmatrix}. \quad (11)$$

The covariance matrix of the zero-mean Gaussian is

$$Q = \begin{pmatrix} Q_x & 0_{2\times 2} & 0_{2\times 1} & 0_{2\times 1} \\ 0_{2\times 2} & Q_y & 0_{2\times 1} & 0_{2\times 1} \\ 0_{1\times 2} & 0_{1\times 2} & \sigma_\theta^2 & 0 \\ 0_{1\times 2} & 0_{1\times 2} & 0 & \sigma_s^2 \end{pmatrix}, \quad (12)$$

where $Q_x$ is the covariance matrix of the constant-velocity model for the $x$ component (7), $Q_y$ is the covariance matrix of the $y$ component, and $\sigma_\theta^2$ and $\sigma_s^2$ are the variances for the rotation $(\theta)$ and scaling $(s)$.
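A self-contained sketch (function name and signature illustrative) of the full-state matrices (11) and (12):

```python
import numpy as np

def full_state_model(T, sigma_x, sigma_y, sigma_theta, sigma_s):
    """Transition matrix G, Eq. (11), and covariance Q, Eq. (12), for the
    augmented state x = (x, xdot, y, ydot, theta, s)^T."""
    F = np.array([[1.0, T], [0.0, 1.0]])
    Qcv = np.array([[T**4 / 4.0, T**3 / 2.0],
                    [T**3 / 2.0, T**2]])
    G = np.eye(6)
    G[0:2, 0:2] = F                    # x, xdot block
    G[2:4, 2:4] = F                    # y, ydot block; theta, s rows stay identity
    Q = np.zeros((6, 6))
    Q[0:2, 0:2] = Qcv * sigma_x**2     # Q_x of Eq. (7)
    Q[2:4, 2:4] = Qcv * sigma_y**2     # Q_y
    Q[4, 4] = sigma_theta**2           # rotation variance
    Q[5, 5] = sigma_s**2               # scale variance
    return G, Q
```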

3.4. Mixed dynamic model

A mixed-dynamic model allows the system to be described through more than one dynamic model [16,22,23] and gives the tracking algorithm the ability to cope with occlusions. Here, we make use of two models: the constant velocity model as given in Section 3.1, and the reinitialisation model described below. When the object is occluded the tracker might lose it temporarily. In this case the reinitialisation model, which ensures a uniform spread of particles over the image, guarantees robustness. The new location of the object is recovered after processing the information from the separate cues in the measurement update step.

Samples are generated from the constant velocity model with probability $\kappa$, set a priori, and from the reinitialisation model with probability $1 - \kappa$. Hence, samples are generated from the mixed model by the following steps (a sketch follows the list):

1. Generate a number $c \in [0, 1)$ from a uniform distribution $U$.
2. If $c < \kappa$, then sample from the constant velocity model (6);
3. else use the reinitialisation model

$$p(\mathbf{x}_{k+1}|\mathbf{x}_k) \sim U(0, \mathbf{x}_{\max}), \quad (13)$$

where $\mathbf{x}_{\max}$ is a vector with the maximum allowed values for the state vector components. This is repeated until the required number of samples is obtained.
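A hedged sketch of the sampling steps above, assuming particles are stored as an N-by-d array and that `x_max` holds the maximum allowed state values; the vectorised form is ours, not the paper's:

```python
import numpy as np

def sample_mixed_model(particles, G, Q, x_max, kappa, rng):
    """Propagate each particle with the constant velocity model (10) with
    probability kappa; otherwise reinitialise uniformly over [0, x_max),
    Eq. (13)."""
    N, d = particles.shape
    out = np.empty_like(particles)
    use_cv = rng.random(N) < kappa                 # step 1 and 2
    n_cv = int(use_cv.sum())
    noise = rng.multivariate_normal(np.zeros(d), Q, size=n_cv)
    out[use_cv] = particles[use_cv] @ G.T + noise  # dynamic model (10)
    out[~use_cv] = rng.random((N - n_cv, d)) * x_max   # reinitialisation (13)
    return out
```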

4. Likelihood models

This section describes how we model the separate cues of the rectangular region $S_\mathbf{x}$ surrounding the moving object, and the likelihood models of the cues. One of the particularities of tracking moving objects in video sequences, compared to other tracking problems such as tracking airborne targets with radar data, is that there are no measurement models in explicit form. The estimated state variables of the object are connected to features of the video sequences. In practice, the likelihood models of the features provide information about the changes in the motion of the object.

All of the models are based on histograms. Histograms have the useful property that they allow some change in the object appearance without changing the histogram.

4.1. Colour cue

Colour cues are flexible in the type of object that they can be used to track. However, the main drawbacks of colour cues are:

• the effect of other similarly coloured regions and
• the lack of discrimination with respect to rotation (obvious in Fig. 1).

A histogram, $h_\mathbf{x} = (h_{1,\mathbf{x}}, \ldots, h_{B_C,\mathbf{x}})$, for a region $S_\mathbf{x}$ corresponding to a state $\mathbf{x}$ is given by

$$h_{i,\mathbf{x}} = \sum_{\mathbf{u} \in S_\mathbf{x}} \delta_i(b_\mathbf{u}), \quad i = 1, \ldots, B_C, \quad (14)$$

where $\delta_i$ is the Kronecker delta function at the bin index $i$, $b_\mathbf{u} \in \{1, \ldots, B_C\}$ is the histogram bin index associated with the intensity at pixel location $\mathbf{u} = (x, y)$, and $B_C$ is the number of bins in each colour channel. The histogram is normalised such that $\sum_{i=1}^{B_C} h_{i,\mathbf{x}} = 1$.

Fig. 1. Colour cues provide a flexible model for tracking but lack discrimination with respect to rotation. The time board (a) is tracked with colour cues. It can be seen that at frame 40 (b) the region has rotated; this has got worse by frame 70 (c).

A histogram is constructed for each channel in the colour space. For example, we use 8 × 8 × 8 bin histograms in the three channels of the red, green and blue (RGB) colour space [9]; other colour spaces could be used to improve robustness to illumination or appearance changes.
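As an illustration, a minimal Python/NumPy sketch (not the authors' code) of the per-channel histogram of Eq. (14) for an 8-bit RGB region:

```python
import numpy as np

def colour_histogram(region, bins=8):
    """Normalised per-channel histogram, Eq. (14), for an RGB region of
    shape (H, W, 3) with 8-bit values and `bins` bins per channel."""
    hists = []
    for c in range(3):
        idx = (region[..., c].astype(int) * bins) // 256   # bin index b_u
        h = np.bincount(idx.ravel(), minlength=bins).astype(float)
        hists.append(h / h.sum())                          # normalise
    return np.stack(hists)    # shape (3, bins)
```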

4.2. Texture cue

Although there is no unique definition of texture, it is generally agreed that texture describes the spatial arrangement of pixel levels in an image, which may be stochastic or periodic, or both [24]. Texture can be qualitatively characterised as fine, coarse, grained or smooth. When a texture is viewed from a distance it may appear to be fine; however, when viewed from close up it may appear to be coarse.

The texture description used for this work is based on a steerable pyramid decomposition [25]. The first derivative filter developed in [26] is steered to four orientations at two scales (subsampled by a factor of two). A histogram is then constructed for each of the eight bandpass filter outputs:

$$t_{i,\mathbf{x}} = \sum_{\mathbf{u} \in S_\mathbf{x}} \delta_i(t_\mathbf{u}), \quad i = 1, \ldots, B_T, \quad (15)$$

where $t_\mathbf{u} \in \{1, \ldots, B_T\}$ is the histogram bin index associated with the steerable filter output at pixel location $\mathbf{u}$, and $B_T$ is the number of bins. The histogram is normalised such that $\sum_{i=1}^{B_T} t_{i,\mathbf{x}} = 1$.
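The following simplified sketch illustrates the idea of steered first-derivative responses at several orientations and scales. It substitutes SciPy's Sobel derivatives for the filters of [26] and plain subsampling for a proper pyramid, so it is only a stand-in for the true steerable pyramid used in the paper:

```python
import numpy as np
from scipy import ndimage

def steered_responses(image, n_orient=4, n_scales=2):
    """Oriented first-derivative responses at several scales. A first
    derivative filter steered to angle t is cos(t)*dI/dx + sin(t)*dI/dy."""
    responses = []
    img = image.astype(float)
    for _ in range(n_scales):
        gx = ndimage.sobel(img, axis=1)   # approximate dI/dx
        gy = ndimage.sobel(img, axis=0)   # approximate dI/dy
        for t in np.arange(n_orient) * np.pi / n_orient:
            responses.append(np.cos(t) * gx + np.sin(t) * gy)
        img = img[::2, ::2]   # subsample by two for the next, coarser scale
    return responses          # n_orient * n_scales band-pass outputs
```

A histogram as in Eq. (15) would then be built from each of the returned outputs.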

4.3. Edge cue

Edge cues are useful for modelling the structure of the object to be tracked. The edges are described using a histogram based on the estimated edge direction. Given an image region $S_\mathbf{x}$, the intensity of the pixels in that region is $I(S_\mathbf{x})$. The edge images are constructed by estimating the gradients $\partial I/\partial x$ and $\partial I/\partial y$ in the $x$ and $y$ directions, respectively, by Prewitt operators. The edge strength $m$ and direction $\theta$ are then approximated as

$$m(\mathbf{u}) = \sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}, \qquad \theta(\mathbf{u}) = \tan^{-1}\left(\frac{\partial I}{\partial y} \Big/ \frac{\partial I}{\partial x}\right). \quad (16)$$

The edge direction is filtered to include only edges with magnitude above a predefined threshold:

$$\theta(\mathbf{u}) = \begin{cases} \theta(\mathbf{u}), & m(\mathbf{u}) > \text{threshold}, \\ 0, & \text{otherwise}. \end{cases} \quad (17)$$

A histogram $e_{i,\mathbf{x}}$ of the edge directions $\theta$ is then constructed:

$$e_{i,\mathbf{x}} = \sum_{\mathbf{u} \in S_\mathbf{x}} \delta_i(b_\mathbf{u}), \quad i = 1, \ldots, B_E, \quad (18)$$

where $b_\mathbf{u} \in \{1, \ldots, B_E\}$ is the histogram bin index associated with the thresholded edge direction $\theta$ at pixel location $\mathbf{u}$, and $B_E$ is the number of bins. The histogram is normalised such that $\sum_{i=1}^{B_E} e_{i,\mathbf{x}} = 1$.
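A minimal sketch of Eqs. (16)–(18) using SciPy's Prewitt operators; the threshold value is illustrative only:

```python
import numpy as np
from scipy import ndimage

def edge_histogram(region, bins=8, threshold=50.0):
    """Histogram of thresholded edge directions, Eqs. (16)-(18)."""
    I = region.astype(float)
    gx = ndimage.prewitt(I, axis=1)   # dI/dx
    gy = ndimage.prewitt(I, axis=0)   # dI/dy
    m = np.hypot(gx, gy)              # edge strength, Eq. (16)
    theta = np.arctan2(gy, gx)        # edge direction in (-pi, pi]
    theta = np.where(m > threshold, theta, 0.0)   # Eq. (17)
    idx = ((theta + np.pi) / (2 * np.pi) * bins).astype(int).clip(0, bins - 1)
    h = np.bincount(idx.ravel(), minlength=bins).astype(float)
    return h / h.sum()                # normalised histogram, Eq. (18)
```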

4.4. Weighted histograms

The above histograms discard all information about the spatial arrangement of the features in the image. An alternative approach that incorporates the pixel distribution can help to give better performance [5]. More specifically, we can give greater weighting to pixels in the centre of the image region. This weighting can be done through the use of a convex and monotonically decreasing kernel, for example the Epanechnikov kernel [5] or the elliptical Gaussian function

$$K(\mathbf{u}) = \frac{1}{2\pi q_x q_y} \exp\left(-\left(\frac{(x - \bar{x})^2}{2q_x^2} + \frac{(y - \bar{y})^2}{2q_y^2}\right)\right), \quad (19)$$

where the values $q_x^2$ and $q_y^2$ control the spatial significance of the weighting function in the $x$ and $y$ directions and the centre pixel of the target region is at $(\bar{x}, \bar{y})$. This kernel can be used to weight the pixels when extracting the histogram:

$$h_{i,\mathbf{x}} = \sum_{\mathbf{u} \in S_\mathbf{x}} K(\mathbf{u})\, \delta_i(b_\mathbf{u}), \quad i = 1, \ldots, B_C. \quad (20)$$

The histogram is normalised so that $\sum_{i=1}^{B_C} h_{i,\mathbf{x}} = 1$.

Fig. 2. The effect of the $\sigma$ value on the cues. The results shown are for the colour cue; however, similar results apply for texture and edges. It can be seen that as $\sigma$ decreases there is more discrimination in the likelihood. As $\sigma$ becomes smaller the likelihood tends to zero. A method for dynamic setting of $\sigma$ is presented in Section 4.6.
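A sketch of the kernel (19) and the weighted histogram (20); `bin_index` is assumed to hold the bin index $b_\mathbf{u}$ for every pixel of the region:

```python
import numpy as np

def gaussian_kernel(height, width, qx, qy):
    """Elliptical Gaussian weighting of Eq. (19), centred on the region."""
    y, x = np.mgrid[0:height, 0:width]
    xc, yc = (width - 1) / 2.0, (height - 1) / 2.0
    K = np.exp(-((x - xc)**2 / (2 * qx**2) + (y - yc)**2 / (2 * qy**2)))
    return K / (2 * np.pi * qx * qy)

def weighted_histogram(bin_index, K, bins):
    """Kernel-weighted histogram, Eq. (20): each pixel contributes K(u)
    to its bin instead of a unit count."""
    h = np.bincount(bin_index.ravel(), weights=K.ravel(), minlength=bins)
    return h / h.sum()
```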

4.5. Distance measure

The Bhattacharyya measure [27] has been used previously for colour cues [5,7] because it has the important property that $\rho(p, p) = 1$. In the case here, the distributions for each cue are represented by the respective histograms [28]:

$$\rho(h_{\mathrm{ref}}, h_{\mathrm{tar}}) = \sum_{i=1}^{B} \sqrt{h_{\mathrm{ref},i}\, h_{\mathrm{tar},i}}, \quad (21)$$

where the two normalised histograms $h_{\mathrm{tar}}$ and $h_{\mathrm{ref}}$ describe the cues for a target region defined in the current frame and a reference region in the first frame, respectively. The measure of the similarity between these two distributions is then given by the Bhattacharyya distance

$$d(h_{\mathrm{ref}}, h_{\mathrm{tar}}) = \sqrt{1 - \rho(h_{\mathrm{ref}}, h_{\mathrm{tar}})}. \quad (22)$$

The larger the measure $\rho(h_{\mathrm{ref}}, h_{\mathrm{tar}})$ is, the more similar the distributions are. Conversely, for the distance $d$, the smaller the value, the more similar the distributions (histograms) are. For two identical normalised histograms we obtain $d = 0$ ($\rho = 1$), indicating a perfect match.

Based on (22), a distance $D^2$ for colour can be defined that takes into account all of the colour channels:

$$D^2(h_{\mathrm{ref}}, h_{\mathrm{tar}}) = \frac{1}{3} \sum_{c \in \{R,G,B\}} d^2(h^c_{\mathrm{ref}}, h^c_{\mathrm{tar}}). \quad (23)$$

The distance $D^2$ for the edges is equal to $d^2$ since there is only one component. The distance $D^2$ for texture is

$$D^2(h_{\mathrm{ref}}, h_{\mathrm{tar}}) = \frac{1}{8} \sum_{\omega \in \{1,\ldots,8\}} d^2(h^\omega_{\mathrm{ref}}, h^\omega_{\mathrm{tar}}), \quad (24)$$

where $\omega$ is the channel in the steerable-pyramid decomposition. The likelihood function for the cues can be defined by [9]

$$L(\mathbf{z}|\mathbf{x}) \propto \exp\left(-\frac{D^2(h_{\mathrm{ref}}, h_{\mathbf{x}})}{2\sigma^2}\right), \quad (25)$$

where the standard deviation $\sigma$ specifies the Gaussian noise in the measurements. Note that small Bhattacharyya distances correspond to large weights in the particle filter. The choice of an appropriate value for $\sigma$ is usually left as a design parameter; a method for setting and adapting the value is proposed in Section 4.6.
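A minimal sketch of Eqs. (21), (22) and (25) for a multi-channel cue, averaging $d^2$ over channels as in Eqs. (23) and (24):

```python
import numpy as np

def bhattacharyya_distance(h_ref, h_tar):
    """d of Eq. (22) from the similarity rho of Eq. (21); the inputs are
    normalised histograms."""
    rho = np.sum(np.sqrt(h_ref * h_tar))
    return np.sqrt(max(0.0, 1.0 - rho))   # guard against rounding above 1

def cue_likelihood(h_ref, h_tar, sigma):
    """Likelihood of Eq. (25) for a cue whose histograms have shape
    (channels, bins); D^2 averages d^2 over channels, Eqs. (23)-(24)."""
    d2 = np.mean([bhattacharyya_distance(r, t) ** 2
                  for r, t in zip(h_ref, h_tar)])
    return np.exp(-d2 / (2.0 * sigma ** 2))
```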

4.6. Dynamic parameter setting

Fig. 2 shows the likelihood surface for a single cue, applied to one frame, with different values of $\sigma$. It can be seen that as $\sigma$ is varied the likelihood surface changes significantly. The likelihood becomes more discriminating as the value of $\sigma$ is decreased. However, if $\sigma$ is too small and there has been some change in the appearance of the object, due to noise, then the likelihood function may have all values near to or equal to zero. The value of the noise parameter $\sigma$ has a major influence on the properties of the likelihood (25). Typically, the choice of this value is left as a design parameter to be determined, usually by experimentation. For a well constrained problem, such as face tracking, analysis can be performed off-line to determine an appropriate value. However, if the algorithm is to be used to track a priori unknown objects, it may not be possible to determine one value for all objects. To overcome the problem of choosing an appropriate value for $\sigma$, an adaptive scheme is presented here which aims to maximise the information available in the likelihood, using the Bhattacharyya distance $d$. We define the minimum squared distance $D^2_{l,\min}$ as the minimum of the set of distances $D^2$ calculated for all particles with a particular cue $l$ ($l = 1, 2, \ldots, L$), where $L$ is the number of cues. Rearranging the likelihood yields

$$\log(L) = -\frac{D^2}{2\sigma^2}, \quad (26)$$

from which we can get

$$\sigma = \frac{\sqrt{2}}{2} \sqrt{-\frac{D^2_{l,\min}}{\log(L)}}. \quad (27)$$


For example, if the choice is made to maximise the information by setting $\sigma$ to give a maximum likelihood $\log(L) = -1$ ($L \approx 0.36$), then $\sigma = \sqrt{2 D^2_{l,\min}}/2$.
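A one-function sketch of Eq. (27); the default target log-likelihood of $-1$ reproduces the example above:

```python
import numpy as np

def adaptive_sigma(d2_all_particles, target_log_likelihood=-1.0):
    """Eq. (27): set sigma from the smallest squared distance over all
    particles so the best particle attains the chosen log-likelihood."""
    d2_min = np.min(d2_all_particles)
    return 0.5 * np.sqrt(2.0) * np.sqrt(-d2_min / target_log_likelihood)
```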

4.7. Multiple cues

The relationship between different cues has been treated differently by different authors. For example, [29] makes the assumption that colour and texture are not independent. However, other works [11,30] assume that colour and texture cues are independent. For the purposes of image classification, the independence assumption between colour and texture is applied to feature fusion in a Bayesian framework [31]. There is general agreement that in practice colour and texture, and colour and edges, combine well together.

We assume that the considered cue combinations, colour and texture, and colour and edges, are independent. With this assumption the overall likelihood function of the particle filter represents a product of the likelihoods of the separate cues:

$$L_{\mathrm{fused}}(\mathbf{z}_k|\mathbf{x}_k) = \prod_{l=1}^{L} L_l(\mathbf{z}_{l,k}|\mathbf{x}_k)^{\epsilon_l}. \quad (28)$$

The cues are adaptively weighted by weighting coefficients $\epsilon_l$; $\mathbf{z}_k$ denotes the measurement vector, composed of the measurement vectors $\mathbf{z}_{l,k}$ from the $l$th cue for $l = 1, \ldots, L$.


4.8. Adaptively weighted cues

A method is presented here which takes account of the Bhattacharyya distance (22) to give some significance to the likelihood obtained for each cue based on the current frame. This is different to previous works which use the performance of the cues over the previous frames [9], not taking into account information from the latest measurements. This allows an estimate to be made for $\epsilon_l$ in (28). Using the smallest value of the distance measure $D^2_{l,\min}$ for each cue, the weight for each cue $l$ is determined by

$$\tilde{\epsilon}_l = \frac{1}{D^2_{l,\min}}, \quad l = 1, \ldots, L. \quad (29)$$

The weights are then normalised such that $\sum_{l=1}^{L} \epsilon_l = 1$:

$$\epsilon_l = \frac{\tilde{\epsilon}_l}{\sum_{l=1}^{L} \tilde{\epsilon}_l}, \quad l = 1, \ldots, L. \quad (30)$$
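A small sketch of Eqs. (28)–(30); the example values at the end are illustrative only:

```python
import numpy as np

def cue_weights(d2_min_per_cue):
    """Eqs. (29)-(30): weight each cue by 1/D^2_min, then normalise."""
    e = 1.0 / np.asarray(d2_min_per_cue, dtype=float)
    return e / e.sum()

def fused_likelihood(cue_likelihoods, eps):
    """Eq. (28): weighted product of per-cue likelihoods for one particle."""
    return float(np.prod(np.asarray(cue_likelihoods) ** eps))

# The cue with the smaller minimum distance gets the larger weight:
eps = cue_weights([0.02, 0.08])                 # -> [0.8, 0.2]
print(eps, fused_likelihood([0.9, 0.5], eps))
```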

Fig. 3. Results from: (1) nonadaptive cues (N), (2) adaptive cues with automatic setting of $\sigma$ (A$\sigma$), (3) Gaussian weighting kernel for the histogram (WH), (4) both WH and A$\sigma$.

5. Experimental results

This section evaluates the performance of: (i) the colour, texture and edge cues, (ii) combined cues, and (iii) the mixed-state model in sequences with occlusion.


The combined root mean squared error (RMSE) [32]

$$\mathrm{RMSE}_{xy} = \sqrt{\frac{1}{R} \sum_{i=1}^{R} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]} \quad (31)$$

of the pixel coordinates $(x_i, y_i)$ and their estimates $(\hat{x}_i, \hat{y}_i)$ is the measure used to evaluate the performance of the developed technique in each of the frames $1, \ldots, N_f$ over $R = 100$ independent Monte Carlo realisations.
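For one frame, Eq. (31) can be computed as in the following sketch, with inputs of shape (R, 2):

```python
import numpy as np

def rmse_xy(true_xy, est_xy):
    """Combined position RMSE, Eq. (31), over R Monte Carlo runs for one
    frame; inputs are (R, 2) arrays of (x, y) pairs."""
    err = np.asarray(true_xy) - np.asarray(est_xy)
    return np.sqrt(np.mean(np.sum(err ** 2, axis=1)))
```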

5.1. Dynamic $\sigma$ and weighted histograms

Two techniques were given in this paper to improve the performance of the cues: (i) automatic setting of the noise parameter $\sigma$ for the likelihood (Section 4.6) and (ii) weighting of the pixels in the histogram extraction process (Section 4.4). The effect of these techniques can be seen in Fig. 3, where the RMSE$_{xy}$ is shown for 100 realisations, each with $N = 500$ particles and $N_{\mathrm{thres}} = N/2$, for tracking in a synthetic sequence using colour cues.

Four different implementations are compared: (1) nonadaptive cues (N), (2) adaptive cues with automatic setting of $\sigma$ (A$\sigma$), (3) Gaussian weighting kernel for the histogram (WH), and (4) both WH and A$\sigma$. The automatic setting of $\sigma$ and the use of a Gaussian weighting kernel for the histogram both provide an improvement. The smallest error is seen when the Gaussian weighting kernel for the histogram is combined with automatic setting of $\sigma$ (WH and A$\sigma$).

5.1.1. Single cues

Three very different tracking scenarios are used to highlight some of the strengths and weaknesses of the individual cues. All of the results are obtained using 500 particles. The first sequence is a wildlife problem which involves tracking a penguin [33] moving across a snowy background (Fig. 4). As is common in wildlife scenarios, the colour of the object to be tracked is similar to the background. The particle filter with colour cues (Figs. 4(a–c)) is distracted by similarly coloured background regions. Both the particle filter with edge cues (Figs. 4(d–f)) and with texture cues (Figs. 4(g–i)) perform much better and track the penguin successfully.

Fig. 4. Tracking a penguin against a snowy background. The colour cues (a–c) are distracted by the snowy background. Using either the edge (d–f) or the texture cues (g–i) provides an improvement.

The second sequence is a car tracking [33] problem with the car undergoing a significant and rapid change in scale. It can be seen that despite the change in scale, both the particle filters with colour (Figs. 5(a–c)) and edge cues (Figs. 5(d–f)) are able to keep track of the full state of the car. However, the texture cues (Figs. 5(g–i)) fail because of the change in appearance and distractions in the background.

Fig. 5. Tracking a car; as it moves away it undergoes a significant change in scale. The colour (a–c) and edge (d–f) cues both cope with the scale change. After the original object has undergone some change, the texture cues get distracted by the background.

The final example for single cues is tracking a logo, in a longer sequence, that is undergoing translation, rotation and some small amount of scale change. All three cues are able to keep track of the location of the object but the particle filter with colour cues (Figs. 6(a–c)) does not provide accurate state information for the rotation of the object. The particle filter with texture cues (Figs. 6(d–f)) provides better information about the rotation of the object and the particle filter with edge cues (Figs. 6(g–i)) provides the most accurate result of the three.

Fig. 6. Logo tracking as the logo undergoes translation, rotation and mild scale change. The colour cues (a–c) are able to locate the object but do not successfully capture the object rotation. The texture cues (g–i) provide more accurate information about the rotation and the edge cues (d–f) provide the most accurate information about the rotation.

5.1.2. Comparison with mean-shift tracker

An alternative tracking technique that has received a considerable amount of research interest recently is the mean-shift tracker [6,34]. A comparison between the visual particle filter presented here and the mean-shift tracker is particularly relevant because they are both based on histogram analysis. The mean-shift tracker is a mode-finding technique that locates a local maximum of the underlying density function. Based on the mean-shift vector, obtained as an estimate of the gradient of the Bhattacharyya function, the new object state estimate is calculated. The mean-shift algorithm was implemented with the Epanechnikov kernel [6].

A comparison between the results of the particle filter and the mean-shift tracker can be seen in Fig. 7; both algorithms use colour cues. The mean-shift tracker is unable to track successfully as the hand is moved in front of the face. The particle filter is slightly distracted by the face; however, it successfully tracks the hand due to the fact that it can maintain a multi-modal distribution for a number of frames, whereas the mean-shift tracker cannot. This illustrates the superiority of the particle filter with respect to the mean-shift tracker in the presence of ambiguous situations.

Fig. 7. Tracking a hand using colour cues to compare the performance of the mean-shift tracker with the particle filter. The mean-shift tracker (a–c) gets distracted by the face and does not recover. The particle filter (d–f) is distracted by the face but, because it is able to maintain a multi-modal distribution, it is able to recover.

The results presented here were obtained from a Matlab implementation, where the particle filter takes in the order of 10 times longer than the mean-shift tracker. The PF with the colour cue was also implemented in C++ on a standard PC (with a 2.66 GHz Pentium CPU) and has shown the ability to process 25–30 frames per second with the number of particles in the range from 100 to 500. This result shows the applicability of the PF to real-time problems. The same algorithm run as Matlab code needs eight times more computational time than its C++ version.

5.1.3. Multiple cues

From the results presented in the previous section it can be seen that no single cue can provide accurate results under all conditions. In this section, we look at the performance change when combining the cues.

First, the behaviour of the cue weighting scheme introduced in Section 4.7 is explored with an example, as shown in Fig. 8. In (d) the change in weights from edge to colour can be seen. At the start of the sequence the edges provide more accurate results and are therefore given a higher weighting. As the players turn around, the edges in the scene change and therefore the model learnt from the first frame becomes less reliable. In contrast, the colour of the region does not change significantly and so becomes relatively more reliable and is given a higher weighting.

Fig. 8. The particle filter is run with adaptively weighted colour and edge cues to track the football player's helmet. (a–c) show that the helmet is successfully tracked. The weighting assigned to the cues is shown in (d).

In the previous section the PF with the colour cue failed to track the penguin successfully (Fig. 4); we now look at how the performance of the PF is affected if the colour cues are combined with the edge and texture information. It can be seen in Fig. 9 that the previous performance of the edge and texture cues is maintained even when the colour cues are combined with them. In a hand tracking scenario the particle filter with edge cues (Figs. 10(a–c)) fails but the particle filter with colour cues (Figs. 10(d–f)) is successful. This is due to the fact that the particle filter can maintain multi-modal posterior distributions. The particle filter with combined colour and edge cues (Figs. 10(g–i)) successfully tracks the hand through the entire sequence.

Fig. 9. The same sequence as in Fig. 4, for which colour tracking failed. (a–c) show the results from tracking with combined colour and edge cues, (d–f) from colour and texture cues. It can be seen that combining the colour cues with either edge or texture cues provides accurate tracking.

5.2. Occlusion handling and handling the changeable window size

It is important that the tracking process is robust to partial occlusions and is able to recover after a full occlusion. The sequence shown in Fig. 11 contains full occlusions from which the tracker successfully recovers. This is due to the mixed-state model described in Section 3.4, which provides the ability to recover when the object undergoes full occlusion or re-enters the frame after leaving. Additionally, the use of the Gaussian kernel enhances the accuracy when features are extracted from the frame.

Fig. 10. Hand tracking as the hand undergoes translation, rotation and mild scale change. The colour cues (a–c) track the object, although some distraction is caused by the face. The edge cues (d–f) get distracted by the edge information in the light. Combining colour and edges (g–i) provides more accurate tracking of the hand.

Fig. 11. In this sequence the serve speed board (a) is being tracked; it undergoes both partial and full occlusions. The results show that the cues are resilient to partial occlusions, see (b) and (c). Using the mixed-state model the tracker is able to recover following a full occlusion.

6. Conclusions and future work

This paper has presented a sequential Monte Carlo technique for object tracking in a broad range of video sequences with visual cues. The visual cues, colour, edge and texture, form the likelihood of the developed particle filter. A method for automatic dynamic setting of the noise parameters of the cues is proposed to allow more flexibility in the object tracking. Multiple cues are combined for tracking, which has been shown to make the particle filter more able to accurately and robustly track a range of objects. These techniques can be further extended with other visual cues such as motion, and with non-visual cues.


The developed particle filter is compared with the mean-shift algorithm and its reliability is shown also in the presence of ambiguous situations. Although the particle filter is more time consuming than the mean-shift algorithm, it runs comfortably in real time.

Current and future areas for research include the investigation of alternative data fusion schemes, improved proposal distributions, tracking multiple objects and online adaptation of the target model.

Acknowledgements

The authors are grateful for the financial support provided by the UK MOD Data and Information Fusion Defence Technology Centre for this work.

References

[1] M. Isard, A. Blake, Condensation – conditional density propagation for visual tracking, International Journal of Computer Vision 28 (1) (1998) 5–28.

[2] N. Gordon, D. Salmond, A. Smith, A novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings F 140 (1993) 107–113.

[3] A. Doucet, S. Godsill, C. Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering, Statistics and Computing 10 (3) (2000) 197–208.

[4] M. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing 50 (2) (2002) 174–188.

[5] K. Nummiaro, E. Koller-Meier, L. Van Gool, An adaptive color-based particle filter, Image and Vision Computing 21 (1) (2003) 99–110.

[6] D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, SC, 2000, pp. 142–149.

[7] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (5) (2003) 564–577.

[8] C. Shen, A. van den Hengel, A. Dick, Probabilistic multiple cue integration for particle filter based tracking, in: C. Sun, H. Talbot, S. Ourselin, T. Adriansen (Eds.), Proceedings of the VIIth Digital Image Computing: Techniques and Applications, December 10–12, 2003.

[9] P. Perez, J. Vermaak, A. Blake, Data fusion for tracking with particles, Proceedings of the IEEE 92 (3) (2004) 495–513.

[10] M. Spengler, B. Schiele, Towards robust multi cue integration for visual tracking, Machine Vision and Applications 14 (1) (2003) 50–58.

[11] E. Ozyildiz, N. Krahnstover, R. Sharma, Adaptive texture and color segmentation for tracking moving objects, Pattern Recognition 35 (2002) 2013–2029.

[12] P. Perez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Proceedings of the 7th European Conference on Computer Vision, LNCS vol. 2350, Copenhagen, Denmark, 2002, pp. 661–675.

[13] E. Poon, D. Fleet, Hybrid Monte Carlo filtering: edge-based people tracking, in: Proceedings of the IEEE Workshop on Motion and Video Computing, Orlando, Florida, 2002, pp. 151–158.

[14] K. Toyama, G. Hager, Incremental focus of attention for robust vision-based tracking, International Journal of Computer Vision 35 (1) (1999) 45–63.

[15] J. Triesch, C. von der Malsburg, Democratic integration: self-organized integration of adaptive cues, Neural Computation 13 (9) (2001) 2049–2074.

[16] A. Jepson, D. Fleet, T.F. El-Maraghi, Robust online appearance models for visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10) (2003) 1296–1311.

[17] P. Brasnett, L. Mihaylova, N. Canagarajah, D. Bull, Particle filtering with multiple cues for object tracking in video sequences, in: Proceedings of SPIE's 17th Annual Symposium on Electronic Imaging, Science and Technology, vol. 5685, 2005, pp. 430–441.

[18] N. Bergman, Recursive Bayesian Estimation: Navigation and Tracking Applications, Ph.D. dissertation, Linkoping University, Linkoping, Sweden, 1999.

[19] J. Liu, R. Chen, Sequential Monte Carlo methods for dynamic systems, Journal of the American Statistical Association 93 (443) (1998) 1032–1044. Online: citeseer.nj.nec.com/article/liu98sequential.html.

[20] E. Wan, R. van der Merwe, The unscented Kalman filter, in: Kalman Filtering and Neural Networks, Wiley, September 2001, Chapter 7, pp. 221–280.

[21] Y. Bar-Shalom, X. Li, Estimation and Tracking: Principles, Techniques and Software, Artech House, 1993.

[22] M. Isard, A. Blake, Condensation – conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1) (1998) 5–28.

[23] M. Isard, A. Blake, ICondensation: unifying low-level and high-level tracking in a stochastic framework, in: Proceedings of the 5th European Conference on Computer Vision, vol. 1, 1998, pp. 893–908.

[24] R. Porter, Texture Classification and Segmentation, Ph.D. thesis, University of Bristol, Centre for Communications Research, 1997.

[25] W. Freeman, E. Adelson, The design and use of steerable filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991).

[26] A. Karasaridis, E.P. Simoncelli, A filter design technique for steerable pyramid image transforms, in: Proceedings of ICASSP, Atlanta, GA, May 1996.

[27] A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bulletin of the Calcutta Mathematical Society 35 (1943) 99–110.

[28] D. Scott, Multivariate Density Estimation: Theory, Practice and Visualization, Probability and Mathematical Statistics series, John Wiley and Sons, New York, 1992.

[29] R. Manduchi, Bayesian fusion of colour and texture segmentations, in: Proceedings of the IEEE International Conference on Computer Vision, September 1999.

[30] E. Saber, A.M. Tekalp, Integration of color, edge, shape and texture features for automatic region-based image annotation and retrieval, Journal of Electronic Imaging (special issue) 7 (3) (1998) 684–700.

[31] X. Shi, R. Manduchi, A study on Bayes feature fusion for image classification, in: Proceedings of the IEEE Workshop on Statistical Analysis in Computer Vision, vol. 8, June 2003.

[32] Y. Bar-Shalom, X.R. Li, T. Kirubarajan, Estimation with Applications to Tracking and Navigation, John Wiley and Sons, New York, 2001.

[33] BBC Radio 1, http://www.bbc.co.uk/calc/radio1/index.shtml, 2005, Creative Archive Licence.

[34] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 603–619.
