
Data Fusion for Visual Tracking with Particles

Patrick Pérez, Jaco Vermaak, and Andrew Blake

P. Pérez and A. Blake are with Microsoft Research, Cambridge. J. Vermaak is with the Signal Processing Laboratory, Cambridge University Engineering Department.

Abstract— The effectiveness of probabilistic tracking of objects in image sequences has been revolutionized by the development of particle filtering. Whereas Kalman filters are restricted to Gaussian distributions, particle filters can propagate more general distributions, albeit only approximately. This is of particular benefit in visual tracking because of the inherent ambiguity of the visual world that stems from its richness and complexity. One important advantage of the particle filtering framework is that it allows the information from different measurement sources to be fused in a principled manner. Although this fact has been acknowledged before, it has not been fully exploited within a visual tracking context. Here we introduce generic importance sampling mechanisms for data fusion and discuss them for fusing color with either stereo sound, for tele-conferencing, or with motion, for surveillance with a still camera. We show how each of the three cues can be modeled by an appropriate data likelihood function, and how the intermittent cues (sound or motion) are best handled by generating proposal distributions from their likelihood functions. Finally, the effective fusion of the cues by particle filtering is demonstrated on real tele-conference and surveillance data.

Index Terms— Visual tracking, data fusion, particle filters, sound, color, motion

I. INTRODUCTION

Visual tracking entails the detection and recursive localization of objects, or more generally features, in video sequences. The tracking of objects has become a ubiquitous elementary task in both online and offline image based applications, including visual servoing (e.g., [34]), surveillance (e.g., [19]), gestural human-machine interfaces and smart environments (e.g., [38], [11], [33]), video compression (e.g., [56]), augmented reality and visual effects (e.g., [35]), motion capture (e.g., [36]), environmental imaging (e.g., [12], [37]), and many more.

Recently Sequential Monte Carlo methods [15], [18], [20], [31], otherwise known as particle filters, have become popular tools to solve the tracking problem. Their popularity stems from their simplicity, flexibility, ease of implementation, and modeling success over a wide range of challenging applications. Within a visual tracking context these methods have been pioneered in the seminal paper by Isard and Blake [20], in which the term CONDENSATION was coined. This has subsequently led to a vast body of literature, which we shall not attempt to review here. See examples in [7], [13], [14], [26], [28], [41], [43], [45], [46], [48], [49].

One important advantage of the sequential Monte Carlo framework is that it allows the information from different measurement sources to be fused in a principled manner. Although this fact has been acknowledged before, it has not been fully exploited within a visual tracking context, where a host of cues are available to increase the reliability of the tracking algorithm. Data fusion with particle filters has been mostly confined to skin color and edge cues inside and around simple silhouette shapes in the context of face and hand tracking [21], [50], [58], [59].
In this paper we present a particle filter based visual tracker that fuses three cues in a novel way: color, motion and sound (Fig. 1). More specifically, we will introduce color as the main visual cue and fuse it, depending on the scenario under consideration, with either sound localization cues or motion activity cues. The generic objective is to track a specified object or region of interest in the sequence of images captured by the camera. We employ weak object models so as not to be too restrictive about the types of objects the algorithm can track, and to achieve robustness to large variations in the object pose, illumination, motion, etc. In this generic context, contour cues are less appropriate than color cues to characterize the visual appearance of tracked entities. The use of edge based cues indeed requires that the class of objects to be tracked is known a priori and that rather precise silhouette models can be learned beforehand. Note however that such conditions are met in a number of tracking applications where shape cues are routinely used [2], [3], [25], [30], [40], [44], [53].

Color localization cues are obtained by associating some reference color model with the object of interest. This reference model is then compared, in some sense, to similar models extracted from candidate regions in the image, and the smaller the discrepancy between the candidate and reference models, the higher the probability that the object is located in the corresponding image region. The color reference model can be obtained from some automatic detection module, or by allowing the user to label the object of interest by hand. The model can then be defined in parametric form, using for instance mixtures of Gaussians (e.g., [23], [55], [59]). In this paper we use instead a histogram based color model inspired by the powerful deterministic color trackers of Bradski [4] and Comaniciu et al. [9], [10]. The likelihood is built on the histogram distance between the empirical color distribution in the hypothesized region and the reference color model [39].

Along the same lines we also introduce motion cues based on histogramming successive frame differences. Using a form similar to the color likelihood, the motion likelihood is designed to favor regions exhibiting a temporal activity larger than the average temporal activity in the scene. It will prove to be particularly effective in drawing the attention of the tracker back to objects moving in front of a still background in cases where lock has been temporarily lost.





Fig. 1. Three types of raw data. We consider color based tracking combined with either motion cues (for surveillance with a static camera) or stereo sound cues (for speaker face tracking in a tele-conference setup). The corresponding measurements are respectively: (Left) RGB color video frames, (Middle) absolute luminance differences for pairs of successive frames, and (Right) stereo pairs of sound signal sections.

For audio-visual tracking the system setup consists of a single camera and a stereo microphone pair. The line connecting the microphones goes through the optical center of the camera, and is orthogonal to the camera optical axis. Sound localization cues are then obtained by measuring the Time Delay of Arrival (TDOA) between signals arriving at the two microphones comprising the pair. The TDOA gives an indication of the bearing of the sound source relative to the microphone pair. Given the configuration of the system, this bearing can in turn be related to a horizontal position in the image (Fig. 2).

The color cues tend to be remarkably persistent and robust to changes in pose and illumination. They are, however, more prone to ambiguity, especially if the scene contains other objects characterized by a color distribution similar to that of the object of interest. The motion and sound cues, on the other hand, tend to be intermittent, but are very discriminant when they are present, i.e., they allow the object to be located with low ambiguity.

The localization cues impact the particle filter based tracker in a number of ways. As is standard practice, we construct a likelihood model for each of the cues. These models are assumed to be mutually independent, an assumption that can be justified in the light that any correlation that may exist between the color, motion and sound of an object is likely to be weak. The intermittent and discriminant nature of the motion and sound cues makes them excellent candidates for the construction of detection modules and efficient proposal distributions. We will exploit this characteristic extensively.

Finally, the differing nature of the cues and the configuration of the system allow us to experiment with the order and manner in which the cues are incorporated. For example, since the sound cue only gives localization information in the horizontal direction of the image, we can search this direction first, and confine the search in the remainder of the state-space to regions for which the horizontal image component has been deemed highly likely to contain the object of interest. This strategy is known as Partitioned Sampling [32], and allows for a more efficient exploration of the state-space.

The remainder of the paper is organized as follows. Section II briefly outlines the Bayesian sequential estimation framework, and shows how a Monte Carlo implementation thereof leads to the particle filter. It also presents some alternative particle filter architectures for cases where information from multiple measurement sources is available. Section III presents and discusses all the ingredients of our proposed data fusion tracker based on color, motion and sound. This section is concluded with a summary of the tracking algorithm. Section IV presents some tracking scenarios, and illustrates the performance of the tracking algorithm under a variety of conditions. The usefulness of each localization cue, and their combined impact, are evaluated. Finally, we conclude the paper in Section V with a summary and some suggestions for future research.

II. SEQUENTIAL MONTE CARLO AND DATA FUSION

Sequential Monte Carlo techniques for filtering time series [15], [18], [20], [31], and their use in the specific context of visual tracking [22], have been described at length in the literature. In what follows we give a brief summary of the framework, and discuss in some detail the architectural variations that are afforded by the presence of multiple measurement sources.

Denote by $x_n$ and $y_n$ the hidden state of the object of interest and the measurements at discrete time $n$, respectively. For tracking, the distribution of interest is the posterior $p(x_n|y_{1:n})$, also known as the filtering distribution, where $y_{1:n} = (y_1, \dots, y_n)$ denotes all the observations up to the current time step. In Bayesian sequential estimation the filtering distribution can be computed according to the two-step recursion¹

prediction step:
$$p(x_n|y_{1:n-1}) = \int p(x_n|x_{n-1})\, p(x_{n-1}|y_{1:n-1})\, dx_{n-1} \qquad (1)$$

filtering step:
$$p(x_n|y_{1:n}) \propto p(y_n|x_n)\, p(x_n|y_{1:n-1}), \qquad (2)$$

where the prediction step follows from marginalization, and the new filtering distribution is obtained through a direct application of Bayes' rule. The recursion requires the specification of a dynamic model describing the state evolution, $p(x_n|x_{n-1})$, and a model that gives the likelihood of any state in the light of the current observation, $p(y_n|x_n)$, along with the following conditional independence assumptions:
$$x_n \perp y_{1:n-1} \mid x_{n-1} \quad \text{and} \quad y_n \perp y_{1:n-1} \mid x_n. \qquad (3)$$

The recursion is initialized with some distribution for the initial state $p(x_0)$. Once the sequence of filtering distributions is known, point estimates of the state can be obtained according to any appropriate loss function, leading for example to the Maximum a Posteriori (MAP) estimate, $\arg\max_{x_n} p(x_n|y_{1:n})$, and to the Minimum Mean Square Error (MMSE) estimate, $\int x_n\, p(x_n|y_{1:n})\, dx_n$.

¹Notation "$\propto$" means that the conditional distribution on the left is proportional to the function on the right, up to a multiplicative "constant" that may depend on the conditioning argument.


The tracking recursion yields closed-form expressions in only a small number of cases. The most well-known of these is the Kalman filter [1] for linear Gaussian likelihood and state evolution models. The models encountered in visual tracking are often non-linear, non-Gaussian, multi-modal, or any combination of these. One reason for this stems from the fact that the observation model often specifies which part of the data is of interest given the state, leading to $p(y_n|x_n)$ being non-linear and often multi-modal with respect to the state $x_n$. This renders the tracking recursion analytically intractable, and approximation techniques are required.

Sequential Monte Carlo methods [15], [18], [20], [31], otherwise known as particle filters, have gained a lot of popularity in recent years as a numerical approximation to the tracking recursion for complex models. This is due to their simplicity, flexibility, ease of implementation, and modeling success over a wide range of challenging applications.

The basic idea behind particle filters is very simple. Starting with a weighted set of samples $\{x^{(i)}_{n-1}, w^{(i)}_{n-1}\}_{i=1}^{N_p}$ approximately distributed according to $p(x_{n-1}|y_{1:n-1})$, new samples are generated from a suitably chosen proposal distribution, which may depend on the old state and the new measurements, i.e., $x^{(i)}_n \sim q(x_n|x^{(i)}_{n-1}, y_n)$, $i = 1, \dots, N_p$. To maintain a consistent sample the new importance weights are set to²
$$w^{(i)}_n \propto w^{(i)}_{n-1}\, \frac{p(y_n|x^{(i)}_n)\, p(x^{(i)}_n|x^{(i)}_{n-1})}{q(x^{(i)}_n|x^{(i)}_{n-1}, y_n)}, \quad \text{with} \quad \sum_{i=1}^{N_p} w^{(i)}_n = 1. \qquad (4)$$
The new particle set $\{x^{(i)}_n, w^{(i)}_n\}_{i=1}^{N_p}$ is then approximately distributed according to $p(x_n|y_{1:n})$.

Approximations to the desired point estimates can then be obtained by Monte Carlo techniques. From time to time it is necessary to resample the particles to avoid degeneracy of the importance weights, that is, the concentration of most of the weight on a single particle. In the absence of resampling this phenomenon always occurs in practice, dramatically degrading the sample based approximation of the filtering distribution, along with the quality of any point estimate based on it. The resampling procedure essentially multiplies particles with high importance weights, and discards those with low importance weights, while preserving the asymptotic properties of the sample based approximation of the filtering distribution. This procedure can be applied at each time step, or be invoked only when a measure of the "quality" of the weights falls below a threshold. A full discussion of degeneracy and resampling falls outside the scope of this paper, but more detail can be found in [15]. The synopsis of the generic particle filter iteration is given in Tab. I.

The performance of the particle filter hinges on the quality of the proposal distribution.

²This can be seen by considering sample trajectories $x^{(i)}_{1:n}$ [15]. These are distributed according to $q(x_1|y_1) \prod_{k=2}^{n} q(x_k|x_{k-1}, y_k)$ instead of the true target distribution $p(y_{1:n})^{-1}\, p(x_1, y_1) \prod_{k=2}^{n} p(y_k|x_k)\, p(x_k|x_{k-1})$. According to importance sampling theory [16], the discrepancy is compensated for by associating importance weights proportional to the ratio of the target distribution to the proposal distribution, which yields in this case the recursion in (4).

TABLE I
GENERIC PARTICLE FILTER.

With $\{x^{(i)}_{n-1}, w^{(i)}_{n-1}\}_{i=1}^{N_p}$ the particle set at the previous time step, proceed as follows at time $n$:
- Proposition: simulate $x^{(i)}_n \sim q(x_n|x^{(i)}_{n-1}, y_n)$.
- Update weights: $w^{(i)}_n \propto w^{(i)}_{n-1}\, \dfrac{p(y_n|x^{(i)}_n)\, p(x^{(i)}_n|x^{(i)}_{n-1})}{q(x^{(i)}_n|x^{(i)}_{n-1}, y_n)}$, with $\sum_{i=1}^{N_p} w^{(i)}_n = 1$.
- If resampling: simulate $a_i \sim \{w^{(k)}_n\}_{k=1}^{N_p}$, and replace $\{x^{(i)}_n, w^{(i)}_n\} \leftarrow \{x^{(a_i)}_n, 1/N_p\}$.
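For concreteness, the following Python sketch implements one iteration of the generic particle filter of Tab. I. It is not the authors' implementation: the user-supplied callables (`propose`, `trans_pdf`, `lik_pdf`, `prop_pdf`), the effective-sample-size resampling trigger and its threshold are illustrative assumptions.

```python
import numpy as np

def particle_filter_step(particles, weights, propose, trans_pdf, lik_pdf,
                         prop_pdf, rng, resample_thresh=0.5):
    """One iteration of the generic particle filter (Tab. I).

    particles : (Np, dx) array of samples x_{n-1}^{(i)}
    weights   : (Np,) normalized importance weights w_{n-1}^{(i)}
    propose   : (x_prev, rng) -> x_new, draws from q(x_n | x_{n-1}, y_n)
    trans_pdf : (x_new, x_prev) -> p(x_n | x_{n-1})
    lik_pdf   : (x_new) -> p(y_n | x_n)
    prop_pdf  : (x_new, x_prev) -> q(x_n | x_{n-1}, y_n)
    """
    Np = len(particles)
    # Proposition: simulate x_n^{(i)} ~ q(x_n | x_{n-1}^{(i)}, y_n)
    new_particles = np.array([propose(x, rng) for x in particles])
    # Weight update, Eq. (4)
    w = np.array([weights[i] * lik_pdf(new_particles[i])
                  * trans_pdf(new_particles[i], particles[i])
                  / prop_pdf(new_particles[i], particles[i])
                  for i in range(Np)])
    w /= w.sum()
    # Resample when the effective sample size drops too low (one common rule)
    ess = 1.0 / np.sum(w ** 2)
    if ess < resample_thresh * Np:
        idx = rng.choice(Np, size=Np, p=w)      # a_i ~ {w_n^(k)}
        new_particles, w = new_particles[idx], np.full(Np, 1.0 / Np)
    return new_particles, w
```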

The Bootstrap Filter [18], which is the first modern variant of the particle filter, uses the state evolution model $p(x_n|x_{n-1})$ as proposal distribution, so that the new importance weights in (4) become proportional to the corresponding particle likelihoods. This leads to a very simple algorithm, requiring only the ability to simulate from the state evolution model and to evaluate the likelihood. However, it performs poorly for narrow likelihood functions, especially in higher dimensional spaces. In [15] it is proved that the optimal choice for the proposal distribution (in terms of minimizing the variance of the importance weights) is the posterior $p(x_n|x_{n-1}, y_n) \propto p(y_n|x_n)\, p(x_n|x_{n-1})$. However, the normalizing constant for this distribution,
$$\int p(y_n|x_n)\, p(x_n|x_{n-1})\, dx_n = p(y_n|x_{n-1}),$$
is rarely available in closed form, making direct sampling from this optimal proposal distribution impossible. The challenge in particle filtering applications is then to design efficient proposal distributions that approximate the optimal choice as closely as possible. We will give careful consideration to this issue in the design of our tracker in Section III.

For multiple measurement sources the general particle filtering framework can still be applied. However, it is possible to devise strategies to increase the efficiency of the particle filter by exploiting the relation between the structure of the model and the information in the various measurement modalities. In what follows we will suppress the time index $n$ for notational compactness. Assume that we have $M$ measurement sources, so that the instantaneous measurement vector can be written as $y = (y^1, \dots, y^M)$. We will further assume that the measurements are conditionally independent given the state, so that the likelihood can be factorized as
$$p(y|x) = \prod_{m=1}^{M} p(y^m|x). \qquad (5)$$
With this setting the generic filter summarized in Tab. I could be used as is, with the weight update involving $M$ likelihood evaluations according to (5). However, the factorized structure of the likelihood can be better exploited. To this end, we introduce the following abstract framework: let us assume that the state evolution and proposal distributions decompose as
$$p(x|x') = \int p_M(x|x^{M-1}) \cdots p_1(x^1|x')\, dx^1 \cdots dx^{M-1} \qquad (6)$$
$$q(x|x', y) = \int q_M(x|x^{M-1}, y^M) \cdots q_1(x^1|x', y^1)\, dx^1 \cdots dx^{M-1}, \qquad (7)$$


TABLE II
GENERIC LAYERED SAMPLING PARTICLE FILTER TO FUSE M OBSERVATION MODALITIES.

With $\{x^{(i)}_{n-1}, w^{(i)}_{n-1}\}_{i=1}^{N_p}$ the particle set at the previous time step, proceed as follows at time $n$:
- Initialize: $\{x^{0(i)}, w^{0(i)}\}_{i=1}^{N_p} = \{x^{(i)}_{n-1}, w^{(i)}_{n-1}\}_{i=1}^{N_p}$.
- Layered sampling: for $m = 1, \dots, M$
  - Proposition: simulate $x^{m(i)} \sim q_m(x^m|x^{m-1(i)}, y^m_n)$.
  - Update weights: $w^{m(i)} \propto w^{m-1(i)}\, \dfrac{p(y^m_n|x^{m(i)})\, p_m(x^{m(i)}|x^{m-1(i)})}{q_m(x^{m(i)}|x^{m-1(i)}, y^m_n)}$, with $\sum_{i=1}^{N_p} w^{m(i)} = 1$.
  - If resampling: simulate $a_i \sim \{w^{m(k)}\}_{k=1}^{N_p}$, and replace $\{x^{m(i)}, w^{m(i)}\} \leftarrow \{x^{m(a_i)}, 1/N_p\}$.
- Terminate: $\{x^{(i)}_n, w^{(i)}_n\}_{i=1}^{N_p} = \{x^{M(i)}, w^{M(i)}\}_{i=1}^{N_p}$.

where $x^1, \dots, x^{M-1}$ are "auxiliary" state vectors. Eq. (6) simply amounts to a splitting of the original evolution model into $M$ successive intermediary steps. This can, for example, be done when the state is $M$-dimensional and the corresponding component-wise evolution models are independent, and/or when the evolution model is linear and Gaussian, and can thus easily be fragmented into $M$ successive steps with lower variances.

If we make the approximation that the likelihood for the $m$-th measurement modality, $p(y^m|x)$, can be incorporated after applying the $m$-th state evolution model $p_m(x^m|x^{m-1})$, we can set up a recursion to compute the new target distribution that takes the form
$$\pi_m(x^m) \propto \int w_m(x^m, x^{m-1})\, q_m(x^m|x^{m-1}, y^m)\, \pi_{m-1}(x^{m-1})\, dx^{m-1}, \quad m = 1, \dots, M \qquad (8)$$
$$\text{with} \quad w_m(x^m, x^{m-1}) = \frac{p(y^m|x^m)\, p_m(x^m|x^{m-1})}{q_m(x^m|x^{m-1}, y^m)}, \qquad (9)$$
where $\pi_0$ and $\pi_M$ are respectively the previous and the new filtering distributions, $x^0 = x'$, and $x^M = x$. This recursion can be approximated with a layered sampling strategy, where at the $m$-th stage new samples are simulated from a Monte Carlo approximation of the distribution $q_m(x^m|x^{m-1}, y^m)\, \pi_{m-1}(x^{m-1})$, with an associated importance weight proportional to $w_m(x^m, x^{m-1})$ to yield a properly weighted sample. The synopsis of this generic layered sampling strategy for data fusion is given in Tab. II.
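A minimal Python sketch of the layered sampling iteration of Tab. II is given below, assuming each modality is described by the same kind of user-supplied callables as in the generic filter sketch above; resampling after every layer is one possible choice, not necessarily the authors'.

```python
import numpy as np

def layered_sampling_step(particles, weights, layers, rng):
    """One iteration of the layered sampling filter (Tab. II).

    layers : list of M dicts, one per modality m, with callables
             'propose'  : (x_prev, rng) -> x_m, draws from q_m(. | x_{m-1}, y_m)
             'trans_pdf': (x_new, x_prev) -> p_m(x_m | x_{m-1})
             'lik_pdf'  : (x_new) -> p(y_m | x_m)
             'prop_pdf' : (x_new, x_prev) -> q_m(x_m | x_{m-1}, y_m)
    """
    Np = len(particles)
    x, w = particles.copy(), weights.copy()        # initialize with previous set
    for layer in layers:                           # m = 1 ... M
        x_new = np.array([layer['propose'](xi, rng) for xi in x])
        # weight w_m of Eq. (9), accumulated over layers
        w = np.array([w[i] * layer['lik_pdf'](x_new[i])
                      * layer['trans_pdf'](x_new[i], x[i])
                      / layer['prop_pdf'](x_new[i], x[i]) for i in range(Np)])
        w /= w.sum()
        idx = rng.choice(Np, size=Np, p=w)         # resample after each layer
        x, w = x_new[idx], np.full(Np, 1.0 / Np)
    return x, w
```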

As it stands the layered sampling approach provides no obvious advantage over the standard particle filtering framework. Its true benefit arises in cases where the measurement modalities differ in the level of information they provide about the state. If the measurement modalities are then ordered from coarse to fine, the layered sampling approach will effectively guide the search in the state-space, with each stage refining the result from the previous stage.

In some special applications the likelihood and state evolution models are independent over a component-wise partitioning of the state-space, i.e.,
$$p(y|x) = \prod_{m=1}^{M} p(y^m|x_m) \qquad (10)$$
$$p(x|x') = \prod_{m=1}^{M} p(x_m|x'_m), \qquad (11)$$
with $x = (x_1, \dots, x_M)$. For models of this nature the layered sampling procedure, with
$$p_m(x^m|x^{m-1}) = p(x^m_m|x^{m-1}_m) \prod_{k \neq m} \delta_{x^{m-1}_k}(x^m_k) \qquad (12)$$
in (6) and (9), is exact, and known as Partitioned Sampling [32]. It effectively replaces the search in the full state-space by a succession of $M$ easier search problems, each in a lower dimensional space. We will make use of these strategies when designing our tracking algorithm in Section III.

III. DATA FUSION VISUAL TRACKER

In this section we describe in detail all the ingredients of our tracking algorithm based on color, motion and sound. We first present the system configuration and the object model, and then proceed to discuss the localization cues and their impact on the tracking algorithm in more detail. The section is concluded with a summary of the tracking algorithm.

A. Audio-Visual System Setup

The setup of the tracking system is depicted in Fig. 2. It consists of a single camera and a stereo microphone pair. The line connecting the microphones goes through the optical center of the camera, and is orthogonal to the camera optical axis. Note, however, that in our experiments the object of interest will not always be a talking head in front of the camera. In such cases sound measurements will generally be absent, and we will only rely on the visual cues.

The system requires only a small number of calibration parameters: the microphone separation $d$, the camera focal length $f$, the width of the camera image plane in the real world $\widetilde{W}$, and the width in pixels of the digital image $W$. These parameters are normally easy to obtain. Nevertheless, the tracking algorithm we develop is robust to reasonable inaccuracies in the system setup and variations in the calibration parameters. Since it is probabilistic in nature, these errors are easily accommodated by explicitly modeling the measurement uncertainty in the corresponding likelihood models.

Our objective is to track a specified object or region of interest in the sequence of images captured by the camera. To this end the raw measurements available are the images themselves, and the audio signals captured at the two microphones comprising the pair. Since the tracking is performed in the video sequence, the discrete time index $n$ corresponds to the video frame number. As opposed to the video sequence, which is naturally discretized, the audio samples arrive continuously, and there is no notion of natural audio frames. For the purposes of the tracking algorithm, however, we define the $n$-th audio frame as a window of $N_s$ audio samples centered around the sample corresponding to the $n$-th video frame.



Fig. 2. Setup for audio-visual tracking. The system calibration parameters are the microphone separation $d$, the camera focal length $f$, the width of the image plane in the real world $\widetilde{W}$, and the width in pixels of the digital image $W$.

If $T_v$ and $T_s$ denote the sampling period for video frames and audio samples, respectively, the center of the audio frame corresponding to the $n$-th video frame can be computed as
$$n_s = [(n-1)\, T_v / T_s + 1], \qquad (13)$$
where $[\cdot]$ denotes the rounding operation. The number of samples in the audio frame, $N_s$, is normally taken such that the duration of the audio frame is roughly 50 ms.
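As a small illustration, the sketch below extracts the $n$-th audio frame from a mono channel according to (13); the sampling rates in the usage comment are illustrative values only.

```python
import numpy as np

def audio_frame(audio, n, Tv, Ts, Ns):
    """Return the n-th audio frame: Ns samples centered on the sample
    aligned with the n-th video frame, Eq. (13)."""
    ns = int(round((n - 1) * Tv / Ts + 1))     # center sample index (1-based)
    start = max(ns - Ns // 2 - 1, 0)           # convert to 0-based indexing
    return audio[start:start + Ns]

# e.g. 30 fps video and 16 kHz audio, ~50 ms frames (illustrative values):
# frame = audio_frame(left_channel, n=10, Tv=1/30, Ts=1/16000, Ns=800)
```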

The raw image and audio frames are very high dimensional, and contain lots of information that is redundant with regard to object tracking. We thus pass the raw data through a signal processing front-end with the purpose of extracting features important for the tracking process. More specifically, we extract color and motion features from the image data, and Time Delay Of Arrival (TDOA) features from the audio data. With color being the most persistent feature of the object, we will use it as the main visual cue and fuse it with the more intermittent motion and sound cues. In what follows we will denote the combined color, motion and sound measurements at time $n$ by $y_n = (y^C_n, y^M_n, y^S_n)$. We will suppress the time index $n$ in cases where there is no danger of ambiguities arising. The measurement procedures for each of the cues are described in detail in the relevant sections that follow.

B. Object Model

Our objective is to track a specified object or region of interest in the image sequence. We will aim to use models that make only weak assumptions about the precise object configuration, so as not to be too restrictive about the types of objects that can be tracked, and to achieve robustness to large variations in the object pose, illumination, motion, etc. In the approach adopted here the shape of the reference region, denoted by $B^\star$, is fixed a priori. It can be an ellipse or a rectangular box as in [4], [6], [9], but our modeling framework places no restrictions on the class of shapes that can be accommodated. More complex hand-drawn or learned shapes can be used if relevant.

Tracking then amounts to recursively estimating the parameters of the transformation to apply to $B^\star$ so that the implied region in each frame best matches the original reference region. Affinity or similitude transforms are popular choices. Since the color model we will describe in Section III-C is global with respect to the region of interest, we consider only translation and scaling of the reference region. This means that the reference region can be parameterized as $B^\star = (x^\star, y^\star, w^\star, h^\star)$, where $(x^\star, y^\star)$ is the center of the reference region bounding box, and $w^\star$ and $h^\star$ are its width and height, respectively. We define the hidden state as $x_n = (x_n, y_n, \alpha_n) \in \mathcal{X}$, with $\mathcal{X}$ denoting the state-space, so that the corresponding candidate region becomes $B_{x_n} = (x_n, y_n, \alpha_n w^\star, \alpha_n h^\star)$. The variables $(x_n, y_n)$ thus form the center of the candidate region, and $\alpha_n$ acts as a scale factor.

Most objects move in a fairly predictable way. It is thus good practice in general to design the state evolution model $p(x_n|x_{n-1})$ to capture the salient features of the object motion. However, we desire our algorithm to be applicable to any object that may be of interest, including people, faces, motor vehicles, etc. Within such a large population of objects the variability in the characteristics of the object motion is likely to be high. Furthermore, we are interested in tracking in the image sequence, and not in the real world. The mapping of the motion from the three dimensional world to the two dimensional image representation depends on the system configuration and the direction of the motion, and is unknown in practice. We acknowledge these uncertainties by adopting a very weak model for the state evolution. More specifically, we assume that the state components evolve according to mutually independent Gaussian random walk models. We augment these models with a small uniform component to capture the (rare) event where erratic motion in the real world is perceived as jumps in the image sequence. It also aids the algorithm in recovering lock after a period of partial or complete occlusion. Thus the complete state evolution model can be written as
$$p(x_n|x_{n-1}) = (1 - \beta_u)\, N(x_n|x_{n-1}, \Lambda) + \beta_u\, U_{\mathcal{X}}(x_n), \qquad (14)$$
where $N(\cdot|\mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance $\Sigma$, $U_A(\cdot)$ denotes the uniform distribution over the set $A$, $0 \le \beta_u \le 1$ is the weight of the uniform component, and $\Lambda = \mathrm{diag}(\sigma_x^2, \sigma_y^2, \sigma_\alpha^2)$ is the diagonal matrix with the variances of the random walk models on the components of the object state. The weight of the uniform component is typically set to be small. The object model is completed by the specification of the distribution for the initial state, and here we assume it to be uniform over the state-space, i.e., $p(x_0) = U_{\mathcal{X}}(x_0)$.
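As an illustration, a minimal sketch of sampling from, and evaluating, the mixture model (14) is given below; the bounds array that defines the state-space and the parameter values passed in are assumptions of the example, not the paper's settings.

```python
import numpy as np

def sample_state_evolution(x_prev, sig, beta_u, bounds, rng):
    """Draw x_n ~ p(x_n | x_{n-1}) of Eq. (14): Gaussian random walk plus a
    small uniform component over the state-space X.

    x_prev : (3,) array (x, y, alpha)
    sig    : (3,) random-walk std devs (sigma_x, sigma_y, sigma_alpha)
    bounds : (3, 2) array of [low, high] per component, defining X
    """
    if rng.random() < beta_u:
        return rng.uniform(bounds[:, 0], bounds[:, 1])    # uniform jump
    return x_prev + sig * rng.standard_normal(3)          # random walk

def state_evolution_pdf(x, x_prev, sig, beta_u, bounds):
    """Evaluate p(x_n | x_{n-1}) of Eq. (14), needed in the weight update (4)."""
    gauss = np.prod(np.exp(-0.5 * ((x - x_prev) / sig) ** 2)
                    / (np.sqrt(2 * np.pi) * sig))
    vol = np.prod(bounds[:, 1] - bounds[:, 0])
    inside = np.all((x >= bounds[:, 0]) & (x <= bounds[:, 1]))
    return (1 - beta_u) * gauss + beta_u * (inside / vol)
```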

C. Color Cues

When a specific class of objects with distinctive shape is considered and a complete model of this shape can be learned offline, contour cues are very powerful for capturing the visual appearance of tracked entities [2], [3], [25], [30], [40], [44], [53]. They can, however, be dramatically contaminated by clutter edges, even when detailed silhouette models are used [52]. Also, they are not adapted to scenarios where there is no predefined class of objects to be tracked, or where the class of objects of interest does not exhibit very distinctive silhouettes. The two tracking scenarios we consider here fall respectively into these two categories.

When shape modeling is not appropriate, color information is a powerful alternative for characterizing the appearance of tracked objects. As demonstrated for example in [4], [9], [10], [39], robust tracking can be achieved using only a simple color model constructed in the first frame, even in the presence of dramatic changes of shape and pose. Hence, the color features of an object or a region of interest often form a very persistent localization cue. In the two tracking scenarios we are interested in, color information is naturally chosen as the primary ingredient. In this section we derive the color likelihood model that we use.

While color cues are powerful for tracking, their simplicity sometimes results in a lack of discriminative power when it comes to (re)initializing the tracker. In Sections III-D and III-E we show how motion or sound cues can be combined with color cues to resolve ambiguities and increase the robustness of the tracker in two distinct scenarios. In particular, the good detection properties offered by these two auxiliary modalities will be fully exploited in the design of good proposal densities.

Color localization cues are obtained by associating a reference color model with the object or region of interest. This reference model can be obtained by hand-labeling, or from some automatic detection module. To assess whether a given candidate region contains the object of interest or not, a color model of the same form as the reference model is computed within the region, and compared to the reference model. The smaller the discrepancy between the candidate and reference models, the higher the probability that the object is located inside the candidate region.

For the color modeling we use independent normalized histograms in the three channels of the RGB color space. We denote the $B$-bin reference histogram model in channel $c \in \{R, G, B\}$ by $h^c_{\mathrm{ref}} = (h^c_{1,\mathrm{ref}}, \dots, h^c_{B,\mathrm{ref}})$. Recall from Section III-B that the region in the image corresponding to any state $x$ is given by $B_x$. Within this region an estimate of the histogram color model, denoted by $h^c_x = (h^c_{1,x}, \dots, h^c_{B,x})$, can be obtained as
$$h^c_{i,x} = c_H \sum_{u \in B_x} \delta_i(b^c_u), \quad i = 1, \dots, B, \qquad (15)$$
where $b^c_u \in \{1, \dots, B\}$ denotes the histogram bin index associated with the intensity at pixel location $u = (x, y)$ in channel $c$ of the color image $y^C$, $\delta_a$ denotes the Kronecker delta function at $a$, and $c_H$ is a normalizing constant such that $\sum_{i=1}^{B} h^c_{i,x} = 1$.
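A minimal sketch of the per-channel histogram computation of Eq. (15) is shown below; the 8-bin default, the uniform bin quantization, and the function name are illustrative assumptions.

```python
import numpy as np

def region_color_histograms(image, state, ref_size, B=8):
    """Normalized B-bin histograms h_x^c, c in {R,G,B}, over the region B_x
    implied by state x = (x, y, alpha), Eq. (15). `image` is an (H, W, 3)
    uint8 RGB array; `ref_size` = (w_star, h_star)."""
    xc, yc, alpha = state
    w, h = alpha * ref_size[0], alpha * ref_size[1]
    H, W = image.shape[:2]
    x0, x1 = int(max(xc - w / 2, 0)), int(min(xc + w / 2, W))
    y0, y1 = int(max(yc - h / 2, 0)), int(min(yc + h / 2, H))
    patch = image[y0:y1, x0:x1].reshape(-1, 3)
    hists = []
    for c in range(3):                                        # R, G, B channels
        bins = np.minimum(patch[:, c].astype(np.int64) * B // 256, B - 1)
        hc = np.bincount(bins, minlength=B).astype(float)
        hists.append(hc / max(hc.sum(), 1.0))                 # sum_i h_i = 1
    return np.array(hists)                                    # shape (3, B)
```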

The color likelihood model must be defined in such a way as to favor candidate color histograms close to the reference histogram. To this end we need to define a distance metric on histogram models. As is the case in [9], we base this distance on the Bhattacharyya similarity coefficient, defining it as
$$D(h_1, h_2) = \left(1 - \sum_{i=1}^{B} \sqrt{h_{i,1}\, h_{i,2}}\right)^{1/2}. \qquad (16)$$
In contrast to the Kullback-Leibler divergence, this distance is a proper metric, it is bounded within $[0, 1]$, and empty histogram bins do not cause singularities. Based on this distance we finally define the color likelihood model as
$$p(y^C|x) \propto \exp\left(-\sum_{c \in \{R,G,B\}} D^2(h^c_x, h^c_{\mathrm{ref}})/2\sigma_C^2\right). \qquad (17)$$
The assumption that the squared distance is exponentially distributed is based on empirical evidence gathered over a number of successful tracking runs. The histogram based definition of the color likelihood is summarized in Fig. 3.
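For concreteness, Eqs. (16)-(17) can be sketched as follows; the default value of $\sigma_C$ here is a placeholder, not the value used in the paper's experiments.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """D(h1, h2) of Eq. (16); h1 and h2 are normalized histograms."""
    bc = np.sum(np.sqrt(h1 * h2))
    return np.sqrt(max(1.0 - bc, 0.0))          # clip tiny negative round-off

def color_likelihood(cand_hists, ref_hists, sigma_C=0.1):
    """p(y^C | x) of Eq. (17), up to a constant. `cand_hists` and `ref_hists`
    are (3, B) arrays of per-channel histograms; sigma_C is a placeholder."""
    d2 = sum(bhattacharyya_distance(hc, hr) ** 2
             for hc, hr in zip(cand_hists, ref_hists))
    return np.exp(-d2 / (2.0 * sigma_C ** 2))
```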

If the object of interest is comprised of patches of distinct color, such as the face and clothes of a person, the histogram based color model will still successfully capture the color information. However, all the information about the relative spatial arrangement of these different patches will be lost. Keeping track of the coarse spatial layout of the distinct color regions may benefit the performance of the tracking algorithm. Such a goal is easily achieved within our modeling framework by splitting the region of interest into subregions, each with its own reference color model. More formally, we consider the partition $B_x = \cup_{j=1}^{N_R} B_{j,x}$, associated with the set of reference color histograms $\{h^c_{j,\mathrm{ref}} : c \in \{R, G, B\},\; j = 1, \dots, N_R\}$. The subregions are rigidly linked. It is, however, possible to introduce additional state variables to model the relative movement and scaling of the subregions, should this be necessary. By assuming conditional independence of the color measurements within the different subregions defined by the state $x$, the multi-region color likelihood becomes
$$p(y^C|x) \propto \exp\left(-\sum_{c \in \{R,G,B\}} \sum_{j=1}^{N_R} D^2(h^c_{j,x}, h^c_{j,\mathrm{ref}})/2\sigma_C^2\right), \qquad (18)$$
where the histogram $h_{j,x}$ is collected in the region $B_{j,x}$.

This color likelihood model is in contrast with the foreground-background models developed in [23], [55]. For any hypothesized state the latter models evaluate the pixels (or grid nodes) covered by the object under some reference foreground model, and the remaining pixels (or grid nodes) under location dependent background models. The likelihood for the scene is obtained by multiplying the individual pixel (or grid node) likelihoods under the independence assumption. Even though this type of likelihood is more principled, it suffers from numerical instabilities, and in our experience the histogram based color model proposed here is a more powerful tracking cue.

The histogram based color measurements can also be used to construct an efficient proposal for the particle filter, guiding it towards regions in the state-space that are characterized by a color distribution similar to that of the object of interest [21], [39]. In our setting such a distribution would be of the same form as the one for the motion measurements described in Section III-D.



Fig. 3. From color histograms to color likelihood. (Left) A three-fold reference color histogram $h_{\mathrm{ref}} = \{h^R_{\mathrm{ref}}, h^G_{\mathrm{ref}}, h^B_{\mathrm{ref}}\}$ is either gathered in an initial frame (within a region picked manually or automatically) or learned offline (e.g., skin color). (Middle) Later in the sequence, and for a hypothesized state $x$, the candidate color histogram $h_x$ is gathered within the region $B_x$ and compared to the reference histogram using the Bhattacharyya similarity measure. (Right) The exponentiated similarity yields the color likelihood, plotted here on a subsampled grid as a function of the location only (scale factor fixed to $\alpha = 1$). Note the ambiguity: strong responses arise over both faces and a section of the door.

However, due to the ambiguity inherent in the color measurements for the two tracking scenarios considered here, we prefer to use the more intermittent, but less ambiguous, motion and sound measurements to construct proposal distributions.

D. Motion Cues

Besides color, instantaneous motion activity captures other important aspects of the sequence content, and has been extensively studied from various perspectives. In particular, the problem of motion detection, i.e., the detection of objects moving relative to the camera, is covered by an abundant literature, which we shall not attempt to review here (see the review in [29]). In the case of a static camera, the basic ingredient at the heart of such an analysis is the absolute frame difference computed on successive pairs of images. This is the cue we consider here.

We propose to embed the frame difference information in a likelihood model similar to the one developed for the color measurements. It unifies the description and implementation, and ensures a similar order of magnitude for the two visual cues. Alternative models can easily be accommodated.

We will denote by $y^M_n$ the absolute difference of the luminances at times $n$ and $n-1$. As was the case for the color model, a measurement histogram $h^M_x = (h^M_{1,x}, \dots, h^M_{B,x})$ is associated with the region implied by the state $x$. The region $B_x$ within which the color information is collected will often lie inside the object of interest. In contrast, a large part of the motion activity generated by a moving object is concentrated along the silhouette of the object. To ensure that the silhouette is included we consider a larger region for the motion measurements, i.e., $\tilde{B}_x = (x, y, \alpha(w^\star + \eta), \alpha(h^\star + \eta))$, with $\eta$ set to a few pixel units (5 in our experiments). The construction or learning of a reference histogram model for the motion measurements is not a straightforward task. The amplitude of these measurements depends on both the appearance of the object (its contours) and its current motion. If the examined region contains no movement, all the measurements will fall in the lower bin of the histogram. When movement commences the measurements usually fall in all bins, with no definitive pattern: uniform regions produce low absolute frame difference values, whereas higher values characterize the contours (both the silhouette and the photometric edges). To accommodate these variations we chose the reference motion histogram $h^M_{\mathrm{ref}}$ to be uniform, i.e.,
$$h^M_{i,\mathrm{ref}} = \frac{1}{B}, \quad i = 1, \dots, B. \qquad (19)$$

Similar to the color likelihood in (17), we define the motion likelihood as
$$p(y^M|x) \propto \exp\left(-D^2(h^M_x, h^M_{\mathrm{ref}})/2\sigma_M^2\right). \qquad (20)$$
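A self-contained sketch of Eqs. (19)-(20) follows; the 10-bin quantization and the default $\sigma_M$ are placeholder choices, not the paper's parameter values.

```python
import numpy as np

def motion_distance2(frame_diff, state, ref_size, B=10, eta=5):
    """Squared Bhattacharyya distance D^2(h^M_x, h^M_ref) between the
    frame-difference histogram over the enlarged region and the uniform
    reference of Eq. (19). `frame_diff` is an (H, W) array in [0, 255]."""
    xc, yc, alpha = state
    w, h = alpha * (ref_size[0] + eta), alpha * (ref_size[1] + eta)
    H, W = frame_diff.shape
    x0, x1 = int(max(xc - w / 2, 0)), int(min(xc + w / 2, W))
    y0, y1 = int(max(yc - h / 2, 0)), int(min(yc + h / 2, H))
    vals = frame_diff[y0:y1, x0:x1].ravel()
    bins = np.minimum(vals.astype(np.int64) * B // 256, B - 1)
    hM = np.bincount(bins, minlength=B).astype(float)
    hM /= max(hM.sum(), 1.0)
    h_ref = np.full(B, 1.0 / B)                 # uniform reference, Eq. (19)
    bc = np.sum(np.sqrt(hM * h_ref))
    return max(1.0 - bc, 0.0)                   # D^2 = 1 - sum_i sqrt(h_i h'_i)

def motion_likelihood(frame_diff, state, ref_size, sigma_M=0.1):
    """p(y^M | x) of Eq. (20), up to a constant; sigma_M is a placeholder."""
    return np.exp(-motion_distance2(frame_diff, state, ref_size)
                  / (2 * sigma_M ** 2))
```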

The construction of this likelihood is illustrated in Fig. 4.

Motion Proposal: In the majority of cases the perceived motion of the object of interest in the image sequence will satisfy some minimum smoothness constraints. In these cases a proposal distribution that mimics the characteristics of the state evolution model should be sufficient for successful tracking. However, it often happens that lock is lost due to short periods of partial or total occlusion, and it is then necessary to reinitialize the tracker. In another setting, such as tele-surveillance, the objects of interest might be moving entities. In both cases it is useful to design a more sophisticated proposal distribution that exploits the motion measurements to allow jumps in the state-space to regions of significant motion activity.

We build such a proposal distribution by evaluating the histogram similarity measure on a subset of locations over the image, keeping the scale factor fixed to $\alpha = 1$. These locations are taken as the nodes of a regular grid for which the step size depends on the affordable computational load. Typically we used a step size of 10 pixel units. Using simple thresholding, locations for which the similarity $1 - D^2(h^M_{(x,y,1)}, h^M_{\mathrm{ref}})$ exceeds a threshold $\tau$ are retained (cf. Fig. 5). Based on these locations of high motion activity, denoted by $p_i = (x_i, y_i)$, $i = 1, \dots, N_M$, we define a mixture proposal for the object location $(x, y)$ as
$$q^M(x_n, y_n \mid x_{n-1}, y_{n-1}, y^M_n) = \beta_{RW}\, N\big((x_n, y_n) \mid (x_{n-1}, y_{n-1}), (\sigma_x^2, \sigma_y^2)\big) + \frac{1 - \beta_{RW}}{N_M} \sum_{i=1}^{N_M} N\big((x_n, y_n) \mid p_i, (\sigma_x^2, \sigma_y^2)\big). \qquad (21)$$

The first component is the same Gaussian random walk used for the $(x, y)$ location in the state evolution model in (14), and ensures the smoothness of the motion trajectories. The second component is a mixture centered on the detected locations of high motion activity. When no such location is found ($N_M = 0$), the weight of the random walk component is set to one.
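The sketch below illustrates drawing a location from the mixture proposal (21); it reuses `motion_distance2()` from the previous sketch, and the default values of `beta_rw`, `tau`, `step` and `sig` are illustrative assumptions. Computing the corresponding proposal density, needed in the weight update (4), follows directly from the mixture form.

```python
import numpy as np

def motion_proposal_sample(p_prev, frame_diff, ref_size, rng,
                           beta_rw=0.5, tau=0.9, step=10, sig=(5.0, 5.0)):
    """Draw a location (x_n, y_n) from the mixture proposal q^M of Eq. (21)."""
    H, W = frame_diff.shape
    # Detect high-activity grid nodes: similarity 1 - D^2 above tau (cf. Fig. 5)
    peaks = [(x, y)
             for y in range(step, H, step) for x in range(step, W, step)
             if 1.0 - motion_distance2(frame_diff, (x, y, 1.0), ref_size) > tau]
    if not peaks or rng.random() < beta_rw:
        # Gaussian random walk component, as in the evolution model (14)
        return p_prev + np.asarray(sig) * rng.standard_normal(2)
    center = np.asarray(peaks[rng.integers(len(peaks))], dtype=float)
    return center + np.asarray(sig) * rng.standard_normal(2)
```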



Fig. 4. From frame difference histograms to motion likelihood. (Left) Uniform reference histogram. (Middle) At a given instant of the sequence, and for a hypothesized state $x$, the candidate motion histogram $h^M_x$ is gathered within the extended region $\tilde{B}_x$ and compared to the reference histogram using the Bhattacharyya similarity measure. (Right) The exponentiated similarity yields the motion likelihood, plotted here on a subsampled grid as a function of the location only (scale factor fixed to $\alpha = 1$).


Fig. 5. From frame difference histograms to motion proposal. (Left) Absolute luminance frame difference at a given instant. (Middle) Similarity measure between the candidate histogram and the uniform reference histogram, plotted on a subsampled grid as a function of the location only (scale factor fixed to $\alpha = 1$). Locations of high motion activity are detected by thresholding this function ($\tau = 0.9$), and indicated with crosses. (Right) Mixture of Gaussians around the high motion activity locations, forming the motion based proposal distribution.

Fig. 5 illustrates the construction of this motion based proposal distribution.

E. Sound Cues

This section describes the sound localization cues, following [54]. As is the case for the motion cues, the sound cues are intermittent, but can be very discriminating when present. The sound localization cues are obtained by measuring the Time Delay of Arrival (TDOA) between the audio signals arriving at the two microphones comprising the pair. Due to the configuration of the system, the TDOA gives an indication of the bearing of the sound source relative to the microphone pair, which can in turn be related to a horizontal position in the image. In what follows we first describe the TDOA measurement procedure. We then derive a likelihood model for the TDOA measurements. Finally, we develop an efficient TDOA based proposal for the particle filter, based on an inversion of the likelihood model. This proposal is especially useful for initialization and for recovering lock in cases where track is lost during brief periods of partial or total occlusion. It can also be used to shift the focus between different speakers as they take turns in a conversation.

1) TDOA Measurement Process: Numerous strategies are available to measure the TDOA between the audio signals arriving at spatially separated microphones. For our tracking application the TDOA estimation strategy is required to be computationally efficient, and should not make strong assumptions about the number of audio sources and the exact characteristics of the audio signals. One popular strategy that satisfies these requirements involves the maximization of the Generalized Cross-Correlation Function (GCCF) [5], [27]. This strategy, with various modifications, has been applied with success in a number of sound source localization systems, e.g., [42], [47], [51], [57]. We subsequently adopt it to obtain TDOA estimates for our tracking algorithm. Computation of the GCCF is described at length in [5], [27], and we will omit the detail here.

The GCCF is essentially the correlation function between pre-whitened versions of the signals arriving at the microphones. The positions of the peaks in the GCCF can then be taken as estimates of the TDOAs for the sound sources within the acoustic environment of the microphones. Apart from the true audio sources, "ghost sources" due to reverberation also contribute to the peaks in the GCCF. Rather than attempting to remove these by further signal processing, we retain all the peaks above a predefined threshold as candidates for the true TDOA. More specifically, for any frame the sound measurement vector can be denoted as $y^S = (D_1, \dots, D_{N_S})$, with $D_i \in \mathcal{D} = [-D_{\max}, D_{\max}]$, $i = 1, \dots, N_S$. The maximum TDOA that can be measured is easily obtained from the microphone separation $d$ and the value used for the speed of sound $c$ (normally 342 m s$^{-1}$) as $D_{\max} = d/c$. Note that the number of TDOA measurements varies with time, and quite frequently no TDOA measurements will be available. To cope with the ambiguity due to the presence of multiple candidates for the true TDOA we develop a multi-hypothesis likelihood model, which is described next.
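The paper only specifies that the GCCF is computed from pre-whitened signals and that peaks above a threshold are retained; the sketch below uses the common PHAT weighting as one concrete choice of pre-whitening, and the relative peak threshold is an illustrative assumption.

```python
import numpy as np

def gcc_tdoa_candidates(sig_l, sig_r, fs, d, c=342.0, rel_thresh=0.75):
    """Candidate TDOAs D_1..D_NS from a GCC with PHAT-style pre-whitening
    (an assumed, common choice). Peaks above rel_thresh * max are kept."""
    n = len(sig_l) + len(sig_r)
    L, R = np.fft.rfft(sig_l, n), np.fft.rfft(sig_r, n)
    cross = L * np.conj(R)
    gccf = np.fft.irfft(cross / np.maximum(np.abs(cross), 1e-12), n)
    gccf = np.fft.fftshift(gccf)                 # zero lag in the middle
    lags = (np.arange(n) - n // 2) / fs          # lag in seconds
    d_max = d / c                                # admissible range [-Dmax, Dmax]
    mask = np.abs(lags) <= d_max
    g, l = gccf[mask], lags[mask]
    thresh = rel_thresh * g.max()
    # local maxima above the threshold are the candidate TDOAs
    return [l[i] for i in range(1, len(g) - 1)
            if g[i] > thresh and g[i] >= g[i - 1] and g[i] >= g[i + 1]]
```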

2) TDOA Likelihood Model: In this section we develop a multi-hypothesis likelihood model for the TDOA measurements. Similar likelihood models have been developed before for radar based tracking in clutter [17], and for tracking single [20] and multiple [32] objects in video sequences.


The TDOA measurements depend only on the $x$ position in the image, i.e., $p(y^S|\mathbf{x}) = p(y^S|x)$. Furthermore, for any hypothesis of the object $x$ position in the image, a deterministic hypothesis for the TDOA can be computed as
$$D_x = g(x) = g_3 \circ g_2 \circ g_1(x), \qquad (22)$$
with
$$\tilde{x} = g_1(x) = \widetilde{W}\,(x/W - 0.5), \qquad \theta = g_2(\tilde{x}) = \arctan(f/\tilde{x}), \qquad D_x = g_3(\theta) = D_{\max} \cos\theta. \qquad (23)$$
The function $g_1$ is a simple linear mapping that relates the $x$ position in the image to the corresponding $\tilde{x}$ position in the camera image plane. The width of the image $W$ is measured in pixels, whereas the width of the image plane $\widetilde{W}$ is measured in metric units. Using a pinhole model for the camera, with $f$ the focal length, the function $g_2$ then relates the $\tilde{x}$ position in the camera image plane to the sound source bearing. Finally, the function $g_3$ makes use of the Fraunhofer approximation to relate the sound source bearing to the hypothesized TDOA. Since the mapping $g$ is entirely deterministic, the likelihood can be written as $p(y^S|x) = p(y^S|D_x)$. We will use this latter form of the likelihood in the exposition below.
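For reference, a minimal sketch of the mapping $g$ of Eqs. (22)-(23) and of its inverse $g^{-1}$ (used later by the TDOA based proposal) is given below; the small clipping constants are numerical safeguards added for the example.

```python
import numpy as np

def tdoa_from_x(x, W, W_tilde, f, d_max):
    """D_x = g(x) = g3(g2(g1(x))), Eqs. (22)-(23)."""
    x_plane = W_tilde * (x / W - 0.5)            # g1: pixels -> image plane
    theta = np.arctan2(f, x_plane)               # g2: bearing of the source
    return d_max * np.cos(theta)                 # g3: Fraunhofer approximation

def x_from_tdoa(D, W, W_tilde, f, d_max):
    """Inverse mapping g^{-1}: TDOA -> horizontal pixel position."""
    theta = np.arccos(np.clip(D / d_max, -1.0, 1.0))
    theta = np.clip(theta, 1e-3, np.pi - 1e-3)   # avoid tan(0) / tan(pi)
    x_plane = f / np.tan(theta)                  # invert g2
    return W * (x_plane / W_tilde + 0.5)         # invert g1
```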

We assume the candidate TDOA measurements to be independent, so that the likelihood can be factorized as
$$p(y^S|D_x) = \prod_{i=1}^{N_S} p(D_i|D_x). \qquad (24)$$

In practice, however, clutter measurements due to reverberation are expected to be at least somewhat coherent with the true source, thus violating the independence assumption. Accurate modeling of reverberation requires detailed knowledge about the composition and acoustic properties of the environment, which is difficult to obtain in practice, and is thus not attempted here. Nevertheless, the model still performed well, as we will demonstrate in Section IV.

Of the TDOA measurements at most one is associated with the true source, while the remainder is associated with clutter. To distinguish between the two cases we introduce a classification label $c_i$, such that $c_i = T$ if $D_i$ is associated with the true source, and $c_i = C$ if $D_i$ is associated with clutter. The likelihood for a measurement from the true source is taken to be
$$p(D_i|D_x, c_i = T) = c_x\, N(D_i|D_x, \sigma_D^2)\, I_{\mathcal{D}}(D_i), \qquad (25)$$
where $I_A(\cdot)$ denotes the indicator function for the set $A$, and the normalizing constant $c_x$ is given by
$$c_x = 2\left(\operatorname{erf}\!\left(\frac{D_{\max} - D_x}{\sqrt{2}\,\sigma_D}\right) - \operatorname{erf}\!\left(\frac{-D_{\max} - D_x}{\sqrt{2}\,\sigma_D}\right)\right)^{-1}, \qquad (26)$$
with $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\, dt$ the Gaussian error function. Thus within the range of admissible TDOA values the measurement is assumed to be the true TDOA corrupted by additive Gaussian observation noise of variance $\sigma_D^2$. Empirical studies proved this to be a reasonable assumption, as the results in Fig. 6 show.

Fig. 6. TDOA measurement statistics. TDOA measurement error histograms and Gaussian approximations for a range of signal-to-noise levels (30, 15, and 0 dB) and reverberation times (0.05, 0.30, and 1.00 s).

Similar to what was done in [32] for example, the likelihood for measurements associated with clutter is taken to be
$$p(D_i|c_i = C) = U_{\mathcal{D}}(D_i). \qquad (27)$$
Thus the clutter is assumed to be uniformly distributed within the admissible interval, independent of the true source TDOA.

For $N_S$ measurements there are a total of $N_S + 1$ possible hypotheses. Either all the measurements are due to clutter, or one of the measurements corresponds to the true source, and the remainder to clutter. More formally,
$$H_0 = \{c_i = C : i = 1, \dots, N_S\}, \qquad H_i = \{c_i = T,\; c_j = C : j = 1, \dots, N_S,\; j \neq i\}, \qquad (28)$$
with $i = 1, \dots, N_S$. The likelihoods for these hypotheses follow straightforwardly from (24), and are given by
$$p(y^S|H_0) = U_{\mathcal{D}^{N_S}}(y^S), \qquad p(y^S|D_x, H_i) = c_x\, N(D_i|D_x, \sigma_D^2)\, I_{\mathcal{D}}(D_i)\, U_{\mathcal{D}^{N_S-1}}(y^S_{-i}), \qquad (29)$$
where $y^S_{-i}$ is $y^S$ with $D_i$ removed.

However, for any set of measurements the correct hypothesis is not known beforehand, and the final likelihood is obtained by summing over all the possible hypotheses, i.e.,
$$p(y^S|D_x) = \sum_{i=0}^{N_S} p(H_i|D_x)\, p(y^S|D_x, H_i), \qquad (30)$$
where $p(H_i|D_x)$ is the prior probability of the $i$-th hypothesis. In what follows we fix the prior probability of the all-clutter hypothesis to $P_c$, and set the prior probabilities of the remaining $N_S$ hypotheses to be equal. Under these assumptions the likelihood for the TDOA measurements finally becomes
$$p(y^S|D_x) \propto \frac{P_c}{2 D_{\max}} + \frac{c_x (1 - P_c)}{N_S} \sum_{i=1}^{N_S} N(D_i|D_x, \sigma_D^2)\, I_{\mathcal{D}}(D_i). \qquad (31)$$


In the case where no TDOA measurements are available the likelihood is simply set to $p(y^S|D_x) \propto 1$. The construction of the TDOA likelihood is illustrated in Fig. 7.
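A minimal sketch of the multi-hypothesis likelihood (31), including the no-measurement case, is shown below; the default values of $P_c$ and $\sigma_D$ are placeholders, not the paper's settings.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def tdoa_likelihood(tdoas, D_x, d_max, sigma_D=1e-4, P_c=0.1):
    """Multi-hypothesis TDOA likelihood p(y^S | D_x) of Eq. (31), up to a
    constant; `tdoas` is the list of candidate measurements D_1..D_NS."""
    if len(tdoas) == 0:
        return 1.0                               # no measurements: flat likelihood
    # normalizing constant c_x of the truncated Gaussian, Eq. (26)
    c_x = 2.0 / (erf((d_max - D_x) / (sqrt(2) * sigma_D))
                 - erf((-d_max - D_x) / (sqrt(2) * sigma_D)))
    gsum = sum(exp(-0.5 * ((D - D_x) / sigma_D) ** 2) / (sqrt(2 * pi) * sigma_D)
               for D in tdoas if abs(D) <= d_max)    # indicator I_D(D_i)
    return P_c / (2 * d_max) + c_x * (1 - P_c) / len(tdoas) * gsum
```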

3) TDOA Proposal: As was the case for motion, it is possible to use the sound localization cues to design an efficient proposal distribution for the particle filter. Such a proposal would allow the tracker to recover lock after brief periods of partial or total occlusion. In another setting the objects of interest might be speakers participating in a video tele-conference. In this case a sound based proposal can aid the tracker to switch focus between the speakers as they take turns in the conversation.

These objectives can be achieved by designing a proposal distribution for the object $x$ position that incorporates the TDOA measurements when they are available. Informally, the inverse of the mapping in (22) is easy to obtain. Each TDOA measurement can then be passed through the resulting inverse mapping $g^{-1}$ to yield a plausible $x$ position for the object. To capture the notion of smooth motion trajectories while exploiting the information in the TDOA measurements, we define a TDOA based proposal for the object $x$ position of the form
$$q^S(x_n|x_{n-1}, y^S_n) = \beta_{RW}\, N(x_n|x_{n-1}, \sigma_x^2) + \frac{1 - \beta_{RW}}{N_S} \left|\frac{dg(x_n)}{dx_n}\right| \sum_{i=1}^{N_S} N(g(x_n)|D_{i,n}, \sigma_D^2). \qquad (32)$$

The first component is the same Gaussian random walk used for the $x$ component in the state evolution model in (14), and ensures the smoothness of the motion trajectories. The second component is a mixture that incorporates the TDOA measurements, and is obtained by inverting the non-clutter part of the likelihood model in (31). All the TDOA measurements are equally weighted in the mixture. The derivative of the mapping $g$ is easily obtained by using the chain rule. The weight of the random walk component is set to one if no measurements are available, in which case no jumps are allowed in the object $x$ position.

Generating samples from the TDOA based proposal is straightforward. First a mixture component is picked by sampling from the discrete distribution defined by the mixture weights. Sampling from the random walk component is trivial. Sampling from the component for the $i$-th TDOA measurement can be achieved by first sampling a candidate delay according to $D_{x_n} \sim N(D_{x_n}|D_{i,n}, \sigma_D^2)$, and then passing the resulting delay through the inverse of the mapping in (22), i.e., $x_n = g^{-1}(D_{x_n})$.
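This sampling procedure can be sketched as follows, reusing `x_from_tdoa()` from the geometry sketch above; the default mixture weight `beta_rw` is an illustrative value.

```python
import numpy as np

def sample_tdoa_proposal(x_prev, tdoas, rng, sigma_x, sigma_D,
                         W, W_tilde, f, d_max, beta_rw=0.5):
    """Draw the object x position from the TDOA based proposal q^S of Eq. (32)."""
    if len(tdoas) == 0 or rng.random() < beta_rw:
        # Gaussian random walk component (no jump when no TDOA is available)
        return x_prev + sigma_x * rng.standard_normal()
    D_i = tdoas[rng.integers(len(tdoas))]        # pick one TDOA measurement
    D = rng.normal(D_i, sigma_D)                 # candidate delay ~ N(D_i, sigma_D^2)
    D = np.clip(D, -d_max, d_max)
    return x_from_tdoa(D, W, W_tilde, f, d_max)  # pass it through g^{-1}
```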

F. Tracker Architecture

We conclude this section by summarizing the composition of our tracking algorithm. We consider two main scenarios. The first, summarized in Tab. III, is a desktop setting such as that depicted in Fig. 2, where we use color as the main cue and fuse it with the information in the sound localization cue. Such a setting forms the basis for video tele-conferencing applications. The second setting, summarized in Tab. IV, is representative of surveillance and monitoring applications involving a static camera. Here sound localization cues will generally be absent, and we thus fuse color with the motion localization cue.

TABLE III
PARTICLE FILTER FOR VISUAL TRACKING BASED ON COLOR AND SOUND.

With $\{x^{(i)}_{n-1}, w^{(i)}_{n-1}\}_{i=1}^{N_p}$ the particle set at the previous time step, proceed as follows at time $n$:
- Proposition: simulate $x^{(i)}_n \sim q^S(x_n|x^{(i)}_{n-1}, y^S_n)$.
- Update weights: $w^{(i)}_n \propto w^{(i)}_{n-1}\, \dfrac{p(y^S_n|x^{(i)}_n)\, p(x^{(i)}_n|x^{(i)}_{n-1})}{q^S(x^{(i)}_n|x^{(i)}_{n-1}, y^S_n)}$, with $\sum_{i=1}^{N_p} w^{(i)}_n = 1$.
- Resample: simulate $a_i \sim \{w^{(k)}_n\}_{k=1}^{N_p}$, and replace $x^{(i)}_n \leftarrow x^{(a_i)}_n$.
- Proposition: simulate $(y^{(i)}_n, \alpha^{(i)}_n) \sim p(y_n, \alpha_n|y^{(i)}_{n-1}, \alpha^{(i)}_{n-1})$, and assemble $x^{(i)}_n \leftarrow (x^{(i)}_n, y^{(i)}_n, \alpha^{(i)}_n)$.
- Update weights: $w^{(i)}_n \propto p(y^C_n|x^{(i)}_n)$, with $\sum_{i=1}^{N_p} w^{(i)}_n = 1$.
- If resampling: simulate $a_i \sim \{w^{(k)}_n\}_{k=1}^{N_p}$, and replace $\{x^{(i)}_n, w^{(i)}_n\} \leftarrow \{x^{(a_i)}_n, 1/N_p\}$.

TABLE IV
PARTICLE FILTER FOR VISUAL TRACKING BASED ON COLOR AND MOTION. NOTATION $p$ STANDS FOR LOCATION $(x, y)$.

With $\{x^{(i)}_{n-1}, w^{(i)}_{n-1}\}_{i=1}^{N_p}$ the particle set at the previous time step, proceed as follows at time $n$:
- Proposition: simulate $p^{(i)}_n \sim q^M(p_n|p^{(i)}_{n-1}, y^M_n)$.
- Update weights: $w^{(i)}_n \propto w^{(i)}_{n-1}\, \dfrac{p(y^M_n|p^{(i)}_n, \alpha^{(i)}_{n-1})\, p(p^{(i)}_n|p^{(i)}_{n-1})}{q^M(p^{(i)}_n|p^{(i)}_{n-1}, y^M_n)}$, with $\sum_{i=1}^{N_p} w^{(i)}_n = 1$.
- Resample: simulate $a_i \sim \{w^{(k)}_n\}_{k=1}^{N_p}$, and replace $x^{(i)}_n \leftarrow x^{(a_i)}_n$.
- Proposition: simulate $\alpha^{(i)}_n \sim p(\alpha_n|\alpha^{(i)}_{n-1})$, and assemble $x^{(i)}_n \leftarrow (p^{(i)}_n, \alpha^{(i)}_n)$.
- Update weights: $w^{(i)}_n \propto p(y^C_n|x^{(i)}_n)$, with $\sum_{i=1}^{N_p} w^{(i)}_n = 1$.
- If resampling: simulate $a_i \sim \{w^{(k)}_n\}_{k=1}^{N_p}$, and replace $\{x^{(i)}_n, w^{(i)}_n\} \leftarrow \{x^{(a_i)}_n, 1/N_p\}$.

In both settings we employ a form of partitioned sampling. In the first setting the sound likelihood only gives information about the object $x$ coordinate in the image. We thus simulate and resample for this component first, before simulating new values for the remaining state components (the object $y$ coordinate and the scale factor $\alpha$) and resampling with respect to the color likelihood. This implies that we do not search directly in the three dimensional state-space, but rather partition the inference into a search in a one dimensional space, followed by another in a two dimensional space. In general this increases the efficiency of the particle filter, allowing us to achieve the same accuracy for a smaller number of particles. We follow a similar strategy in the second setting, where we fuse color and motion, in that we first simulate and resample the location parameters with respect to the motion likelihood, before simulating the scale factor $\alpha$ and resampling with respect to the color likelihood.
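To make the two-stage structure of Tab. III concrete, the sketch below chains the earlier sketches into one color-and-sound iteration; it reuses `sample_tdoa_proposal`, `tdoa_from_x`, `tdoa_likelihood`, `region_color_histograms` and `color_likelihood`, the dictionary `prm` of calibration and noise parameters is an assumed convention of the example, and the uniform component of (14) is omitted in the second-stage random walks for brevity.

```python
import numpy as np

def _gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (np.sqrt(2.0 * np.pi) * sig)

def color_sound_tracker_step(particles, weights, tdoas, image, ref_hists,
                             ref_size, prm, rng, beta_rw=0.5, beta_u=0.05):
    """One iteration of the color+sound tracker structure of Tab. III."""
    Np = len(particles)
    d_max = prm['d'] / prm['c']
    g = lambda u: tdoa_from_x(u, prm['W'], prm['W_tilde'], prm['f'], d_max)

    # --- Stage 1: horizontal position, proposed and weighted with the sound cue
    x_new = np.array([sample_tdoa_proposal(p[0], tdoas, rng, prm['sigma_x'],
                                           prm['sigma_D'], prm['W'],
                                           prm['W_tilde'], prm['f'], d_max,
                                           beta_rw) for p in particles])
    w = np.empty(Np)
    for i in range(Np):
        lik = tdoa_likelihood(tdoas, g(x_new[i]), d_max, prm['sigma_D'])
        # x-marginal of the evolution model (14): random walk + uniform over [0, W]
        prior = (1 - beta_u) * _gauss(x_new[i], particles[i][0], prm['sigma_x']) \
                + beta_u / prm['W']
        # proposal density q^S of Eq. (32); |dg/dx| by central differences
        if len(tdoas) == 0:
            q = _gauss(x_new[i], particles[i][0], prm['sigma_x'])
        else:
            jac = abs(g(x_new[i] + 0.5) - g(x_new[i] - 0.5))   # step 2*0.5 = 1 px
            q = beta_rw * _gauss(x_new[i], particles[i][0], prm['sigma_x']) \
                + (1 - beta_rw) * jac * np.mean([_gauss(g(x_new[i]), D,
                                                        prm['sigma_D'])
                                                 for D in tdoas])
        w[i] = weights[i] * lik * prior / q
    w /= w.sum()
    idx = rng.choice(Np, size=Np, p=w)           # resample on the sound weights
    parts = particles[idx].copy()
    parts[:, 0] = x_new[idx]

    # --- Stage 2: (y, alpha) from their random walks, weighted by the color cue
    parts[:, 1] += prm['sigma_y'] * rng.standard_normal(Np)
    parts[:, 2] += prm['sigma_alpha'] * rng.standard_normal(Np)
    w = np.array([color_likelihood(region_color_histograms(image, p, ref_size),
                                   ref_hists, prm['sigma_C']) for p in parts])
    w /= w.sum()
    return parts, w
```

The color-and-motion filter of Tab. IV has the same two-stage structure, with the motion proposal and motion likelihood replacing the sound terms in the first stage and only the scale factor simulated in the second stage.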



Fig. 7. From sound measurements to TDOA likelihood. (Left) A stereo pair of 4096 sound samples corresponding to the 1/30 s of one video frame. (Middle) GCCF of the two stereo signals, and detected peak. (Right) Associated likelihood as a function of the TDOA, and of the horizontal pixel coordinate in the image frame, respectively.

IV. DEMONSTRATIONS

In this section we will demonstrate the performance of our tracking algorithm on a number of challenging video sequences. We will first consider the behavior of the tracker when using each of the cues in isolation, and then show how the shortcomings of such single modality trackers can be eliminated by fusing the information from multiple cues. We will use the values in Tab. V for the fixed parameters of the likelihood, proposal and state evolution models. Due to the efficiency of the motion and sound proposals, and the relatively low dimensionality of the state-space, good tracking results can be achieved with a reasonably small number of particles. For our experiments we use $N_p = 100$ particles.

A. Single Modality Tracking

1) Color Only: Since the color of an object is its most persistent feature, we use it as the main cue for our tracking algorithm. As is the case for the deterministic color based trackers proposed in [4], [6], [9], our probabilistic tracker using color only is able to robustly track objects undergoing complex changes of shape and appearance. This is exemplified by the sequence of the football player in the top row of Fig. 8. The robustness of the tracker to color clutter and large movements is further increased by its ability to incorporate multi-region color models, as is illustrated by the home video example in the bottom row of Fig. 8.

The downside of the color persistence is its lack of discriminating power in certain settings. In cases where the background contains objects of similar color characteristics to the object of interest, the likelihood will exhibit a number of modes. A typical example is given in the office scene in Fig. 3, where the wooden door is close in color space to the faces of the subjects being tracked. Under these circumstances the ambiguities might lead to inaccurate tracking, or in the worst case, a complete loss of lock.

In these scenarios the robustness of the tracking algorithm can be greatly increased by fusing color cues with other cues exhibiting complementary properties, i.e., with lower persistence, but being less prone to clutter. As discussed before, motion and sound are two such cues. The former will prove to be a valuable addition to color in static camera settings, such as tele-surveillance and smart environment monitoring, whereas the latter will combine with color in tele-conferencing applications, where a calibrated stereo microphone pair can easily be added to the broadcasting or recording equipment. We will first demonstrate the power and limitations of motion and sound as single modality tracking cues, and then investigate the two fusion scenarios.

2) Motion Only: In this section we illustrate the behavior of the tracker using the frame difference motion cue only. With this cue moving objects are tracked with a reasonable degree of accuracy, as the result for the sequence in Fig. 9 indicates. However the motion cue is intermittent, and when motion ceases there is no more localization information, and the particles diffuse under their dynamics, as is illustrated in Fig. 10. We will see in Section IV-B.1 how the fusion of motion with color effectively eliminates these shortcomings.

3) Sound Only: In this section we illustrate the behavior of the sound only tracker, i.e., tracking only the horizontal position of a speaker in the image, based on the TDOA measurements obtained from the stereo microphone pair. As will be evident, this cue not only allows localization in the x coordinate, but also endows the tracker with the ability to switch focus between speakers as they take turns in the conversation.

We consider two sequences, each featuring two subjects in discussion in front of a camera. In the first sequence the environment is relatively noise-free, and speech is consistently detected when present. Fig. 11 shows snapshots of the tracking result for this sequence. It is clear that the sound based proposal defined in (32) allows the tracker to maintain focus on the current speaker to a sufficient degree of accuracy. It also facilitates the switching of focus between the speakers as they alternate in conversation. Note that the tracking is accurate even beyond the limit of the image plane, as is evident for the subject on the right.

This result is even more encouraging given that it was obtained using low cost, off-the-shelf equipment. Extreme care in the placement of the microphones relative to the camera was not required. The system was only roughly calibrated, and proved to be robust to the exact values chosen for the intrinsic parameters of the camera. Furthermore, no explicit attempts were made to compensate for the reverberation or background noise.

In the second sequence, for which snapshots are given in Fig. 12, the signal-to-noise ratio is lower due to a higher level of air-conditioner noise in the background, and the subjects being further from the microphones. Due to the higher noise level speech is often undetected (NS = 0). This is further


standard deviations of dynamics          (σx, σy, σα) = (5, 5, 10^−2)
uniform weight in dynamics               βu = 0.01
random walk weight in proposals          βRW = 0.75
standard deviations of likelihoods       (σC, σM, σD) = (2 × 10^−2, 2 × 10^−2, 10^−5)
clutter probability in sound likelihood  Pc = 10^−3
speed of sound                           c = 342 m s^−1
microphone separation                    d = 0.2 m
focal length                             f = 0.02 m
width of the image plane                 fW = 0.01 m

TABLE V
MODEL PARAMETERS FOR THE TRACKING EXPERIMENTS.

Fig. 8. Color based tracking with single and multi-region reference models. Using a global color reference model generated from a hand-selected region in the initial frame, a region of interest (player 75 in the top sequence, the child in the bottom sequence) can be tracked robustly despite large motions, significant motion blur, dramatic shape changes, partial occlusions, and distracting colors in the background (other players in the top sequence, the sand and the face of the mother in the bottom sequence). In the top sequence a single region was used, whereas two regions, corresponding to the face and shirt of the child, were used in the bottom sequence. The yellow box in each frame indicates the MMSE estimate.

Fig. 9. Tracking moving objects with motion cues only (frames 9, 74, 227, 344). The uniformly initialized particle filter locks on to the moving vehicle a few frames after it enters the scene, thanks to the motion based proposal. The tracker maintains lock on the moving vehicle as it drives through the parking lot. The yellow box represents the MMSE estimate.

Fig. 10. Intermittence of motion cues (frames 513, 524, 535). When the vehicle tracked in the sequence of Fig. 9 eventually stops, the motion cues disappear, and the particles diffuse under their dynamics. The rectangles indicate the hypothesized regions before resampling, with the yellow rectangles depicting the ten best particles.


Fig. 11. Sound based tracking under accurate speech detection (frames 1, 64, 212, 251, 274, 298, 338, 398). After the first period of silence (frames 1 to 17), where the uniformly initialized particles keep on diffusing, the sound based tracker consistently tracks the horizontal position of the current speaker. The vertical segments indicate the positions of all the particles at the current time step, with the yellow segments depicting the ten best particles. Only particles falling inside the image are displayed. Hence the particle cloud is partly to completely invisible when the tracker correctly follows the speaker on the right exiting and re-entering the field of view of the camera (e.g., frames 251, 274, 298).

exemplified by the graphs in Fig. 13 that show the speech detections for each speaker in relation to the ground truth. These detection failures result in a rapid diffusion of the particles, as is evident from a number of snapshots in Fig. 12.

Since the sound cue lacks persistence, either through detection failures or the absence of speech, the sound based tracker is unable to provide consistent tracking in time. In addition, the sound cue gives no information about the vertical location y or scale factor α of the object being tracked. We will see in Section IV-B.2 how the fusion of sound with color will solve these problems, while retaining the desirable properties of the sound localization cue.

B. Multiple Modality Tracking

1) Color and Motion: As we have mentioned in Section IV-A.1, the greatest weakness of the color localization cue is the ambiguity that results from the presence in the scene of objects or regions with color features similar to those of the object of interest. An extreme example where tracking based on color only fails completely is given in Fig. 14.

By fusing the color and motion cues the ambiguity can be greatly reduced if the object of interest is indeed moving. This is exemplified by the likelihood maps for each of the individual cues and the combination of the cues in Fig. 15. As a further illustration, the combination of color and motion allows accurate tracking of the vehicle in Fig. 16, where color only tracking failed. During periods of motion the tracker utilizes mainly the information in the motion cue, while relying on the color information for the localization when the motion ceases at the end of the sequence.
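The maps of Fig. 15 follow directly from the conditional independence of the cues: the fused likelihood is the product of the color and motion likelihoods. A minimal sketch, with the two cue likelihoods left as placeholder callables (not the paper's implementation):

```python
import numpy as np

def fused_likelihood_map(xs, ys, color_lik, motion_lik, alpha=1.0):
    """Fused color/motion likelihood over a grid of locations, with the scale
    factor fixed, as in the maps of Fig. 15 (illustrative sketch).

    With conditionally independent cues the fused likelihood is the product
    p(y^C, y^M | x) = p(y^C | x) * p(y^M | x); color_lik and motion_lik are
    assumed callables over a state (x, y, alpha).
    """
    fused = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            state = (x, y, alpha)
            fused[i, j] = color_lik(state) * motion_lik(state)
    return fused
```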

The utility of the motion based proposal is illustrated by the tracking results on the sequence in Fig. 17. In this case the localization information is ambiguous in both the color and motion cues, when considered individually. Without the event based motion proposal a uniformly initialized tracker simply settles on one of the spurious modes of the combined likelihood. With the help of the motion based proposal, however, the tracker is able to lock on to the target (face) as soon as it enters the scene. Even though the motion based proposal continues to generate hypotheses in all the regions of significant motion activity (face, torso and monitor), those that do not comply with the color likelihood are eliminated during the resampling procedure, so that lock is maintained, as is shown in Fig. 18.

2) Color and Sound: To demonstrate the fusion of color and sound we consider again the second sequence presented in Section IV-A.3. We now initialize a reference color model composed of three subregions (for better positioning accuracy) on the face of the left subject, as depicted in Fig. 19. Using color only, the particle filter, after a uniform initialization, locks on to one of the subjects at random, and maintains lock on this subject throughout the video sequence, as is shown in Fig. 19.

By incorporating the sound localization cues, the tracker is able to jump between the subjects as they take turns in the conversation, as is depicted in Fig. 20. During the absence of sound cues, either through detection failures or periods of silence, the tracker maintains accurate lock on the last active speaker due to the information in the color cues.

V. CONCLUSIONS

In this paper we introduced generic mechanisms for data fusion within particle filtering, and used them to develop a particle filter based tracker combining color, motion and sound localization cues in a novel way. Color, being the most persistent, was used as the main cue for tracking. The ambiguity inherent in color representations was resolved by the more intermittent, but less ambiguous, motion and stereo sound cues. We considered two main scenarios. In the first, color was combined with sound in a desktop setting that forms the basis for video tele-conferencing applications. In the second, representative of more general tracking applications such as surveillance, we fused color with the motion localization cue.


Fig. 12. Sound based tracking under sporadic speech detection (frames 5, 18, 47, 62, 79, 191, 200, 214, 219, 222, 252, 284). Due to the low signal-to-noise ratio speech is only occasionally detected (e.g., frames 18, 62, 191, 214, 219, 252). The remainder of the time no localization cues are available, and the particle filter simply diffuses under its dynamics (e.g., frames 47, 79, 200, 222, 284).

Fig. 13. Speech detection against ground truth for the two subjects of the sequence in Fig. 12. The left (resp. right) graph concerns the subject on the left (resp. right) in the scene. The lines indicate the speech detections, which all correspond to correct horizontal localization of the speaker in the image. The shaded area indicates the hand-labeled ground truth.

Fig. 14. Color tracking under ambiguity (frames 9, 74, 227, 344; the annotation in each frame marks the car that should be tracked). The reference color histogram, generated from a region selected by hand around the moving vehicle in frame 74, leads to a high degree of ambiguity. Soon after a uniform initialization, the tracking algorithm gets stuck on a very strong local mode of the color likelihood, corresponding to one of the parked vehicles. The yellow box represents the MMSE estimate.

Fig. 15. Color and motion compound likelihood. Panels, left to right: p(yC | (x, y, 1)), p(yM | (x, y, 1)), p(yC, yM | (x, y, 1)). Head movement, as captured by the motion likelihood, helps to reduce the ambiguity inherent in the color cues, as is exemplified by the three likelihood maps: color likelihood, motion likelihood, and the product of the two likelihoods, plotted as a function of the location only (scale factor fixed to α = 1).

In both scenarios the combination of cues proved to be more robust than any of the cues individually.

As is standard practice, we defined independent likelihood models for each of the localization cues. We also constructed mixture proposal distributions based on the motion and sound cues, whenever these were available. These proposals act as detection and initialization modules for the particle filter. They also facilitate a more efficient exploration of the state-space, and aid the recovery of lock following periods of partial or complete occlusion. Within the context of multi-object tracking such event based proposals are essential for the detection of new objects when they appear in the scene. We also used different stratified sampling procedures to increase the efficiency of the particle filter. These layered procedures effectively substitute the difficult estimation problem in the complete state-space with a sequence of easier estimation problems, each in a reduced space.
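As a rough illustration of such an event based mixture proposal (the paper's exact form, e.g. Eq. (32), is not reproduced here), a new sample can be drawn either from a random walk around the previous state or from the vicinity of a detection supplied by the intermittent cue:

```python
import numpy as np

def mixture_proposal(x_prev, detections, rng, beta_rw=0.75, sigma=5.0):
    """Draw one sample from a mixture of a random walk around the previous state
    and a detection-driven component (illustrative sketch, not the paper's Eq. (32)).

    x_prev: previous particle value; detections: candidate locations from the
    intermittent cue (sound or motion); beta_rw matches the Table V random walk weight.
    """
    if len(detections) == 0 or rng.random() < beta_rw:
        # Random-walk component: diffuse around the previous state.
        return x_prev + sigma * rng.standard_normal()
    # Detection component: jump to the vicinity of a detected candidate.
    d = detections[rng.integers(len(detections))]
    return d + sigma * rng.standard_normal()
```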

This combination of appropriate conditionally independent data models, event based proposal densities, and layered importance sampling could now be extended to other visual tracking scenarios involving the fusion of other information, such as shape cues when a predefined class of object has to be tracked.

Although not considered in this paper, the fusion of multiple measurement modalities is an essential requirement in adaptive systems. In such scenarios a high degree of confidence in one


Fig. 16. Fusing color and motion (frames 4, 9, 74, 227, 344, 535). As was the case in Fig. 14, the reference color model is generated from a region selected by hand around the moving vehicle in frame 74. The motion based proposal allows the tracker to lock on to the moving vehicle as soon as it enters the scene. Lock is maintained throughout the period of motion, mostly due to the presence of the motion cues. When motion ceases towards the end of the sequence, the tracker relies on the color cues to maintain lock, in contrast to the motion only tracking in Fig. 10. The yellow box represents the MMSE estimate.

Fig. 17. Using color and motion cues with the motion based proposal (frames 8, 12, 38, 59; rows, top to bottom: motion cues; color cues; color and motion cues; color and motion cues with the motion based proposal). (Top) In this example of a person traversing the field of view, the motion cues are very ambiguous due to the activity around the face and torso of the moving person, and around the computer monitor. (Second row) The reference color model, initialized on the face of the person, leads to additional ambiguities. After a uniform initialization the particles settle on a spurious local mode of the color likelihood. It is only towards the end of the sequence, when the face approaches this region, that the particles lock on to the desired target. (Third row) Combining the color and motion modalities, while retaining the smooth proposal, does not alter the behavior of the color only tracker significantly (the particles move to the face a few frames earlier). (Bottom) In contrast, the combined tracker with the motion based proposal allows the tracker to lock on to the face as soon as it enters the scene, and to track it throughout. All regions of high motion activity are continually explored (e.g., face, torso and monitor), but those that do not comply with the color model are discarded during the resampling procedure. The rectangles indicate the hypothesized regions before resampling, with the yellow rectangles depicting the ten best particles.


Fig. 18. Using color and motion cues with the motion based proposal (frames 12, 38, 50, 59, 70). MMSE result for the tracker that combines the color and motion localization cues with the motion based proposal on the sequence in Fig. 17.

Fig. 19. Two different runs of the color tracker (frames 1, 2, 7). The reference color model is defined on the three-fold selection on the left in frame 1. After a uniform initialization the tracker rapidly locks on to one of the subjects at random, and maintains lock on this subject throughout the sequence. The rectangles indicate the bounding boxes for the particles, with the yellow rectangles depicting the ten best particles.

Fig. 20. Fusing color and sound. The tracking result for the fusion of color and sound cues on the same frames as in Fig. 12. The sound based proposal allows the tracker to jump between the subjects as they alternate in conversation (frames 18, 62, 191, 214, 219, 252). In the absence of sound cues lock is maintained on the last active speaker due to the information in the color cues. The yellow rectangle in each of the frames depicts the MMSE estimate of the particle bounding box.

modality facilitates the adaptation of the model parameters associated with complementary modalities, while simultaneously minimizing the risk of learning a wrong appearance model, such as the background [8], [24], [55], [59]. Such adaptive systems are most useful in multi-object tracking systems where it may be desirable to individualize a generic model to each of the objects in the scene.

ACKNOWLEDGMENT

The authors would like to thank Michel Gangnet for his contribution in the experimental phase of the investigation.

REFERENCES

[1] B. Anderson and J. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, 1979.
[2] A. Blake, R. Curwen, and A. Zisserman. A framework for spatiotemporal control in the tracking of visual contours. Int. J. Computer Vision, 11(2):127–145, 1993.
[3] A. Blake and M. Isard. Active Contours. Springer, London, 1998.
[4] G. Bradski. Computer vision face tracking as a component of a perceptual user interface. In Workshop on Applications of Computer Vision, pages 214–219, Princeton, NJ, Oct. 1998.
[5] G. Carter. Coherence and time delay estimation. Proceedings of the IEEE, 75(2):236–255, 1987.
[6] H. Chen and T. Liu. Trust-region methods for real-time tracking. In Proc. Int. Conf. Computer Vision, pages II:717–722, Vancouver, Canada, July 2001.
[7] K. Choo and D.J. Fleet. People tracking using hybrid Monte Carlo filtering. In Proc. Int. Conf. Computer Vision, pages II:321–328, Vancouver, Canada, July 2001.
[8] R. Collins and Y. Liu. On-line selection of discriminative tracking features. In Proc. Int. Conf. Computer Vision, pages 346–352, Nice, France, October 2003.
[9] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In Proc. Conf. Comp. Vision Pattern Rec., pages II:142–149, Hilton Head, SC, June 2000.
[10] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Trans. Pattern Anal. Machine Intell., 25(5):564–577, 2003.
[11] J. Crowley, J. Coutaz, and F. Bérard. Things that see: Machine perception for human computer interaction. Communications of the ACM, 43(3):54–64, 2000.
[12] S. Das Peddada and R. McDevitt. Least average residual algorithm (LARA) for tracking the motion of arctic sea ice. IEEE Trans. Geoscience and Remote Sensing, 34(4):915–926, 1996.
[13] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the Condensation algorithm for robust, vision-based mobile robot localization. In Proc. Conf. Comp. Vision Pattern Rec., pages II:588–594, Fort Collins, CO, June 1999.


[14] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.
[15] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[16] J. Geweke. Bayesian inference in econometrics models using Monte Carlo integration. Econometrica, 57:1317–1339, 1989.
[17] N. Gordon. A hybrid bootstrap filter for target tracking in clutter. IEEE Transactions on Aerospace and Electronic Systems, 33(1):353–358, 1997.
[18] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2):107–113, 1993.
[19] I. Haritaoglu, D. Harwood, and L. Davis. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Machine Intell., 22(8):809–830, 2000.
[20] M. Isard and A. Blake. CONDENSATION–conditional density propagation for visual tracking. Int. J. Computer Vision, 29(1):5–28, 1998.
[21] M. Isard and A. Blake. ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework. In Proc. Europ. Conf. Computer Vision, pages 893–908, 1998.
[22] M. Isard and A. Blake. A smoothing filter for Condensation. In Proc. Europ. Conf. Computer Vision, pages I:767–781, 1998.
[23] M. Isard and J. MacCormick. BraMBLe: a Bayesian multiple-blob tracker. In Proc. Int. Conf. Computer Vision, pages II:34–41, Vancouver, Canada, July 2001.
[24] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi. Robust online appearance models for visual tracking. In Proc. Conf. Comp. Vision Pattern Rec., pages I:415–422, Kauai, Hawaii, December 2001.
[25] C. Kervrann and F. Heitz. A hierarchical Markov modeling approach for the segmentation and tracking of deformable shape. Graph. Mod. Image Proc., 60(3):173–195, 1998.
[26] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Computational and Graphical Stat., 5(1):1–25, 1996.
[27] C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24(4):320–327, 1976.
[28] E. Koller-Meier and F. Ade. Tracking multiple objects using the Condensation algorithm. Journal of Robotics and Autonomous Systems, 34(2-3):93–105, 2001.
[29] J. Konrad. Motion detection and estimation. In A. Bovik, editor, Handbook of Image and Video Processing, pages 207–225. Academic Press, 2000.
[30] F. Leymarie and M. Levine. Tracking deformable objects in the plane using an active contour model. IEEE Trans. Pattern Anal. Machine Intell., 15(6):617–634, 1993.
[31] J. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93:1032–1044, 1998.
[32] J. MacCormick and A. Blake. Probabilistic exclusion and partitioned sampling for multiple object tracking. International Journal of Computer Vision, 39(1):57–71, 2000.
[33] J. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. In Proc. Europ. Conf. Computer Vision, pages 34–41, Dublin, Ireland, June 2000.
[34] E. Malis, F. Chaumette, and S. Boudet. 2 1/2 D visual servoing. IEEE Trans. on Robotics and Automation, 15(2):238–250, 1999.
[35] E. Marchand and F. Chaumette. Virtual visual servoing: a framework for real-time augmented reality. In Proc. Eurographics, pages 289–298, Saarbrücken, Germany, September 2002.
[36] T. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
[37] C. Papin, P. Bouthemy, and G. Rochard. Unsupervised segmentation of low clouds from infrared meteosat images based on a contextual spatio-temporal labeling approach. IEEE Trans. on Geoscience and Remote Sensing, 40(1):104–114, 2002.
[38] A. Pentland. Looking at people: sensing for ubiquitous and wearable computing. IEEE Trans. Pattern Anal. Machine Intell., 22(1):107–119, 2000.
[39] P. Perez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proc. Europ. Conf. Computer Vision, pages I:661–675, Copenhagen, Denmark, May 2002.
[40] N. Peterfreund. Robust tracking of position and velocity with Kalman snakes. IEEE Trans. Pattern Anal. Machine Intell., 21(6):564–569, 1999.
[41] V. Philomin, R. Duraiswami, and L.S. Davis. Quasi-random sampling for Condensation. In Proc. Europ. Conf. Computer Vision, pages 134–149, Dublin, Ireland, June 2000.
[42] D. Rabinkin, R. Renomeron, A. Dahl, J. French, J. Flanagan, and M. Bianchi. A DSP implementation of source location using microphone arrays. In Proceedings of the SPIE, volume 2846, pages 88–99, 1996.
[43] Y. Rui and Y. Chen. Better proposal distributions: Object tracking using unscented particle filter. In Proc. Conf. Comp. Vision Pattern Rec., pages II:786–794, Kauai, Hawaii, December 2001.
[44] H. Sidenbladh and M. Black. Learning the statistics of people in images and videos. Int. J. Computer Vision, 54(1/2/3):183–209, 2003.
[45] H. Sidenbladh, M.J. Black, and D.J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In Proc. Europ. Conf. Computer Vision, pages II:702–718, Dublin, Ireland, June 2000.
[46] H. Sidenbladh, M.J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In Proc. Europ. Conf. Computer Vision, pages I:784–800, Copenhagen, Denmark, May 2002.
[47] H. Silverman and E. Kirtman. A two-stage algorithm for determining talker location from linear microphone array data. Computer Speech and Language, 6:129–152, 1992.
[48] C. Sminchisescu and B. Triggs. Covariance scaled sampling for monocular 3D body tracking. In Proc. Conf. Comp. Vision Pattern Rec., pages I:447–454, Kauai, Hawaii, December 2001.
[49] C. Sminchisescu and B. Triggs. Hyperdynamics importance sampling. In Proc. Europ. Conf. Computer Vision, pages I:769–783, Copenhagen, Denmark, May 2002.
[50] M. Spengler and B. Schiele. Towards robust multi-cue integration for visual tracking. In Int. Workshop on Computer Vision Systems, Vancouver, Canada, July 2001.
[51] P. Svaizer, M. Matassoni, and M. Omologo. Acoustic source location in a three-dimensional space using crosspower spectrum phase. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 231–234, 1997.
[52] A. Thayananthan, B. Stenger, Ph. Torr, and R. Cipolla. Shape context and chamfer matching in cluttered scenes. In Proc. Conf. Comp. Vision Pattern Rec., pages 127–133, Madison, WI, June 2003.
[53] K. Toyama and A. Blake. Probabilistic tracking in a metric space. In Proc. Int. Conf. Computer Vision, pages II:50–57, Vancouver, Canada, July 2001.
[54] J. Vermaak, A. Blake, M. Gangnet, and P. Perez. Sequential Monte Carlo fusion of sound and vision for speaker tracking. In Proc. IEEE Int. Conf. Computer Vision, pages I:741–746, Vancouver, Canada, July 2001.
[55] J. Vermaak, P. Perez, M. Gangnet, and A. Blake. Towards improved observation models for visual tracking: selective adaptation. In Proc. Europ. Conf. Computer Vision, pages I:645–660, Copenhagen, Denmark, May 2002.
[56] W. Vieux, K. Schwerdt, and J. Crowley. Face-tracking and coding for video compression. In Proc. Int. Conf. on Computer Vision Systems, pages 151–160, Las Palmas, Spain, January 1999.
[57] H. Wang and P. Chu. Voice source localization for automatic camera pointing system in videoconferencing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 187–190, 1997.
[58] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Anal. Machine Intell., 19(7):780–785, 1997.
[59] Y. Wu and T.S. Huang. A co-inference approach to robust visual tracking. In Proc. Int. Conf. Computer Vision, pages II:26–33, Vancouver, Canada, July 2001.

Patrick Pérez was born in 1968. He graduated from École Centrale Paris, France, in 1990 and received the Ph.D. degree at the University of Rennes, France, in 1993. After one year as an Inria post-doctoral researcher in the Dpt of Applied Mathematics at Brown University (Providence, USA), he was appointed at Inria in 1994 as a full time researcher. In 2000, he joined Microsoft Research, Cambridge, U.K. His research interests include probabilistic models for understanding, analysing, and manipulating still and moving images.


Jaco Vermaak was born in South Africa in 1969. He received the B.Eng. and M.Eng. degrees from the University of Pretoria, South Africa in 1993 and 1996, respectively, and the Ph.D. degree at the University of Cambridge, U.K. in 2000. He worked as a post-doctoral researcher at Microsoft Research, Cambridge, U.K. between 2000 and 2002. He is currently employed as a senior research associate in the Signal Processing Group of the Cambridge University Engineering Department. His research interests include audio-visual tracking techniques, multi-media manipulation, statistical signal processing methods and machine learning.

Andrew Blake was on the faculty of Computer Science at the University of Edinburgh, and also a Royal Society Research Fellow, until 1987, then at the Department of Engineering Science in the University of Oxford, where he became a Professor in 1996, and Royal Society Senior Research Fellow in 1998. In 1999 he was appointed Senior Research Scientist at Microsoft Research, Cambridge, while continuing as visiting Professor at Oxford. His research interests are in computer vision, signal processing and learning. He has published a number of papers in vision, a book with A. Zisserman ("Visual Reconstruction", MIT Press), edited "Active Vision" with Alan Yuille (MIT Press) and a book ("Active Contours", Springer-Verlag) with Michael Isard. He was elected Fellow of the Royal Academy of Engineering in 1998.