

Pattern Recognition Letters 28 (2007) 1483–1491

Robust template tracking with drift correction

David Schreiber

Advanced Computer Vision GmbH – ACV, Donau-City-Strasse 1, A-1220 Vienna, Austria
Fax: +43 50 5504150. E-mail address: [email protected]

Received 3 July 2006; received in revised form 11 January 2007; available online 27 March 2007. doi:10.1016/j.patrec.2007.03.007

Communicated by R. Davies

Abstract

We propose an efficient robust version of the Lucas–Kanade template matching algorithm. The robust weights used by the algorithm are based on evidence which is accumulated over many frames. We also present a robust extension of the algorithm proposed by Matthews et al. [Matthews, I., Ishikawa, T., Baker, S., 2004. The template update problem. IEEE Trans. Pattern Anal. Machine Intell. 26 (6), 810–815] which corrects the template drift. We demonstrate that in terms of tracking accuracy, the robust version of the drift-correcting algorithm outperforms the original algorithm, while still remaining extremely fast.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Template tracking; The Lucas–Kanade algorithm; Robust least squares

1. Introduction

The tracking problem can be simply defined as follows: Given a current frame of a video and the location of an object in the previous frame, find its location in the current frame. The three main categories into which most algorithms fall are feature-based tracking (e.g. Beymer et al., 1997; Smith and Brady, 1995), contour-based tracking (e.g. Isard and Blake, 1998; Yokoyama and Poggio, 2005) and region-based tracking. In the last category, the region's content is used either directly (template matching, e.g. Lucas and Kanade, 1981), or is represented by a non-parametric description such as a histogram (most notably, kernel-based tracking using the mean-shift algorithm, e.g. Bradski, 1998 and Comaniciu et al., 2003).

Template tracking dates back to Lucas and Kanade (1981). An object is tracked through a video sequence by extracting a template in the first frame and then finding the region which matches the template as closely as possible in the remaining frames.


Template tracking has been extended in various ways (Matthews et al., 2004), such as: (1) to allow arbitrary parametric transformations of the template (Bergen et al., 1992), (2) to allow linear appearance variations (Black and Jepson, 1998; Hager and Belhumeur, 1998), (3) to be more efficient (Hager and Belhumeur, 1998; Baker and Matthews, 2004), and (4) to deal with cases where the object is partially occluded or where the template contains pixels which belong to the background (Hager and Belhumeur, 1998; Ishikawa et al., 2002). The combination of these ideas has yielded non-rigid active appearance models, such as Cootes et al. (2001) and Sclaroff and Isidoro (1998).

The underlying assumption behind template tracking is that the appearance of the object remains the same throughout the entire video sequence, an assumption which is often violated. One solution to this problem is to update the template every frame (or every n frames) with a new template extracted from the current image at the current location of the template. The problem with this naïve strategy is that the template drifts. Small errors are introduced in the location of the template. With each update, these errors accumulate and the template steadily drifts away from the object (Matthews et al., 2004).


A remedy for the drift problem was proposed in (Matthews et al., 2004). As well as maintaining a current estimate of the template, the algorithm also retains the first template from the first frame. The template is first updated as in the naïve algorithm. To eliminate drift, this updated template is then aligned with the first template to give the final update.

However, the drift-correcting algorithm in (Matthews et al., 2004) is still sensitive to variations in the appearance of the object relative to the first template. If the appearance variations are too large (e.g. the object is partially occluded), the alignment of the first template with the current frame, which is the second step of the drift-correcting algorithm, might fail. If the appearance change of the object persists over many frames, the drift builds up just as in the case of the naïve algorithm. This effect is even more apparent in realistic scenarios where the bounding box is not initialised by hand (such as in the examples given by Matthews et al., 2004), but rather is produced by an automatic detection module. In such a case, the bounding box does not match the outer contour of the object perfectly and thus contains considerably more background pixels.

To handle partial occlusions, appearance variations and the presence of background pixels, robust template matching algorithms were proposed (e.g. Hager and Belhumeur, 1998 and Ishikawa et al., 2002). The goal of these robust tracking algorithms is to use a weighted least squares process, such that occluded regions, background pixels and regions where the brightness has changed are considered as outliers and suppressed. However, these robust algorithms require a trade-off between efficiency and accuracy. Namely, in (Hager and Belhumeur, 1998) the Hessian matrix depends on outliers, while in (Ishikawa et al., 2002) the template is divided into a few blocks, assuming a constant weight for each block.

In this paper we present an alternative robust template matching algorithm where the robust weights do not change during the tracking iterations for a given frame, but are rather adaptively updated for each new frame. This way we achieve a robust tracking algorithm which is extremely fast, while not requiring any of the compromises made in (Hager and Belhumeur, 1998) or in (Ishikawa et al., 2002). This novel robust template matching algorithm is then combined with the drift-correcting algorithm proposed in (Matthews et al., 2004), to provide a robust drift-correcting algorithm.

The rest of this paper is organized as follows: In Section 2 we outline the inverse compositional algorithm of Baker and Matthews (2004), its robust versions of Ishikawa et al. (2002) and Hager and Belhumeur (1998), and the drift-correcting extension of Matthews et al. (2004). In Section 3 we present our alternative robust algorithm and then show how it can extend the drift-correcting algorithm. In Section 4 we give a qualitative as well as quantitative evaluation of our proposed algorithm. Section 5 contains our conclusions and an outline of future work.

2. Outline of template tracking algorithms

2.1. The inverse compositional algorithm

In the rest of this paper, we follow the notation of Matthews et al. (2004). Let $I_n(\mathbf{x})$ stand for the $n$th image in a given video sequence, where $\mathbf{x} = (x, y)^T$ are the pixel coordinates and $n = 0, 1, 2, \ldots$ is the frame number. A sub-region of the initial frame $I_0(\mathbf{x})$ is extracted and becomes the template $T(\mathbf{x})$. Let $\mathbf{W}(\mathbf{x};\mathbf{p})$ denote the parametrized set of allowed deformations of the template, where $\mathbf{p} = (p_1, \ldots, p_k)^T$ is a vector of parameters. The warp $\mathbf{W}(\mathbf{x};\mathbf{p})$ takes the pixel $\mathbf{x}$ in the coordinate frame of the template $T(\mathbf{x})$ and maps it to a sub-pixel location $\mathbf{W}(\mathbf{x};\mathbf{p})$ in the coordinate frame of the image $I_n(\mathbf{x})$. In general, the set of allowed warps depends on the types of motion we expect from the object being tracked. For example, if the object is roughly planar and is parallel to the image plane, translating but not rotating, we use the following 2D image warp with three parameters, $\mathbf{p} = (S, T_x, T_y)^T$:

$$\mathbf{W}(\mathbf{x};\mathbf{p}) = S \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} T_x \\ T_y \end{pmatrix}. \qquad (1)$$
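
To make the warp concrete, here is a minimal sketch in Python/NumPy (our choice of language; the paper's own implementation is in C with IPP) that applies the three-parameter warp of Eq. (1) to template coordinates and samples the image there. The function names are ours, for illustration only:

```python
import numpy as np

def warp_coords(xs, ys, p):
    """Apply W(x; p) = S*(x, y)^T + (Tx, Ty)^T, i.e. Eq. (1)."""
    S, Tx, Ty = p
    return S * xs + Tx, S * ys + Ty

def sample_warped(image, xs, ys, p):
    """Sample I(W(x; p)) with nearest-neighbour rounding.
    (The paper warps to sub-pixel locations; bilinear interpolation
    would be the more faithful choice.)"""
    wx, wy = warp_coords(xs, ys, p)
    wx = np.clip(np.rint(wx).astype(int), 0, image.shape[1] - 1)
    wy = np.clip(np.rint(wy).astype(int), 0, image.shape[0] - 1)
    return image[wy, wx]
```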

The goal of template tracking is to find the best match to the template in every subsequent frame of the video. In the original Lucas–Kanade algorithm (Lucas and Kanade, 1981), the best match is found by minimizing the following SSD function, where the summation is over all pixels of the template:

$$\sum_{\mathbf{x} \in T} \big[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \big]^2. \qquad (2)$$

The minimization is performed with respect to the warping parameters $\mathbf{p}$. Due to its non-linear nature, the optimization is done iteratively, solving for increments $\Delta\mathbf{p}$ to the already known parameters $\mathbf{p}$:

$$\sum_{\mathbf{x} \in T} \big[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p} + \Delta\mathbf{p})) - T(\mathbf{x}) \big]^2. \qquad (3)$$

The inverse compositional algorithm (Baker and Matthews, 2004) is a more efficient version of the algorithm, where the roles of the template and the new image are switched. In this case, the following expression is to be iteratively minimized:

$$\sum_{\mathbf{x} \in T} \big[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{W}(\mathbf{x};\Delta\mathbf{p})) \big]^2. \qquad (4)$$

Performing a first-order Taylor expansion in Eq. (4) gives

$$\sum_{\mathbf{x} \in T} \left[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) - \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \Delta\mathbf{p} \right]^2. \qquad (5)$$

Minimizing Eq. (5) is a least squares problem. The closed-form solution is obtained by taking the partial derivative of Eq. (5) and setting it equal to zero. The solution obtained is (Baker and Matthews, 2004):


$$\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x} \in T} \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \big[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \big], \qquad (6)$$

where $H$ is the Gauss–Newton approximation to the Hessian matrix:

$$H = \sum_{\mathbf{x} \in T} \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right], \qquad (7)$$

which does not depend on the warping parameters and thus can be pre-computed.
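
As a rough illustration of why Eqs. (6) and (7) are efficient, the following Python/NumPy sketch precomputes the steepest-descent images and the Hessian once, and only re-evaluates the error image inside the loop. All names are ours, and one simplification is worth flagging: the parameter update below is additive, whereas the true inverse compositional algorithm inverts and composes warps:

```python
import numpy as np

def precompute(template, jac):
    """template: (H, W) array; jac: (H, W, 2, 3) array holding dW/dp at each pixel
    (for Eq. (1): dW/dp = [[x, 1, 0], [y, 0, 1]])."""
    gy, gx = np.gradient(template.astype(float))
    grad = np.stack([gx, gy], axis=-1)               # (H, W, 2): nabla T
    sd = np.einsum('hwi,hwij->hwj', grad, jac)       # (H, W, 3): nabla T * dW/dp
    H = np.einsum('hwi,hwj->ij', sd, sd)             # Eq. (7), 3x3 Hessian
    return sd, np.linalg.inv(H)

def track(image, template, jac, p, sample, n_iters=20, tol=1e-3):
    """sample(image, p) must return I_n(W(x; p)) resampled to template shape."""
    sd, H_inv = precompute(template, jac)            # done once per template
    for _ in range(n_iters):
        err = sample(image, p) - template            # I_n(W(x; p)) - T(x)
        dp = H_inv @ np.einsum('hwi,hw->i', sd, err) # Eq. (6)
        p = p + dp   # simplification: the true algorithm composes the warps
        if np.linalg.norm(dp) < tol:
            break
    return p
```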

2.2. The iteratively updated robust algorithm

In (Ishikawa et al., 2002; Baker et al., 2003), the robust (modified weights) inverse compositional algorithm is derived by minimizing

$$\sum_{\mathbf{x} \in T} \rho\big( [ I(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{W}(\mathbf{x};\Delta\mathbf{p})) ]^2 \big), \qquad (8)$$

where $\rho$ is a robust estimator (Huber, 1981). The weighted least squares solution is a generalization of Eq. (6) (Baker et al., 2003):

$$\Delta\mathbf{p} = H_\rho^{-1} \sum_{\mathbf{x} \in T} \rho'\big( [ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) ]^2 \big) \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \big[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \big], \qquad (9)$$

where the robust Hessian matrix is

$$H_\rho = \sum_{\mathbf{x} \in T} \rho'\big( [ I(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) ]^2 \big) \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]. \qquad (10)$$

However, the Hessian matrix contains a weighting function which is updated every iteration and thus cannot be pre-computed. The solution suggested in (Ishikawa et al., 2002) is to subdivide the template into a set of blocks. Based on the spatial coherence of the outliers, the weight is assumed to be constant on each block. Assuming the number of blocks to be much smaller than the number of pixels in the template, the complexity of re-computing the Hessian becomes insignificant. The algorithm proceeds by sorting the weights of all the blocks, and then considering a fixed percentage of them as outliers.

A different robust algorithm (the modified residuals) was proposed in (Hager and Belhumeur, 1998). The algorithm is similar to but slightly different from the inverse compositional algorithm. In this approach, the Hessian does not depend on the weighting function and need not be re-computed each iteration. Thus, no block approximation is necessary. In (Baker et al., 2003) a comparison between the modified weights and the modified residuals algorithms is performed. The conclusion is that the modified weights algorithm outperforms the modified residuals algorithm due to the fact that in the former, the Hessian does not depend on outliers.

The modified weights algorithm needs to decide for each block, based on the error of each block, whether it contains outliers. However, since different blocks have different amounts of texture, the error function must be normalized, for example, by the variance of the pixels in the block (Ishikawa et al., 2002). This setting is made ad hoc, however, and cannot guarantee that the algorithm always properly distinguishes between object and background pixels. The fact that the decision is based on blocks and not on individual pixels renders such a misclassification even more likely. An additional disadvantage of the robust algorithm of Ishikawa et al. (2002) is that, although the weighted least squares algorithm is theoretically guaranteed to converge to the correct solution (Huber, 1981), it might not do so in practice. As a non-linear minimization problem which is solved using a gradient descent method, the robust algorithm might converge to a local minimum if the outliers in the template are not suppressed enough during the first iteration.

2.3. The inverse compositional algorithm with drift correction

The problem addressed in (Matthews et al., 2004) is how to update the template every frame and still avoid its drift. The idea is to use the first template $T_1(\mathbf{x}) = I_0(\mathbf{x})$ to correct the drift in $T_{n+1}(\mathbf{x})$. The image $I_n(\mathbf{x})$ is first tracked with template $T_n(\mathbf{x})$, starting from the previous warp parameters $\mathbf{p}_{n-1}$. The result is the tracked image $I_n(\mathbf{W}(\mathbf{x};\mathbf{p}_n))$ and the parameters $\mathbf{p}_n$. The more accurate parameters $\mathbf{p}_n^*$ are obtained by tracking $T_1(\mathbf{x})$ in $I_n(\mathbf{x})$ starting at parameters $\mathbf{p}_n$. The new template $T_{n+1}(\mathbf{x}) = I_n(\mathbf{W}(\mathbf{x};\mathbf{p}_n^*))$ is then computed with the updated parameters $\mathbf{p}_n^*$.

This algorithm can be interpreted in two equivalent ways (Matthews et al., 2004); a code sketch of the two-stage update is given after the list:

1. The template is updated every frame, but it must additionally be realigned to the original template $T_1(\mathbf{x})$ to remove the drift.

2. The template is not updated at all, but tracking using the constant template $T_1(\mathbf{x})$ is initialized by tracking first with $T_n(\mathbf{x}) = I_{n-1}(\mathbf{W}(\mathbf{x};\mathbf{p}^*_{n-1}))$ to avoid local minima.
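
In outline, one frame of the drift-correcting update might be sketched as follows (a hedged Python outline; `align` stands for any Lucas–Kanade-style alignment routine such as the one sketched in Section 2.1, and `extract` cuts the template region out of the image, both assumptions of ours):

```python
def drift_correcting_step(I_n, T1, T_n, p_prev, align, extract):
    """One frame of the drift-correcting scheme of Matthews et al. (2004)."""
    # Stage 1: naive update -- track the current template from the previous warp.
    p_n = align(I_n, T_n, p_init=p_prev)
    # Stage 2: drift correction -- re-align the FIRST template, starting at p_n.
    p_star = align(I_n, T1, p_init=p_n)
    # The new template is extracted with the corrected parameters.
    T_next = extract(I_n, p_star)
    return p_star, T_next
```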

3. An alternative robust algorithm

3.1. Extending the inverse compositional algorithm

In this section we introduce our robust extension to the inverse compositional algorithm that uses a fixed template. Denoting by $\omega_n(\mathbf{x})$ the robust weights per pixel, used for tracking the template $T_1(\mathbf{x}) = I_0(\mathbf{x})$ in image $I_n(\mathbf{x})$, we rewrite Eqs. (9) and (10) as follows. The robust least squares solution is

$$\Delta\mathbf{p} = H_\rho^{-1} \sum_{\mathbf{x} \in T} \omega_n(\mathbf{x}) \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \big[ I_n(\mathbf{W}(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \big] \qquad (11)$$


and the robust Hessian matrix is

$$H_\rho = \sum_{\mathbf{x} \in T} \omega_n(\mathbf{x}) \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]^T \left[ \nabla T \frac{\partial \mathbf{W}}{\partial \mathbf{p}} \right]. \qquad (12)$$

During the tracking iterations of the first template $T_1(\mathbf{x}) = I_0(\mathbf{x})$ in the current image $I_n(\mathbf{x})$, the robust weights $\omega_n(\mathbf{x})$ are fixed and thus the Hessian can be pre-computed. After the new warp parameters $\mathbf{p}_n$ are found, the weights $\omega_n(\mathbf{x})$ are updated as follows. First, the current error function between the template $T_1(\mathbf{x})$ and the warped image $I_n(\mathbf{W}(\mathbf{x};\mathbf{p}_n))$ is computed,

$$f\big( [ I_n(\mathbf{W}(\mathbf{x};\mathbf{p}_n)) - T_1(\mathbf{x}) ] \big), \qquad (13)$$

where $f$ is some norm function. Next, the cumulative error function is updated,

$$E_{n+1}(\mathbf{x}) = (1 - \alpha) \, E_n(\mathbf{x}) + \alpha \, f\big( [ I_n(\mathbf{W}(\mathbf{x};\mathbf{p}_n)) - T_1(\mathbf{x}) ] \big), \qquad (14)$$

where $\alpha$ is a small parameter (typically 0.1) which determines the adaptation rate. Eq. (14) is an approximation of a running average, a common practice in the context of background subtraction (Stauffer and Grimson, 1999). Finally, the robust weights are updated according to

$$\omega_{n+1}(\mathbf{x}) = g(E_{n+1}(\mathbf{x})), \qquad (15)$$

where $g(x)$ is a robust estimator. Typically, robust statistics needs a scale, which is related to the median of the distribution (Meer et al., 1991). Computing a median for $N$ pixels requires a run-time of $O(N \log N)$, which is undesirable. However, in practice the median can be accurately approximated in $O(N)$ time, as discussed in the next section.
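
A minimal sketch of the per-frame weight update of Eqs. (13)-(15), assuming Python/NumPy, the absolute-difference norm of Eq. (21), and the binary estimator $g$ of Eq. (22) (both introduced later, in Section 4):

```python
import numpy as np

def update_weights(E, warped, T1, alpha=0.1):
    """One application of Eqs. (13)-(15).
    E: cumulative error E_n(x); warped: I_n(W(x; p_n)); T1: first template."""
    f = np.abs(warped - T1)                 # Eq. (13) with the norm of Eq. (21)
    E_next = (1.0 - alpha) * E + alpha * f  # Eq. (14), running-average update
    thresh = 1.4826 * np.median(E_next)     # robust scale; see Eq. (22)
    # (The paper approximates the median in O(N); np.median is exact here.)
    w_next = (E_next <= thresh).astype(float)  # Eq. (15) with the binary g
    return E_next, w_next
```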

We have observed that typically, after a few frames, the accumulated robust mask $\omega_n(\mathbf{x})$ already shows a good segmentation of the tracked object from the background. Afterwards, any change of appearance of the object, or partial occlusion, is accurately incorporated into the error function $E_n(\mathbf{x})$ and the weights $\omega_n(\mathbf{x})$. Thus, all background pixels which have a non-negligible amount of texture are likely to be masked out by this process. A side effect of this process is that especially strong edges of the object will be masked out too, as due to quantization error or illumination change, these edges change their location and produce a large misalignment error. Note that in our approach, there is no need for an ad hoc normalization of the error function, as discussed in Section 2.2. The only arbitrary constant is the learning rate $\alpha$, which will be discussed later.

3.2. Extending the drift-correcting algorithm

As a further step, we propose a robust version of the drift-correcting algorithm of Matthews et al. (2004). In order to distinguish between warping the first template pixels $T_1(\mathbf{x})$ and warping the current template pixels $T_n(\mathbf{x})$ on the current image $I_n(\mathbf{x})$, we use a slightly different notation than used in (Matthews et al., 2004) and in Section 2.3. The notation $\mathbf{p}(n_1 \to n_2)$ will be used to denote the warping parameters from template $T_{n_1}(\mathbf{x})$ to the current image $I_{n_2}(\mathbf{x})$.

Thus, given the current image $I_n(\mathbf{x})$, the first template $T_1(\mathbf{x})$, the current template $T_n(\mathbf{x})$, the cumulative error function $E_n(\mathbf{x})$, and the previous warping parameters $\mathbf{p}^*(0 \to n-1)$ and $\mathbf{p}^*(n-2 \to n-1)$, the robust drift-correcting algorithm can be summarized as follows (a code sketch is given after the list):

1. Warp the error function $E_n(\mathbf{x})$ to fit the current template $T_n(\mathbf{x})$,

$$E_n^W(\mathbf{x}) = E_n(\mathbf{W}^{-1}(\mathbf{x};\mathbf{p}^*(0 \to n-1))). \qquad (16)$$

2. Compute the new robust weights for the warped error function,

$$\omega_n^W(\mathbf{x}) = g(E_n^W(\mathbf{x})). \qquad (17)$$

3. Use the new robust algorithm (Section 3.1) with weights $\omega_n^W(\mathbf{x})$ to track $I_n(\mathbf{x})$ with template $T_n(\mathbf{x})$, using the initial guess $\mathbf{p}^*(n-2 \to n-1)$, to obtain the warping parameters $\mathbf{p}(n-1 \to n)$.

4. Combine $\mathbf{p}^*(0 \to n-1)$ and $\mathbf{p}(n-1 \to n)$ to obtain an initial guess for the warping parameters $\mathbf{p}(0 \to n)$.

5. Compute the new robust weights for the error function,

$$\omega_n(\mathbf{x}) = g(E_n(\mathbf{x})). \qquad (18)$$

6. Use the new robust algorithm with weights $\omega_n(\mathbf{x})$ to track $I_n(\mathbf{x})$ with template $T_1(\mathbf{x})$, using the initial guess $\mathbf{p}(0 \to n)$, to obtain the warping parameters $\mathbf{p}^*(0 \to n)$.

7. Combine $\mathbf{p}^*(0 \to n-1)$ and $\mathbf{p}^*(0 \to n)$ to obtain the updated warp parameters $\mathbf{p}^*(n-1 \to n)$.

8. Warp $I_n(\mathbf{x})$ to obtain the temporary error function,

$$f_n(\mathbf{x}) = f\big( [ I_n(\mathbf{W}(\mathbf{x};\mathbf{p}^*(0 \to n))) - T_1(\mathbf{x}) ] \big). \qquad (19)$$

9. Update the cumulative error function for the next frame,

$$E_{n+1}(\mathbf{x}) = (1 - \alpha) \, E_n(\mathbf{x}) + \alpha \, f_n(\mathbf{x}). \qquad (20)$$

The algorithm described above is similar to the drift-correcting algorithm of Matthews et al. (2004). However, it uses robust weights that are updated from frame to frame. Granted that the median of the cumulative error function $E_n(\mathbf{x})$ can be sufficiently approximated in linear time, the asymptotic complexity of the two algorithms is the same.

4. Comparison and evaluation

We now present both qualitative and quantitative comparisons between the drift-correcting algorithm of Matthews et al. (2004) and the extended robust version which we have proposed. Our data consists of video sequences captured with a moving camera, tracking the leading vehicle. We have tracked the back part of vehicles, assuming the transformation of Eq. (1). We have used the absolute difference error norm,

$$f_n(\mathbf{x}) = \big| I_n(\mathbf{W}(\mathbf{x};\mathbf{p}_n)) - T_1(\mathbf{x}) \big|. \qquad (21)$$


In addition, we have used a simplified scheme, namely binary weights, derived from the following simple robust estimator:

$$\omega_n(\mathbf{x}) = \begin{cases} 1, & \text{if } E_n(\mathbf{x}) \le \operatorname{median}(E_n(\mathbf{x})) \cdot 1.4826, \\ 0, & \text{otherwise.} \end{cases} \qquad (22)$$

The factor 1.4826 is introduced for consistent estimation of the robust standard deviation in the presence of Gaussian noise (Meer et al., 1991). The algorithm was implemented in C, using the IPP library, achieving ultra real-time performance on a 3.00 GHz Pentium. In particular, an approximation of the median was efficiently implemented (linear in the number of pixels) using a multi-scale cumulative histogram.
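
The $O(N)$ median approximation is not spelled out beyond "a multi-scale cumulative histogram"; one plausible single-scale stand-in, assuming 8-bit error values, is sketched below:

```python
import numpy as np

def approx_median_u8(E, bins=256):
    """Approximate the median of an 8-bit error image in O(N) time via a
    cumulative histogram (single-scale stand-in for the paper's multi-scale
    scheme; accurate to histogram-bin resolution)."""
    hist = np.bincount(E.ravel().astype(np.uint8), minlength=bins)
    cdf = np.cumsum(hist)
    return int(np.searchsorted(cdf, cdf[-1] / 2.0))  # first bin past half the mass
```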

Fig. 1. A qualitative comparison of the robust and non-robust drift-correcting algorithms. The robust algorithm tracks the vehicle correctly across the entire sequence. The non-robust algorithm is distracted by a pedestrian crossing the road, which partially occludes the vehicle, and starts to drift from that moment on. The small image on the left is the zoomed bounding box, while for the robust algorithm, the small image on the right shows also the binary mask extracted from the accumulated error function.

4.1. Qualitative evaluation

4.1.1. First example

In Fig. 1, some sample frames are shown from a video sequence containing 200 frames. While the leading vehicle is being tracked, a pedestrian is crossing the road, considerably occluding the vehicle. The left column of Fig. 1 shows the result of using the drift-correcting algorithm of Matthews et al. (2004), while the right column shows our robust version. The small image attached to each frame on the bottom left side shows the current template enlarged. The small image attached to the bottom right side on the right column shows the binary mask used by our robust algorithm. We have manually initialized the first template to be bigger than the vehicle itself by about 10 pixels in each dimension. This increased the number of template pixels by 40%, adding more background pixels.

Using the original drift-correcting algorithm, as the pedestrian enters the bounding box in frame 95, the tracking of the first template does not converge. The template starts to drift, and by the time the pedestrian leaves the bounding box, in frame 107, the template has considerably expanded and contains much more background. By frame 200, the template is about four times larger than the vehicle itself. In contrast, with our robust algorithm, the pedestrian does not disturb the tracking at all. The alignment of the template is accurate along the entire scene and there are no convergence failures.

We have found the robust algorithm, on this sequence as well as on others, to be highly insensitive to the value of $\alpha$. It worked successfully for any value in the range $0.05 \le \alpha \le 0.95$.

Fig. 2. Qualitative comparisons of the robust and non-robust drift-correcting algorithms under extreme conditions of low contrast and heavy shadows. The robust algorithm tracks the vehicle correctly across 141 frames. The non-robust algorithm fails to correct the drift at frame 27, and starts to drift from that moment on. The small image on the left is the zoomed bounding box, while for the robust algorithm, the small image on the right shows also the binary mask extracted from the accumulated error function.

As the right column of Fig. 1 shows, the binary mask used is able to successfully segment the object being tracked. In addition, it masks out some problematic regions on the vehicle itself, such as reflections on the rear window.

4.1.2. Second example

Our next example, in Fig. 2, deals with an extremely problematic video sequence. It manifests low contrast and heavy shadows. The tracked vehicle is relatively close to the camera and is driving on a curved road, all adding to the variations in appearance. In addition, we have manually initialized the first template adding 23% extra in size. As Fig. 2 shows, the original drift-correcting algorithm is able to track correctly until frame 27. From then on, it constantly fails to align the first template and drift builds up. On frame 53 tracking was terminated, as even the alignment of frame 52 to frame 53 did not converge. On the other hand, the robust version is able to track correctly until frame 142, without any convergence failures. As can be seen in Fig. 2, the binary mask segments the vehicle quite nicely from the background, while also masking out the somewhat problematic bumper.

Fig. 3. (a) First frame of the video sequence. The smaller bounding box is the ground truth, but the algorithms were initialized with the bigger bounding box. (b) Comparison of accuracy (overlap error in percentage w.r.t. the ground truth bounding boxes). (c) Comparison of the number of iterations needed to align the first template with the current image. (d) Comparison of run-time in milliseconds to align the first template with the current image.

4.2. Quantitative evaluation

Next, we estimate how much more robust the drift-correcting tracking becomes when we use robust weights accumulated over time. For a quantitative comparison between the non-robust and robust versions of the drift-correcting algorithm, we take another sequence, 201 frames long, which is less demanding, so that both algorithms can perform well in the sense that the iterations converge along the entire sequence. The ground truth bounding boxes for the whole sequence were acquired using Lucas–Kanade tracking with hand initialization. The ground truth bounding boxes accurately match the outer contour of the vehicle being tracked.

Next, we have initialized the experiment with a bounding box which is considerably bigger than the ground truth bounding box (25% larger in area). Both bounding boxes are shown in Fig. 3a. To make things even more difficult, and to give both algorithms a fair chance with every new frame independently, the initial guess for the warp parameters $\mathbf{p}^*(n-1 \to n)$ was reset to zero for all frames. On the other hand, the parameters $\mathbf{p}^*(0 \to n)$ obtained by both algorithms were used to project the ground truth bounding box $T_1(\mathbf{x})$ to the current image and compare it against the current ground truth bounding box $T_{n+1}(\mathbf{x})$. The comparison between the projected template $T_1(\mathbf{W}(\mathbf{x};\mathbf{p}^*(0 \to n)))$ and $T_{n+1}(\mathbf{x})$ is performed by computing the overlap error in percent, as shown in Fig. 3b. As can be seen from Fig. 3b, the error is significantly larger in the non-robust case (about six times larger on average) and the ratio between the two errors consistently increases with time.
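For reference, the overlap error between two axis-aligned bounding boxes can be computed as below; this is a sketch under our assumption that "overlap error" means one minus intersection-over-union, in percent, since the paper does not define the measure precisely:

```python
def overlap_error(box_a, box_b):
    """Boxes as (x0, y0, x1, y1). Returns (1 - IoU) * 100, in percent."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return 100.0 * (1.0 - inter / union)
```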

In Fig. 3c, the number of iterations needed to align $T_1(\mathbf{x})$ with $I_n(\mathbf{x})$ is shown for the whole sequence, for both algorithms. Fig. 3d shows the run-time in milliseconds for both algorithms. As can be seen from these plots, the number of iterations required by the robust algorithm is considerably lower than in the case of the original drift-correcting algorithm (about a third on average). On the other hand, the robust algorithm is indeed slower (40% slower on average). However, we note that the robust algorithm is still extremely fast (4 milliseconds tracking time for an object that contains 3000 pixels on a 3.00 GHz PC).

5. Conclusions

We have introduced a novel robust extension of the Lucas–Kanade template matching algorithm, in particular, a robust extension of the inverse compositional algorithm, which is a more efficient version. In our approach, the weights are computed from evidence which is accumulated over many frames. We have presented as well a robust extension of the drift-correcting algorithm of Matthews et al. (2004). We have performed a qualitative comparison between the original and the robust drift-correcting algorithms, and demonstrated that under significant partial occlusion or difficult illumination conditions, the non-robust algorithm often fails to align the first template and consequently suffers from drift, while our robust algorithm tracks successfully. Finally, we have performed a quantitative evaluation and have shown that the tracking accuracy of the robust version is far better than that of the non-robust one. Although the robust version is indeed slower, it is nevertheless extremely fast, achieving, for example, a rate between 210 and 360 frames per second on the sequence analysed in Fig. 3.

We have outlined the limitations of previous robust algorithms. The exact modified weights algorithm (Ishikawa et al., 2002) is not efficient, due to the fact that the Hessian matrix depends on the updated weights and needs to be re-computed during the iterations. Therefore, a trade-off between robustness and efficiency needs to be made, either by dividing the template into a small number of blocks, assuming a constant weight for each block, or by eliminating the weight dependency from the Hessian (similarly to the modified residuals algorithm of Hager and Belhumeur, 1998). In addition, the assignment of weights on the first iteration might mask out dominant parts of the object while at the same time not masking dominant parts of the background. This might cause the gradient descent algorithm to converge to a local minimum. Thus, the advantages of our robust algorithm are two-fold. First, no compromise of robustness is needed in order to achieve efficiency. Second, as the errors are accumulated over many frames, a clear segmentation of object and background is achieved. We plan to perform a comparative evaluation of all these robust algorithms in the future.

Nevertheless, if one wishes to achieve even higher speed, we propose two possible approximations to our robust drift-correcting algorithm. The first variant is a hybrid algorithm where the tracking of template $T_n(\mathbf{x})$ in image $I_n(\mathbf{x})$ is performed non-robustly, i.e. using the original inverse compositional algorithm, while the tracking of the first template $T_1(\mathbf{x})$ in image $I_n(\mathbf{x})$ is done using our robust version. The second variant uses robust weights in the tracking of both $T_1(\mathbf{x})$ and $T_n(\mathbf{x})$. However, the robust estimation of the weights $\omega_n(\mathbf{x})$ from the error function $E_n(\mathbf{x})$ is done only once, and the robust weights for the tracking of $T_n(\mathbf{x})$ are obtained directly by warping $\omega_n(\mathbf{x})$. Preliminary evaluation shows that both approximated algorithms indeed achieve better efficiency at the expense of accuracy.

There are two important issues that are left as future work. The first one concerns the case where the tracked object becomes much smaller or larger than the original template. In (Dedeoglu et al., 2006), it was noted that when objects appear small in comparison to the original template, they need to be enlarged through interpolation while being warped back onto the coordinate frame of the template, and this reliance on interpolation degrades performance. In order to better account for low-resolution data, the solution in (Dedeoglu et al., 2006) is instead to average the pixel intensities of the original template. Such considerations are even more involved in our robust drift-correcting algorithm, as the error function $E_n(\mathbf{x})$ is additionally warped forward onto the coordinate frame of the current template, $T_n(\mathbf{x})$. Thus a similar problem occurs when the tracked object becomes much larger than the original template.

Finally, the parameter $\alpha$ (Eq. (14)), which controls the learning rate of the error function $E_n(\mathbf{x})$, was manually set to 0.1 during our experiments. Although we have observed that our robust algorithm is highly insensitive to the value of $\alpha$, we would nevertheless like to investigate this point further, and in particular to enable a dynamic setting of $\alpha$.

References

Baker, S., Matthews, I., 2004. Lucas–Kanade 20 years on: A unifying framework. Internat. J. Comput. Vision 56 (3), 221–255.

Baker, S., Gross, R., Matthews, I., 2003. Lucas–Kanade 20 years on: A unifying framework: Part 2. Technical Report CMU-RI-TR-03-01, Carnegie Mellon University Robotics Institute.

Bergen, J.R., Anandan, P., Hanna, K.J., Hingorani, R., 1992. Hierarchical model-based motion estimation. In: Proc. ECCV, pp. I: 5–10.

Beymer, D., McLauchlan, P.F., Coifman, B., Malik, J., 1997. A real-time computer vision system for measuring traffic parameters. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Black, M.J., Jepson, A.D., 1998. Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. Internat. J. Comput. Vision 26 (1), 63–84.

Bradski, G., 1998. Computer vision face tracking as a component of a perceptual user interface. In: Proc. Workshop on Applications of Computer Vision, pp. 214–219.

Comaniciu, D., Visvanathan, R., Meer, P., 2003. Kernel-based object tracking. IEEE Trans. Pattern Anal. Machine Intell. 25 (5), 564–575.

Cootes, T., Edwards, G., Taylor, C., 2001. Active appearance models. IEEE Trans. Pattern Anal. Machine Intell. 23 (6), 681–685.

Dedeoglu, G., Kanade, T., Baker, S., 2006. The asymmetry of image registration and its application to face tracking. Technical Report CMU-RI-TR-06-06, Carnegie Mellon University Robotics Institute.

Hager, G.D., Belhumeur, P.N., 1998. Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. Pattern Anal. Machine Intell. 20 (10), 1025–1039.

Huber, P.J., 1981. Robust Statistics. Wiley-Interscience.

Isard, M., Blake, A., 1998. Condensation: Conditional density propagation for visual tracking. Internat. J. Comput. Vision 29 (1), 5–28.

Ishikawa, T., Matthews, I., Baker, S., 2002. Efficient image alignment with outlier rejection. Technical Report CMU-RI-TR-02-27, Carnegie Mellon University Robotics Institute.

Lucas, B., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. In: Proc. Internat. Joint Conf. on Artificial Intelligence, pp. 674–679.

Matthews, I., Ishikawa, T., Baker, S., 2004. The template update problem.IEEE Trans. Pattern Anal. Machine Intell. 26 (6), 810–815.


Meer, P., Mintz, D., Rosenfeld, A., 1991. Robust regression methods for computer vision: A review. Internat. J. Comput. Vision 6 (1), 59–70.

Sclaroff, S., Isidoro, J., 1998. Active blobs. In: Proc. 6th Internat. Conf. on Computer Vision, pp. 1146–1153.

Smith, S.M., Brady, J.M., 1995. Asset-2: Real-time motion segmentation and shape tracking. IEEE Trans. Pattern Anal. Machine Intell. 17 (8), 814–820.

Stauffer, C., Grimson, W., 1999. Adaptive background mixture models for real-time tracking. In: Proc. CVPR, pp. II: 246–252.

Yokoyama, M., Poggio, T., 2005. A contour-based moving object detection and tracking. In: Second Joint IEEE Internat. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.