
Available online at www.sciencedirect.com

www.elsevier.com/locate/patrec

Pattern Recognition Letters 29 (2008) 852–861

Generalizing the Lucas–Kanade algorithm for histogram-based tracking

David Schreiber *

Smart System Division, Austrian Research Centers GmbH – ARC, Donau-City-Strasse 1, A-1220 Vienna, Austria

Received 27 July 2007; Available online 18 January 2008

Communicated by B. Kamgar-Parsi

Abstract

We present a novel histogram-based tracking algorithm, which is a generalization of the template matching Lucas–Kanade algorithm (and in particular of the inverse compositional version, which is more efficient). The algorithm does not make use of any spatial kernel. Instead, the dependency of the histogram on the warping parameters is introduced via a feature kernel. This fact helps us to overcome several limitations of kernel-based methods. The target is represented by a collection of patch-based histograms, thus retaining spatial information. A robust statistics scheme assigns weights to the different patches, rendering the algorithm robust to partial occlusions and appearance changes. We present the algorithm for 1-D histograms (e.g. gray-scale); however, extending the algorithm to handle higher dimensional histograms (e.g. color) is straightforward. Our method applies to any warping transformation that forms a group, and to any smooth feature. It has the same asymptotic complexity as the original inverse compositional template matching algorithm. We present experimental results which demonstrate the robustness of our algorithm, using only gray-scale histograms. © 2008 Elsevier B.V. All rights reserved.

Keywords: Template tracking; The Lucas–Kanade algorithm; Histogram-based tracking; Kernel-based tracking; Robust least squares; Pedestrian tracking

1. Introduction

Tracking can be defined simply as follows: given the current frame of a video and the location of an object in the previous frame, find its location in the current frame. The three main categories into which most algorithms fall are feature-based tracking (e.g. Beymer et al., 1997; Smith and Brady, 1995), contour-based tracking (e.g. Isard and Blake, 1998; Yokoyama and Poggio, 2005), and region-based tracking. In the last category, the region’s content is either used directly (template matching, e.g. Lucas and Kanade, 1981) or represented by a non-parametric description such as a histogram (e.g. Perez et al., 2002; Adam et al., 2006), most notably kernel-based tracking using the mean shift algorithm (e.g. Bradski, 1998; Comaniciu et al., 2003).

Kernel-based methods track an object region represented by a spatially weighted intensity histogram. An objective function that compares target and candidate kernel densities is formulated using the Bhattacharyya measure, and tracking is achieved by optimizing this objective function with the iterative mean shift algorithm (e.g. Comaniciu et al., 2003). However, the first kernel-based approaches were restricted to visual tracking problems involving only location (Comaniciu et al., 2003), or location and scale (Collins, 2003). Additional limitations of these methods include: (1) the slow convergence rate of the mean shift optimization; (2) the interaction between the spatial structure of the kernel and that of the image, which forces, for example, Hager et al. (2004) and Fan et al. (2005) to carefully choose a small number of multiple kernels for each specific sequence they process; (3) the loss of spatial information; and (4) the inability to handle occlusions due to the global nature of the template model.

doi:10.1016/j.patrec.2007.12.014

* Tel.: +43 (0) 50550 4282; fax: +43 (0) 50550 4150. E-mail address: [email protected]

These limitations were addressed by many subsequent works. For example, Hager et al. (2004) and Georgescu and Meer (2004) replace the mean-shift optimization of the Bhattacharyya measure by Newton-style iterations of an SSD-like measure (the Matusita metric) and allow for more general transformations than translation and scale. In (Hager et al., 2004) it is shown that the latter optimization method is more efficient than the former, and makes fewer assumptions about the form of the underlying kernel structure. However, as the binning function of the histogram in (Hager et al., 2004) is a binary function, the analytical solution of the SSD minimization is cumbersome and requires the introduction of a large sifting matrix (number of pixels in the region times the number of bins in the histogram). In (Georgescu and Meer, 2004), a non-binary feature kernel is used, and minimization is done using an iterative weighted least-squares. Furthermore, to enhance the localization accuracy, optical flow-based registration is employed too, and both estimations are combined into a single estimation process using the sum of the two. However, both Georgescu and Meer (2004) and Hager et al. (2004) still make use of a global spatial kernel.

The issue of the loss of spatial information has been treated in various forms. In (Elgammal et al., 2003), a joint feature-spatial distribution is used which takes into account both the intensities and their position in the window, where a local spatial domain kernel rather than a global one is used. The tracking is achieved by maximizing likelihood using mean shift. In (Yang et al., 2005), a new similarity measure for this joint distribution is introduced, which is the expectation of the density estimates over the model or target image. To alleviate the quadratic complexity, the improved fast Gauss transform is used. The new similarity measure allows the mean shift algorithm to track more general motion models. In (Birchfield and Rangarajan, 2005), the notion of spatiograms has been introduced, by adding the spatial mean and covariance of the pixel positions to the histogram and then employing mean shift for spatiograms. In (Zhao and Tao, 2005), a correlogram is used and the mean shift algorithm is extended to 3D (location and orientation of the correlogram).

A different approach to histogram-based tracking was introduced by Perez et al. (2002). Based on color histogram distances, a color likelihood is built and then coupled with a dynamical state space model. The resulting posterior distribution is sequentially approximated with a particle filter. This probabilistic approach is further extended to patch-based color modeling to incorporate the spatial layout.

Recently, a new histogram-based tracking approach called Frag-Track (Adam et al., 2006) was introduced, which is not based on an optimization scheme but rather on an exhaustive search (over translations and scales) which is made efficient by using integral histograms. The template object is represented by multiple histograms of multiple rectangular patches of the template. Every patch votes on the possible positions and scales of the object in the current frame, by comparing its histogram with the corresponding image patch histogram. Next, a robust statistic measure is minimized in order to combine the vote maps of the multiple patches. The advantages of this method over optimization-based techniques are, first, that it allows the use of any metric for comparing two histograms, not just analytically tractable ones, and second, that it is less likely to get stuck in a local minimum. However, as the method relies on the use of integral histograms, the number of bins used is limited, and on colour images the method can become quite memory-consuming. Limited accuracy is another issue: the search over different positions and scales is discrete, let alone achieving the sub-pixel accuracy associated with continuous optimisation schemes. In addition, the method is limited to a transformation consisting of translations and scale.

Template tracking, on the other hand, dates back to Lucas and Kanade (1981). The goal of the Lucas–Kanade algorithm is to minimize the sum of squared error between the template and a new image warped back onto the coordinate frame of the template (Baker and Matthews, 2004). The minimization is performed with respect to the warping parameters. Due to its non-linear nature, optimization is done by iteratively solving for increments to the already known warping parameters. In particular, the inverse compositional algorithm (Baker and Matthews, 2004) is a more efficient version of the algorithm, where the roles of the template and the image are switched and, as a result, the Hessian need not be updated each iteration.

To handle partial occlusions, appearance variations and the presence of background pixels, robust versions of the template matching algorithm were proposed (e.g. Hager and Belhumeur, 1998; Ishikawa et al., 2002). The goal of the robust algorithms is to use a weighted least-squares process, such that occluded regions, background pixels and regions where brightness has changed are considered as outliers and are suppressed. In practice, robust algorithms require a trade-off between efficiency and accuracy. Namely, in (Hager and Belhumeur, 1998), the Hessian matrix depends on the outliers, while in (Ishikawa et al., 2002), the template is divided into patches, assuming a constant weight for each patch.

In this paper we introduce a novel optimization-based algorithm for histogram-based tracking which is a generalization of the Lucas–Kanade algorithm. In particular, we formulate it as an inverse compositional algorithm, which is more efficient (Baker and Matthews, 2004). We remove altogether any spatial kernel, external or local, from our histogram definition, and introduce the warping parameters directly into a feature kernel. This fact helps us to overcome the limitations of kernel-based methods mentioned above. A fast convergence rate of the optimization is achieved by using a Gauss–Newton gradient descent; by avoiding the use of a spatial kernel altogether, the structure of the kernel becomes irrelevant. Spatial information is kept by dividing the region of interest into (overlapping) patches: each patch is represented by a histogram, and the relative order of the patches maintains the spatial information. Occlusions are handled by maintaining robust statistics over the patches and suppressing patches whose individual histogram has changed too much. We present the algorithm for gray-scale histograms; however, extending the algorithm to handle higher dimensional histograms is straightforward. Our method applies to any warping transformation that forms a group, and to any smooth feature. It has the same asymptotic complexity as the original inverse compositional template matching algorithm.

The rest of the paper is organized as follows. In Section 2 we review the formulation of the Lucas–Kanade template matching tracking algorithm. In Section 3 we introduce our new histogram-tracking approach, for the 1D histogram case. We then present a robust extension using patches and robust weights. Section 4 contains implementation details and experimental results. Section 5 presents conclusions and discussion.

2. Review of the Lucas–Kanade template matching algorithm

2.1. The inverse compositional algorithm

In the rest of this paper, we follow the notation of Matthews et al. (2004) as closely as possible. Let I_n(x) stand for the nth image in a given video sequence, where x = (x, y)^T are the pixel coordinates and n = 0, 1, 2, ... is the frame number. The template T(x) is extracted from the initial frame. The warp W(x; p) takes the pixel x in the coordinate frame of the template T(x) and maps it to a sub-pixel location, W(x; p), in the coordinate frame of the image I_n(x), where p = (p_1, ..., p_k)^T is a vector of parameters. For example, if the object is roughly planar and is parallel to the image plane, translating but not rotating, one can use the following 2D image warp with three parameters, p = (S, T_x, T_y)^T:

$$W(\mathbf{x};\mathbf{p}) = S \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} T_x \\ T_y \end{pmatrix}. \qquad (1)$$
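For concreteness, the scale-plus-translation warp of Eq. (1) can be sketched as a small function. This is a hedged illustration only; the function name and the array conventions are our own, not the paper's:

```python
# Hypothetical sketch of the three-parameter warp of Eq. (1):
# W(x; p) = S * (x, y)^T + (T_x, T_y)^T, with p = (S, Tx, Ty).
import numpy as np

def warp(x, p):
    """Apply W(x; p) to a 2-vector (or, by broadcasting, an (N, 2) array)."""
    S, Tx, Ty = p
    return S * np.asarray(x, dtype=float) + np.array([Tx, Ty])

print(warp([2.0, 3.0], (1.0, 0.0, 0.0)))   # identity parameters -> [2. 3.]
print(warp([2.0, 3.0], (2.0, 1.0, -1.0)))  # scaled and shifted  -> [5. 5.]
```

Note that p = (1, 0, 0) is the identity element of this transformation group, a fact used later when the warps are composed and inverted.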

In the original Lucas–Kanade algorithm (Lucas and Kanade, 1981), the best match to the template in a new frame is found by minimizing the following SSD function, where the summation is over all pixels of the template:

$$\sum_{\mathbf{x} \in T} \left[ I_n(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \right]^2. \qquad (2)$$

The minimization is performed with respect to the warping parameters p. Due to its non-linear nature, the optimization is done by iteratively solving for increments Δp to the already known parameters p:

$$\sum_{\mathbf{x} \in T} \left[ I_n(W(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})) - T(\mathbf{x}) \right]^2. \qquad (3)$$

The inverse compositional algorithm (Baker and Matthews, 2004) is a more efficient version of the algorithm, where the roles of the template and the new image are switched. In this case, the following expression is iteratively minimized:

$$\sum_{\mathbf{x} \in T} \left[ I_n(W(\mathbf{x};\mathbf{p})) - T(W(\mathbf{x};\Delta\mathbf{p})) \right]^2. \qquad (4)$$

Performing a first order Taylor expansion on Eq. (4) gives:

$$\sum_{\mathbf{x} \in T} \left[ I_n(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) - \nabla T \frac{\partial W}{\partial \mathbf{p}} \Delta\mathbf{p} \right]^2. \qquad (5)$$

Minimizing Eq. (5) is a least-squares problem. The closed form solution is obtained by taking the partial derivative of Eq. (5) and setting it equal to zero. The solution obtained is (Baker and Matthews, 2004):

$$\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x} \in T} \left[ \nabla T \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ I_n(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \right], \qquad (6)$$

where H is the Gauss–Newton approximation to the Hessian matrix:

$$H = \sum_{\mathbf{x} \in T} \left[ \nabla T \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ \nabla T \frac{\partial W}{\partial \mathbf{p}} \right], \qquad (7)$$

which does not depend on the warping parameters and thus can be pre-computed.
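To make the structure of Eqs. (6) and (7) concrete, the following is a minimal, hypothetical sketch of the inverse compositional update for a pure-translation warp W(x; p) = x + p. The synthetic Gaussian "scene", the displacement value and the iteration count are our own illustrative assumptions; an analytic image model is used so that no interpolation is needed:

```python
# Hedged sketch of the inverse compositional loop (Eqs. (6)-(7)) for a
# translation-only warp, where dW/dp is the 2x2 identity, so the
# steepest-descent images are just the template gradients.
import numpy as np

xs, ys = np.meshgrid(np.arange(-20.0, 21.0), np.arange(-20.0, 21.0))

def blob(x, y):                        # analytic scene model (assumption)
    return np.exp(-(x**2 + y**2) / 50.0)

d_true = np.array([1.5, -0.8])         # true object displacement (assumption)
T = blob(xs, ys)                       # template, extracted at the origin
I = lambda x, y: blob(x - d_true[0], y - d_true[1])   # shifted image

# Pre-computation: steepest-descent images and the Hessian of Eq. (7).
gy, gx = np.gradient(T)                # numeric grad(T), unit pixel spacing
sd = np.stack([gx.ravel(), gy.ravel()], axis=1)       # N x 2
H_inv = np.linalg.inv(sd.T @ sd)

p = np.zeros(2)
for _ in range(20):
    err = (I(xs + p[0], ys + p[1]) - T).ravel()       # I(W(x;p)) - T(x)
    dp = H_inv @ (sd.T @ err)                         # Eq. (6)
    p = p - dp              # Eq. (22) reduces to p <- p - dp for translation
print(p)                    # converges towards d_true
```

Note the key efficiency point of the inverse compositional formulation: `sd` and `H_inv` are computed once, outside the loop.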

2.2. The iteratively updated robust algorithm

In (Ishikawa et al., 2002; Baker et al., 2004), the robust (modified weights) inverse compositional algorithm is derived by minimizing

$$\sum_{\mathbf{x} \in T} \rho\!\left( \left[ I(W(\mathbf{x};\mathbf{p})) - T(W(\mathbf{x};\Delta\mathbf{p})) \right]^2 \right), \qquad (8)$$

where ρ is a robust estimator (Huber, 1981). The weighted least-squares solution is a generalization of Eq. (6) (Baker et al., 2004):

$$\Delta\mathbf{p} = H_\rho^{-1} \sum_{\mathbf{x} \in T} \rho'\!\left( \left[ I_n(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \right]^2 \right) \left[ \nabla T \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ I_n(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \right], \qquad (9)$$

where the robust Hessian matrix is

$$H_\rho = \sum_{\mathbf{x} \in T} \rho'\!\left( \left[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \right]^2 \right) \left[ \nabla T \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ \nabla T \frac{\partial W}{\partial \mathbf{p}} \right]. \qquad (10)$$

However, the Hessian matrix contains a weighting function which is updated every iteration and thus cannot be pre-computed. The solution suggested in (Ishikawa et al., 2002) is to subdivide the template into a set of patches. Based on the spatial coherence of the outliers, the weight is assumed to be constant on each patch. Assuming the number of patches is much smaller than the number of pixels in the template, the complexity of re-computing the Hessian becomes insignificant. A different robust algorithm (the modified residuals) was proposed in (Hager and Belhumeur, 1998), which is similar to but slightly different from the inverse compositional algorithm. In this approach, the Hessian does not depend on the weighting function and need not be re-computed each iteration.

3. The generalized Lucas–Kanade algorithm for histogram-based tracking

3.1. Definition of the histogram

Previous tracking methods which consider the spatial layout of the object but avoid the use of an external global spatial kernel either model the object based on patch-histograms (e.g. Perez et al., 2002; Adam et al., 2006), or use a joint feature-spatial distribution (e.g. Elgammal et al., 2003; Yang et al., 2005), as follows:

$$h(\mathbf{x},\mathbf{u}) = \frac{1}{N} \sum_{(\mathbf{x}',\mathbf{u}')} K_\kappa(\mathbf{x}' - \mathbf{x})\, G_\sigma(\mathbf{u}' - \mathbf{u}), \qquad (11)$$

where N is the size of the sample, K_κ is a 2D kernel with bandwidth κ, and G_σ is a d-dimensional kernel with bandwidth σ. The local spatial kernel K_κ relaxes the rigidity constraint of a template matching tracker and allows for small local deformations. However, if the feature u is d-dimensional, the histogram h(x, u) has a dimension of d + 2, and the number of bins in the histogram (Eq. (11)) is equal to the number of pixels in the region multiplied by the number of feature bins. This representation is too costly compared to template matching (number of pixels) or to conventional kernel-based trackers (number of feature bins). As the binning used in the feature space is usually coarse, e.g. 16 bins for grey-scale images, the solution to reduce complexity would be to use coarser binning in the spatial domain too. However, computing a histogram for each patch individually and keeping a collection of sub-histograms is preferable over using a joint-feature histogram such as Eq. (11), as the former is more flexible, allowing the use of overlapping patches or ignoring patches which do not contain enough structure.

Given a region in the image, we define its histogram as

$$h(u) = \sum_{\mathbf{x}} G_\sigma(f(\mathbf{x}) - u), \qquad (12)$$

where the summation is over all pixels in that region, and where f(x) is a d-dimensional feature vector that characterizes the appearance within some neighbourhood of pixel x. f(x) is assumed to be a dense and smooth feature, e.g. intensity or colour. G_σ(f(x) − u) is a kernel function, differentiable with respect to the feature f(x), with bandwidth σ. In other words, it is a voting function which determines the probability that the pixel x = (x, y) belongs to the bin whose centre is located at u. When the region undergoes a warp, W(x; p), the warping affects the values of x, and as a result the values of the feature f(x) and, consequently, the value of h(u). Hence, the content of the histogram at bin u depends on the warping parameters p as well:

$$h(u;\mathbf{p}) = \sum_{\mathbf{x}} G_\sigma(f(W(\mathbf{x};\mathbf{p})) - u). \qquad (13)$$

When the region of interest is divided into K patches P_i, i = 1, 2, ..., K, the histogram for patch i is defined as

$$h_i(u;\mathbf{p}) = \sum_{\mathbf{x} \in P_i} G_\sigma(f(W(\mathbf{x};\mathbf{p})) - u). \qquad (14)$$

The histogram representing the whole region is simply a concatenation of all the patch-histograms:

$$h(u;\mathbf{p}) = \left( h_1(u;\mathbf{p}),\, h_2(u;\mathbf{p}),\, \ldots,\, h_K(u;\mathbf{p}) \right). \qquad (15)$$

For clarity's sake, we derive our new histogram-based tracker using only a one-dimensional feature u.
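The representation of Eqs. (12)–(15) can be sketched as follows. This is an illustrative assumption-laden sketch: a plain Gaussian voting kernel stands in for G_σ (the paper later uses an integrated/polynomial kernel), and the patch layout, bin centres and image size are our own:

```python
# Hypothetical sketch of the patch-histogram representation: each patch
# votes softly into m feature bins through a smooth kernel (Eq. (14)),
# and the per-patch histograms are concatenated (Eq. (15)).
import numpy as np

def soft_histogram(values, centers, sigma=7.0):
    """h(u) = sum_x G_sigma(f(x) - u) for every bin centre u (Eq. (12))."""
    d = values.reshape(-1, 1) - centers.reshape(1, -1)
    return np.exp(-0.5 * (d / sigma) ** 2).sum(axis=0)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(32, 32)).astype(float)  # toy grey image
centers = np.arange(8.0, 256.0, 16.0)      # m = 16 bins of width 16

# K = 4 non-overlapping 16x16 patches; the paper itself uses overlapping
# patches, which this layout deliberately simplifies.
patches = [frame[r:r+16, c:c+16] for r in (0, 16) for c in (0, 16)]
h = np.concatenate([soft_histogram(p.ravel(), centers) for p in patches])
print(h.shape)     # K * m concatenated bins
```

Because no spatial kernel appears anywhere, the spatial layout is carried entirely by the ordering of the patches in the concatenation.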

3.2. Histogram-based tracking using a global histogram

In the spirit of the Lucas–Kanade algorithm, our aim is to minimize the following expression, i.e. to iteratively solve for increments Δp to the already known parameters p:

$$\sum_{u=1}^{m} \left[ h_1(u) - h_2(u; W(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})) \right]^2, \qquad (16)$$

where h_1 and h_2 correspond, respectively, to the reference and the new image histograms, and the sum is over all m bins of the histograms. To render the algorithm more efficient, the following expression is minimized instead (the inverse compositional algorithm):

$$\sum_{u=1}^{m} \left[ h_1(u; W(\mathbf{x};\Delta\mathbf{p})) - h_2(u; W(\mathbf{x};\mathbf{p})) \right]^2. \qquad (17)$$

Performing a first order Taylor expansion on Eq. (17) yields:

$$\sum_{u=1}^{m} \left[ h_1(u; W(\mathbf{x};0)) + \frac{\partial h_1(u; W(\mathbf{x};0))}{\partial \mathbf{p}} \Delta\mathbf{p} - h_2(u; W(\mathbf{x};\mathbf{p})) \right]^2. \qquad (18)$$

Without loss of generality we assume that W(x; 0) is the identity warp. From Eq. (13) we find that the derivative of the reference histogram with respect to the warping parameters is

$$\frac{\partial h_1(u)}{\partial \mathbf{p}} = \sum_{\mathbf{x}} \frac{\partial G_\sigma(f(\mathbf{x}) - u)}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}}. \qquad (19)$$

The interpretation of Eq. (19) is straightforward: updating the warping parameters by p ← p + Δp changes the feature value by f(x) ← f(x) + ∇f (∂W/∂p) Δp. This, in turn, changes the contribution of pixel x to the bin u by G_σ(f(x) − u) ← G_σ(f(x) − u) + (∂G/∂f) ∇f (∂W/∂p) Δp. Hence, h_1(u) is updated according to Eq. (19).
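The chain rule of Eq. (19) can be verified numerically. In the hedged sketch below, the feature, kernel, bin centre and 1-D translation warp are all our own illustrative choices; the analytic derivative of the soft histogram is compared against a finite difference:

```python
# Numerical check of Eq. (19): dh/dp = sum_x (dG/df) * grad(f) * (dW/dp),
# for a 1-D translation warp W(x; p) = x + p, so dW/dp = 1.
import numpy as np

xs = np.linspace(-3.0, 3.0, 200)
f = lambda x: 100.0 + 40.0 * np.tanh(x)        # smooth feature (assumption)
u, sigma = 110.0, 7.0                          # bin centre and bandwidth

G  = lambda v: np.exp(-0.5 * (v / sigma) ** 2)       # feature kernel
dG = lambda v: -v / sigma**2 * G(v)                  # dG/df

def h(p):                                      # h(u; p), Eq. (13) in 1-D
    return G(f(xs + p) - u).sum()

grad_f = 40.0 / np.cosh(xs) ** 2               # analytic f'(x)
analytic = (dG(f(xs) - u) * grad_f).sum()      # Eq. (19) at p = 0
eps = 1e-5
numeric = (h(eps) - h(-eps)) / (2 * eps)       # central finite difference
print(analytic, numeric)                       # the two should agree closely
```

This is exactly the quantity that the tracker pre-computes once per template, since Eq. (19) is evaluated on the reference histogram only.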

The closed form solution to the least-squares problem of Eq. (18) is obtained similarly to Eq. (6):

$$\Delta\mathbf{p} = H^{-1} \sum_{u=1}^{m} \left[ \sum_{\mathbf{x}} \frac{\partial G}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ h_2(u; W(\mathbf{x};\mathbf{p})) - h_1(u) \right], \qquad (20)$$

where H is the Gauss–Newton approximation to the Hessian matrix, and the Jacobian ∂W/∂p is evaluated at (x; 0):

$$H = \sum_{u=1}^{m} \left[ \sum_{\mathbf{x}} \frac{\partial G}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ \sum_{\mathbf{x}} \frac{\partial G}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}} \right]. \qquad (21)$$

In this formulation the Hessian does not depend on p; it is constant across iterations and can be pre-computed. As with the inverse compositional version of the template tracking algorithm, the optimization is done with respect to Δp in each iteration, and the warp is then updated according to

$$W(\mathbf{x};\mathbf{p}) \leftarrow W(\mathbf{x};\mathbf{p}) \circ W(\mathbf{x};\Delta\mathbf{p})^{-1}. \qquad (22)$$

The inverse compositional algorithm consists of iteratively applying Eqs. (20) and (22).
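Eq. (22) requires the warps to form a group: they must be composable and invertible. For the scale-plus-translation warp of Eq. (1) these group operations have a simple closed form, sketched below (a hedged illustration; the function names and parameter packing are our own):

```python
# Hypothetical sketch of the group operations behind Eq. (22) for the
# warp W(x; p) = S*x + (Tx, Ty), with parameters p = (S, Tx, Ty).
import numpy as np

def compose(p, q):
    """Parameters of W(.; p) o W(.; q): x -> S_p*(S_q*x + T_q) + T_p."""
    Sp, Sq = p[0], q[0]
    return np.array([Sp * Sq, Sp * q[1] + p[1], Sp * q[2] + p[2]])

def invert(p):
    """Parameters of W(.; p)^-1: x -> x/S - T/S."""
    S = p[0]
    return np.array([1.0 / S, -p[1] / S, -p[2] / S])

def update(p, dp):
    """One application of Eq. (22): W(.; p) <- W(.; p) o W(.; dp)^-1."""
    return compose(p, invert(dp))

p = np.array([2.0, 3.0, -1.0])
print(update(p, p))     # composing a warp with its own inverse -> [1. 0. 0.]
```

The identity parameters (1, 0, 0) come out exactly, which is the sanity check that the parameterization really closes under composition and inversion.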

Let n, N and m designate the number of warping parameters, the number of pixels in the template and the number of bins in the histogram, respectively. Our histogram-based algorithm has some additional steps compared with the original template tracking algorithm, namely computing the feature template f(x) from the original grey-value template, computing the histograms h_1 and h_2 (O(N) complexity), and computing the histograms ∂h_1(u)/∂p (O(nN) complexity). On the other hand, other operations depend on the number of bins rather than on the number of pixels and hence are faster. Overall, the asymptotic complexity is identical, namely O(n²N) for pre-computation and O(nN + n³) per iteration.

3.3. Histogram-based tracking using patch-based histograms

In the previous section we have shown how the Lucas–Kanade template-based algorithm can be extended to histogram-based tracking, using a global histogram. Next, we extend this derivation to a robust algorithm which makes use of multiple patch-based histograms. Suppose the template is divided into K patches, where patch P_i contains N_i pixels and has the following histogram (Eq. (14)):

$$h_i(u;\mathbf{p}) = \sum_{\mathbf{x} \in P_i} G_\sigma(f(W(\mathbf{x};\mathbf{p})) - u). \qquad (23)$$

Following the same argument as in the previous section, our goal is to minimize the following SSD expression:

$$\sum_{i=1}^{K} \sum_{u=1}^{m} w(i) \left[ h_1^i(u; W(\mathbf{x};\Delta\mathbf{p})) - h_2^i(u; W(\mathbf{x};\mathbf{p})) \right]^2, \qquad (24)$$

where the left summation is over the patches and the right one is over the bins of an individual patch-histogram. Each patch P_i is weighted by w(i). The weights can be used to suppress outliers, i.e. patch-histograms which correspond to occluded regions, background regions and regions whose brightness has markedly changed. This is done by using a robust measure on all the K weights w(i).

The solution to the weighted least-squares problem of Eq. (24) is

$$\Delta\mathbf{p} = H^{-1} \sum_{i=1}^{K} w(i) \sum_{u=1}^{m} \left[ \sum_{\mathbf{x} \in P_i} \frac{\partial G}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ h_2^i(u; W(\mathbf{x};\mathbf{p})) - h_1^i(u) \right], \qquad (25)$$

where the robust Hessian, H, is given by:

$$H = \sum_{i=1}^{K} w(i) \sum_{u=1}^{m} \left[ \sum_{\mathbf{x} \in P_i} \frac{\partial G}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}} \right]^T \left[ \sum_{\mathbf{x} \in P_i} \frac{\partial G}{\partial f}\, \nabla f\, \frac{\partial W}{\partial \mathbf{p}} \right]. \qquad (26)$$

When the weights w(i) are updated every iteration, as in the modified weights template tracking algorithm of Ishikawa et al. (2002), the patch-based algorithm is almost as efficient as the non-robust algorithm (using a global histogram), since typically the number of patches K is much smaller than the number of pixels N. Our patch-based algorithm has the same asymptotic complexity as the robust template tracking algorithm of Ishikawa et al. (2002), namely O(n²N) for pre-computation and O(nN + n³ + Kn² + K log K) per iteration. The O(K log K) term appears because sorting of the weights w(i) is required. In practice, however, the distinction between inlier and outlier patches can be achieved using the median, which can be approximated in O(K) time using a multi-scale cumulative histogram (Schreiber, 2007). If the robust weights are updated only once per frame, the complexity is O(nN + n³) per iteration.

The 1D histogram-based algorithm which was derived inthis section can be easily extended to handle multi-dimen-sional features, e.g. colour, by adding an extra dimensionto the feature f, bin index u and pixel coordinates x.

4. Experiments

To demonstrate the robustness of the proposed algorithm, we present some evaluations and comparisons on different scenes, consisting of vehicle, face and pedestrian tracking. We use only grey-scale information, coarsely binned into 16 bins. The reference collection of patch-based histograms is computed from the initial template and is kept unchanged thereafter. We employ a warping transformation consisting of 2D translations and scale. The larger the patches are, the more robust the tracking is under scale change (target getting smaller). On the other hand, decreasing the size of the patches renders the algorithm more robust to partial occlusions and appearance changes. As a trade-off, the template is divided into 8 × 8 overlapping rectangular patches (utilizing a multi-scale approach is discussed in Section 5).

G_σ can be represented by a Gaussian function only when the bins are fine. However, as the gray-scale values are coarsely binned, the probability that the feature vector f(x) belongs to bin u needs to be integrated over the entire bin u:

$$G_\sigma(f(\mathbf{x}) - u; \Delta u) = \int_{u - 0.5\Delta u}^{u + 0.5\Delta u} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(u' - f(\mathbf{x}))^2}{2\sigma^2}}\, du'. \qquad (27)$$

To vote only for the two nearest neighbour bins, we take σ = 7. In this case, the maximal vote is G_σ(0) = 0.75. When f(x) coincides with a boundary of a bin, we vote for two bins only, with G_σ(±Δu/2) = 0.5. To speed up computation, we approximate G_σ by the following smooth second order polynomial, C(f(x) − u; Δu):

$$C(f(\mathbf{x}) - u; \Delta u) = \begin{cases} 0.75 - (f(\mathbf{x}) - u)^2 / \Delta u^2, & |f(\mathbf{x}) - u| \le \frac{\Delta u}{2}, \\[4pt] (f(\mathbf{x}) - u)^2 / (2\Delta u^2) - 3|f(\mathbf{x}) - u| / (2\Delta u) + 1.125, & \frac{\Delta u}{2} < |f(\mathbf{x}) - u| \le \Delta u, \\[4pt] 0, & \text{otherwise.} \end{cases} \qquad (28)$$
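The piecewise polynomial of Eq. (28) is cheap to implement and to sanity-check; a hedged sketch (the vectorized form and default bin width are our own choices):

```python
# Hedged implementation of the polynomial voting function of Eq. (28),
# the second-order approximation of the integrated Gaussian of Eq. (27).
import numpy as np

def C(d, du=16.0):
    """Vote of a feature at signed distance d = f(x) - u from bin centre u."""
    a = np.abs(d)
    return np.where(a <= du / 2, 0.75 - d**2 / du**2,
           np.where(a <= du, d**2 / (2 * du**2) - 1.5 * a / du + 1.125, 0.0))

print(C(0.0))    # maximal vote at the bin centre: 0.75
print(C(8.0))    # at the bin boundary (du/2 = 8): 0.5
```

One can check that the two branches meet continuously at |d| = Δu/2 (both give 0.5), matching the stated properties G_σ(0) = 0.75 and G_σ(±Δu/2) = 0.5.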

Fig. 1a shows G_σ(f(x) − u; Δu) and its approximation by C(f(x) − u; Δu) over a neighbourhood of 3 bins. We have chosen the following simple scheme for the extraction of the robust weights for the patch-histograms. The weights are updated only once per frame, and are kept constant during iterations. Given that the tracking in frame n yields the warping parameters p, we compute the average error for each patch-histogram:

$$\Delta h(i) = \frac{1}{m} \sum_{u=1}^{m} \left| h_1^i(u) - h_2^i(u; W(\mathbf{x};\mathbf{p})) \right|. \qquad (29)$$

The decision regarding outliers for the tracking in frame n + 1 is made by the following simple robust scheme (note that in general the weights need not be binary):

$$w(i) = \begin{cases} 1, & \text{if } \Delta h(i) \le \operatorname{median}(\Delta h) \cdot 1.4826, \\ 0, & \text{otherwise.} \end{cases} \qquad (30)$$

The factor 1.4826 is introduced for consistent estimation of the robust standard deviation in the presence of Gaussian noise (Meer et al., 2004).
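The weighting scheme of Eqs. (29) and (30) can be sketched in a few lines. The histogram values and the simulated occlusion below are purely illustrative assumptions:

```python
# Hypothetical sketch of the per-patch robust weights: a patch is kept
# only if its mean histogram error (Eq. (29)) stays below the median-based
# threshold of Eq. (30).
import numpy as np

def robust_weights(h1, h2):
    """h1, h2: K x m arrays of reference and current patch-histograms."""
    dh = np.abs(h1 - h2).mean(axis=1)                      # Eq. (29)
    return (dh <= np.median(dh) * 1.4826).astype(float)    # Eq. (30)

rng = np.random.default_rng(1)
h1 = rng.random((8, 16))        # K = 8 patches, m = 16 bins (assumption)
h2 = h1 + 0.004                 # small uniform appearance drift
h2[3] += 0.5                    # simulate one occluded patch
print(robust_weights(h1, h2))   # patch 3 is suppressed
```

Because the threshold is relative to the median error, the scheme adapts to slow global appearance drift while still rejecting patches whose histogram changes abruptly.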

In the following we present three experiments. First, we compare our histogram-based tracker with a robust template matching tracker on a sequence containing a moving vehicle occluded by a crossing pedestrian. Next, we use the “face” sequence presented in (Adam et al., 2006), where a face undergoes severe occlusions, and compare our results with those reported there for Frag-Track as well as for the mean-shift tracker. Finally, we run our tracker on the sequence “ThreePastShop2cor”, taken from the CAVIAR datasets,¹ to track a pedestrian undergoing appearance changes, occlusions and scale change, and again compare with Frag-Track.

The implementation details and the parameters mentioned above were kept identical for all the experiments. The only difference between the first example and the other two is that in the former we use stricter convergence criteria on the warping parameters. The reason is that in the first experiment we are tracking a rigid body and therefore expect better tracking accuracy. Moreover, as we compare ourselves in this example with a template matching tracker, we set the same convergence criteria for both.

1 Caviar datasets: http://groups.inf.ed.ac.uk/vision/CAVIAR/.

4.1. First example: vehicle tracking

In the first example, we compare our new tracker with a robust inverse compositional template matching algorithm, by tracking a rigid body, i.e. a vehicle. The sequence was captured from a moving vehicle with the camera looking forward. Although we track a rigid object, the task is not trivial: the area of the template drops by a factor of 2.6 during tracking; the object is relatively small (dropping to 1200 pixels); there are appearance changes due to strong reflections from the rear window; the initial bounding box contains about 25% background pixels; and during the sequence, a pedestrian crossing the road occludes up to an additional 30% of the vehicle.

First we ran a robust version of the template matching algorithm, described in (Schreiber, 2007). This version uses the drift-correcting algorithm of Matthews et al. (2004), which, in addition to the initial template, also maintains the current estimate of the template and tracks the object using both templates. In the robust extension to this algorithm proposed in (Schreiber, 2007), the robust weights assigned to the pixels in the template are adaptively updated for each new frame. When examining the results of this algorithm over the entire sequence, it became clear that they are as accurate as those extracted manually by a human. Hence the results of this algorithm are regarded as the ground truth, and the results of our new tracker are compared to it.

Some frames with the resulting tracked bounding box superimposed are shown in Fig. 2. The pedestrian can be seen approaching the vehicle, occluding it for some frames and then departing. As can be seen from Fig. 1b, the position accuracy of the bounding box relative to the ground truth is within sub-pixel accuracy, except at the critical moment of occlusion by the pedestrian, where the error rises to 2 pixels. Similarly, Fig. 1c shows the scale error relative to the ground truth, computed in percent, where a negative error means shrinking of the bounding box. Outside the pedestrian occlusion, the scale error is kept between +1% and -4%. At the moment of occlusion, the scale shrinks momentarily by 10%. The shrinking phenomenon is typical of our algorithm, and is in accordance with the discussion in (Collins, 2003), namely that the bounding box tends to shrink due to partial occlusion of relatively uniform target regions.

Fig. 1. (a) Voting function G_r(f(x) - u; Δu) (solid line) and its polynomial approximation C(f(x) - u; Δu) (dotted line), plotted over three bins (bin width Δu = 16). (b) Vehicle sequence: position error of the center of the bounding box w.r.t. the robust template tracking algorithm. (c) Vehicle sequence: scale error of the bounding box (negative error means shrinkage). (d) "Face" sequence: position error of the center of the bounding box w.r.t. the manually marked ground truth. (e) "ThreePastShop2cor" sequence: position error of the center of the bounding box w.r.t. the manually marked ground truth; our tracker (solid line) and Frag-Track (dotted line). (f) "ThreePastShop2cor" sequence: scale error of the bounding box w.r.t. the manually marked ground truth; our tracker (solid line) and Frag-Track (dotted line).

2 A. Adam’s website: http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm.

858 D. Schreiber / Pattern Recognition Letters 29 (2008) 852–861

We note that our new tracker is less accurate in terms of the target's position and scale, compared to the robust template matching algorithm. This is not surprising, as the new tracker uses a compressed representation of the initial template, i.e. a collection of patch-histograms. Moreover, it makes use neither of drift correction nor of adaptive learning of outliers. Nevertheless, it is robust to outliers and is able to track successfully without drifting away from the target.

4.2. Second example: face tracking

The following "face" sequence was taken from Adam et al. (2006) and can be found on the author's website,2 including a manually marked ground truth. A few frames with our tracking results are shown in Fig. 3. They demonstrate that our tracking method is quite robust under extreme occlusion conditions. We note that in frame 539, the scale of our bounding box has shrunk by 30%. However, in this region 2/3 of the target is occluded!

The position accuracy of the tracking w.r.t. the manually marked ground truth is plotted in Fig. 1d, and can be compared directly with the similar plot given in (Adam et al., 2006) – see Fig. 6 there – where the Frag-Track method is compared with the mean-shift tracker. Comparing our Fig. 1d with Fig. 6 of Adam et al. (2006), it can be concluded that our method clearly outperforms the mean-shift tracker. Moreover, our results are as accurate as those of the Frag-Track method. Note that the general behaviour of the two methods – ours and Frag-Track – is qualitatively similar, i.e. accuracy decreases in the most occluded frames.

Fig. 2. Frames from the vehicle sequence, where the tracker deals with occlusions caused by a crossing pedestrian, as well as with background pixels, appearance changes due to reflections and scale change.

Fig. 3. Frames from the "face" sequence, where the tracker deals with severe occlusions.

4.3. Third example: pedestrian tracking

Finally, we use the sequence "ThreePastShop2cor", taken from the CAVIAR datasets. We compare our results with those obtained in (Adam et al., 2006) and with the ground truth supplied by the CAVIAR datasets. We use the same sub-sequence and the same initial bounding box as in (Adam et al., 2006). This initial bounding box is smaller than the ground truth supplied by the CAVIAR datasets. To obtain a ground truth which fits our initial bounding box, the CAVIAR ground truth was used to compute the transformations (assuming translations and scale) between our first frame and all other frames. These transformations were applied to our initial bounding box and the results used as our ground truth.

Fig. 4 displays some frames with the tracking results. The pedestrian being tracked in this sub-sequence undergoes appearance changes, especially when by-passing the pedestrian on its left. It overlaps partially with other pedestrians. During tracking, the area of the bounding box shrinks down to 40% of its original size. In Fig. 1e and f, we plot the error of the position and scale for our method (solid line) and for Frag-Track (dotted line), w.r.t. the ground truth (negative scale error means shrinkage of the bounding box). The results show that our method competes favourably with Frag-Track. On average, Frag-Track scale errors are larger than ours by about 30%, while position errors are larger by about 45%.

Fig. 4. Frames from the sequence "ThreePastShop2cor", where the tracker deals with appearance changes, partial overlap with other pedestrians and scale change.
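The ground-truth derivation used in this example (a per-frame translation-and-scale transform estimated from the CAVIAR annotations and applied to our smaller initial box) can be sketched as follows; the (cx, cy, w, h) box representation and the function name are our own illustration, not code from the paper:

```python
def derive_ground_truth(caviar_boxes, init_box):
    """Map a CAVIAR ground-truth track onto a smaller initial box.

    Boxes are (cx, cy, w, h): centre coordinates, width, height.
    For each frame we estimate the translation of the centre and the
    per-axis scale with respect to the first frame, then apply that
    transform to our own initial bounding box.
    """
    cx0, cy0, w0, h0 = caviar_boxes[0]
    icx, icy, iw, ih = init_box
    derived = []
    for cx, cy, w, h in caviar_boxes:
        sx, sy = w / w0, h / h0          # scale between frame 0 and frame t
        dx, dy = cx - cx0, cy - cy0      # translation of the centre
        derived.append((icx + dx, icy + dy, iw * sx, ih * sy))
    return derived
```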

4.4. Performance

Our tracker was implemented in C using IPP. To demonstrate its ultra real-time performance, we use the last example, i.e. the sequence "ThreePastShop2cor". In this example the initial template size is 122 × 37, or 4514 pixels, represented by 116 overlapping 8 × 8 patches. As each patch-histogram contains 16 bins, the overall histogram has 1856 bins, or 40% of the number of pixels. The average running time per frame is 3.3 ms on a P4 3 GHz PC. In (Porikli, 2006), some processing times are presented for various state-of-the-art tracking methods, for tracking a 20 × 40 object on a similar PC. In particular, the corresponding run-times of the mean-shift tracker of Comaniciu et al. (2003) and of the particle filter tracker of Bouaynaya et al. (2005) are 12 and 25 milliseconds, respectively. The multi-kernel mean-shift tracker and the covariance tracker (Porikli, 2006) require 20 and 150 milliseconds, respectively. Clearly, our new tracker outperforms these methods. In particular, it outperforms the conventional mean-shift tracker of Comaniciu et al. (2003) by a factor of 20.
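The bin-count arithmetic quoted above can be reproduced with a short calculation. The patch strides used below (4 pixels horizontally, 8 vertically) are our guess at a layout that yields 116 overlapping 8 × 8 patches in a 122 × 37 template; the paper does not state them:

```python
def patch_histogram_budget(width, height, patch=8, stride_x=4, stride_y=8, bins=16):
    """Count patch positions, total histogram bins, and bins-per-pixel ratio."""
    nx = (width - patch) // stride_x + 1   # patch positions along x
    ny = (height - patch) // stride_y + 1  # patch positions along y
    patches = nx * ny
    total_bins = patches * bins
    return patches, total_bins, total_bins / (width * height)

patches, total_bins, ratio = patch_histogram_budget(122, 37)
# yields 116 patches and 1856 bins, i.e. roughly 40% of the 4514 template pixels
```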

5. Conclusions

In this work we present a novel approach to histogram-based tracking. It establishes a closer link between template matching and histogram-based tracking methods. This is achieved by removing the spatial kernel from the definition of the histogram, and by introducing the warping parameters directly into the feature kernel. The advantages of our approach over kernel-based methods are manifold: faster convergence; patch-histograms of the object are defined naturally within our formalism and make it possible to take spatial information into account, as well as to mask outlier regions; the weights assigned to the patches by the robust statistics scheme depend on the structure of the template rather than on the structure of the spatial kernel; and the use of multiple patches enables us to extract enough information even from gray-scale images.

We demonstrate the validity of our method by applying it to several video clips, namely vehicle, face and pedestrian tracking. Our tracking approach copes well with appearance changes, partial occlusions and scale changes. This is achieved using neither color nor background information. In terms of running time, our method outperforms recent state-of-the-art results.

We have also demonstrated that our method competes favorably with a recent tracking scheme, Frag-Track (Adam et al., 2006), which also retains spatial information by using multiple patches, but relies on an exhaustive search rather than on an optimization scheme. The advantages of the Frag-Track method over ours are that, first, it allows the use of non-analytical metrics for comparing two histograms, and second, it is less likely to get stuck in a local minimum. However, as the method relies on the use of integral histograms, the number of bins used is limited. The tracking accuracy is also limited in principle, as the search over different positions and scales is discrete. In addition, the method is limited to a transformation consisting of translations and scale.

There are several interesting topics for future work. We plan to extend the implementation and testing of our method, so as to incorporate motion models more general than translations and scale, as well as additional features, such as edge orientation histograms. The second topic concerns the instability of the target's scale due to severe occlusions or pose changes. One solution would be to use a prior on the warp parameters, by adding an extra term to the minimization scheme, as was done in (Baker et al., 2004) for the case of the template matching algorithm. The role of such a prior is to encourage the warp parameters to take small incremental values during the iterations.
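Schematically, such a regularized update could take the form below; the quadratic penalty, the weight λ and the symbols E and ρ are our illustrative notation, not the formulation of Baker et al. (2004):

```latex
\Delta p^{*} \;=\; \arg\min_{\Delta p}\;
\sum_{x} \rho\!\left( E\!\left(x;\, p \circ \Delta p\right) \right)
\;+\; \lambda \,\lVert \Delta p \rVert^{2}
```

The extra term λ‖Δp‖² penalizes large incremental updates, encouraging the warp parameters to change slowly across iterations.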

An additional topic concerns the difficulty of tracking under very large scale variations. In the context of template matching (Dedeoglu et al., 2006), it was noted that when objects appear very small in comparison to the original template, they need to be enlarged through interpolation, and this reliance on interpolation turns out to degrade performance. The corresponding problem for histogram-based tracking is that when objects appear very small in comparison to the reference region, the individual patches contain too few pixels and thus the individual patch-histograms are very noisy. A plausible solution would be to maintain a multi-scale representation of the reference object, i.e. to extract patch-histograms from sub-sampled images of the reference object, using a constant patch size. When the object appears small enough relative to the current reference resolution, the algorithm would automatically switch to another reference histogram taken from a lower resolution image.
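The proposed multi-scale switching could be sketched as follows; the halving-per-level pyramid and the switch threshold are our assumptions for illustration, not details fixed by the paper:

```python
def select_reference_level(current_area, base_area, num_levels, switch_ratio=0.5):
    """Pick which pyramid level of the reference object to track against.

    Each level halves the linear resolution of the reference image,
    so its area shrinks by a factor of 4 per level. When the tracked
    object's area falls below switch_ratio times the area represented
    by the current reference level, we move one level down the pyramid.
    """
    ratio = current_area / base_area
    level = 0
    while level + 1 < num_levels and ratio < switch_ratio / (4 ** level):
        level += 1
    return level
```

With this rule the tracker stays at full resolution until the object has shrunk appreciably, then falls back to patch-histograms extracted from the sub-sampled reference images.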

Further research topics would be to try different histogram similarity metrics to reduce quantization effects; to explore other robust schemes (e.g. iteratively adaptive least squares); and, as mentioned in (Adam et al., 2006), it would also be interesting to find a way to choose the most informative patches with respect to the tracking task (Vidal-Naquet and Ullman, 2003).

References

Adam, A., Rivlin, E., Shimshoni, I., 2006. Robust fragments-based tracking using the integral histogram. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Baker, S., Matthews, I., 2004. Lucas–Kanade 20 years on: A unifying framework. Internat. J. Comput. Vision 56 (3), 221–255.

Baker, S., Gross, R., Matthews, I., 2004. Lucas–Kanade 20 years on: A unifying framework. Technical Report CMU-RI-TR-04-14, Carnegie Mellon University Robotics Institute.

Beymer, D., McLauchlan, P.F., Coifman, B., Malik, J., 1997. A real-time computer vision system for measuring traffic parameters. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Birchfield, S., Rangarajan, S., 2005. Spatiograms vs. histograms for region-based tracking. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Bouaynaya, N., Qu, W., Schonfeld, D., 2005. An online motion-based particle filter for head tracking applications. In: Proc. IEEE Internat. Conf. on Acoustics, Speech and Signal Processing.

Bradski, G., 1998. Computer vision face tracking as a component of a perceptual user interface. In: Proc. Workshop on Applications of Computer Vision, pp. 214–219.

Collins, R., 2003. Mean shift blob tracking through scale space. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. II: 234–240.

Comaniciu, D., Ramesh, V., Meer, P., 2003. Kernel-based object tracking. IEEE Trans. Pattern Anal. Machine Intell. 25 (5), 564–575.

Dedeoglu, G., Kanade, T., Baker, S., 2006. The asymmetry of image registration and its application to face tracking. Technical Report CMU-RI-TR-06-06, Carnegie Mellon University Robotics Institute.

Elgammal, A., Duraiswami, R., Davis, L., 2003. Probabilistic tracking in joint feature-spatial spaces. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Fan, Z., Wu, Y., Yang, M., 2005. Multiple collaborative kernel tracking. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Georgescu, B., Meer, P., 2004. Point matching under large image deformations and illumination changes. IEEE Trans. Pattern Anal. Machine Intell. 26, 674–689.

Hager, G.D., Belhumeur, P.N., 1998. Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. Pattern Anal. Machine Intell. 20 (10), 1025–1039.

Hager, G., Dewan, M., Stewart, C., 2004. Multiple kernel tracking with SSD. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Huber, P.J., 1981. Robust Statistics. Wiley-Interscience.

Isard, M., Blake, A., 1998. Condensation: Conditional density propagation for visual tracking. Internat. J. Comput. Vision 29 (1), 5–28.

Ishikawa, T., Matthews, I., Baker, S., 2002. Efficient image alignment with outlier rejection. Technical Report CMU-RI-TR-02-27, Carnegie Mellon University Robotics Institute.

Lucas, B., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. In: Proc. Internat. Joint Conf. on Artificial Intelligence, pp. 674–679.

Matthews, I., Ishikawa, T., Baker, S., 2004. The template update problem. IEEE Trans. Pattern Anal. Machine Intell. 26 (6), 810–815.

Meer, P., Mintz, D., Rosenfeld, A., 1991. Robust regression methods for computer vision: A review. Internat. J. Comput. Vision 6 (1), 59–70.

Perez, P., Hue, C., Vermaak, J., Gangnet, M., 2002. Color-based probabilistic tracking. In: Proc. European Conf. on Computer Vision, pp. 661–675.

Porikli, F., 2006. Achieving real-time object detection and tracking under extreme conditions. J. Real-Time Image Process.

Schreiber, D., 2007. Robust template tracking with drift correction. Pattern Recognition Lett. 28 (12), 1483–1491.

Smith, S.M., Brady, J.M., 1995. ASSET-2: Real-time motion segmentation and shape tracking. IEEE Trans. Pattern Anal. Machine Intell. 17 (8), 814–820.

Vidal-Naquet, M., Ullman, S., 2003. Object recognition with informative features and linear classification. In: Proc. 9th IEEE Internat. Conf. on Computer Vision.

Yang, C., Duraiswami, R., Davis, L., 2005. Efficient mean-shift tracking via a new similarity measure. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

Yokoyama, M., Poggio, T., 2005. A contour-based moving object detection and tracking. In: Second Joint IEEE Internat. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

Zhao, Q., Tao, H., 2005. Object tracking using color correlogram. In: Second Joint IEEE Internat. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 263–270.