
Vis Comput
DOI 10.1007/s00371-014-0984-8

ORIGINAL ARTICLE

Robust tracking via discriminative sparse feature selection

Jin Zhan · Zhuo Su · Hefeng Wu · Xiaonan Luo

© Springer-Verlag Berlin Heidelberg 2014

Abstract In this paper, we propose a novel generative tracking approach based on discriminative sparse feature selection. The sparse features are the discriminative sparse representations of samples, which are obtained by learning a compact and discriminative dictionary. Besides the target templates, the proposed approach also incorporates close-background templates to approximate the partial variations. We learn the dictionary and a classifier together, and search for the tracking result with a maximum similarity and minimal reconstruction error criterion using the discrimination of the sparse features. In addition, we resample the close-background templates and update the dictionary in an adaptive way during tracking. Experimental results on several challenging video sequences demonstrate that the proposed approach has more favorable performance than state-of-the-art approaches.

J. Zhan · Z. Su · H. Wu · X. Luo
National Engineering Research Center of Digital Life, State-Province Joint Laboratory of Digital Home Interactive Applications, School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006, China

Z. Su (B)
Institute of Dongguan-Sun Yat-sen University, Dongguan 523000, China
e-mail: [email protected]

H. Wu
e-mail: [email protected]

X. Luo
e-mail: [email protected]

J. Zhan
Institute of Computer Sciences, Guangdong Polytechnic Normal University, Guangzhou 510665, China
e-mail: [email protected]

Keywords Object tracking · Sparse representation · Template dictionary · Discriminative sparse feature

1 Introduction

Object tracking is one of the fundamental and important problems in many computer vision tasks, including surveillance, motion estimation and human-computer interfaces. Although a large number of tracking approaches have been proposed in the past decades, designing a robust tracking algorithm is still very challenging. The main challenges are intricate situations such as occlusion, illumination changes and pose variation. In this paper, we focus on developing a more robust tracking approach in the sparse representation framework, which can handle pose variations and occlusions without significant drift.

In this paper, we model the target appearance with an extended template set which can approximate both partial appearance variations and occlusions. By learning a discriminative template dictionary, we track the target with a sparse feature selection algorithm, and we propose an adaptive update scheme to account for appearance variations of the target and background. We mainly discuss two key issues in this paper: an appearance model which handles object occlusions and pose variations, and a search strategy for finding the best tracking result in a new frame.

To model an effective target appearance, we cast the tracking problem as finding a sparse approximation over an extended template dictionary. The approaches of [1,3,7,11] model the target as a single entity and handle occlusions as a sparse noise component using trivial coefficients. However, when appearance variations occur, the trivial coefficients may have large residuals and lead to degraded tracking results. Therefore, these approaches cannot handle partial variations very well. We observe that the spatial relationship between the target and its surroundings is maintained when the target varies over a period of time, so the surrounding information can be used to approximate the target variation. Motivated by this, we introduce close-background templates sampled from the target surroundings into the sparse appearance model as negative samples. Unlike the approaches of [1,3,7,11], which have no dictionary learning process, we learn the template set of the target appearance model into a discriminative dictionary over which the sparse codes have large sparsity and discrimination. In terms of candidate target searching, the discrimination of the sparse codes and the minimum reconstruction error are combined to obtain more accurate tracking results.

Fig. 1 Main components of the proposed tracking approach. In the first frame t0, the initial template dictionary D is spanned by target templates, close-background templates and trivial templates. Matrices Q and H are the discriminative sparse codes and class labels for classification. Then, we learn the discriminative template dictionary D and a linear classifier W jointly by the LC-KSVD algorithm. In frame t, we search for the tracking result by the sparse feature selection algorithm and update the dictionary adaptively

Figure 1 shows the main procedure of the proposed tracking approach. With the prior information from the target and background in the first frame, a compact template dictionary and a linear classifier are learned jointly by the discriminative dictionary learning approach LC-KSVD [2]. The template dictionary helps to obtain the discriminative sparse representations of samples, which are used as the sparse features. We also develop a sparse feature selection algorithm to search for the tracking result by the maximum similarity and minimal reconstruction error criteria. To account for appearance changes of the target and background during tracking, we adopt an adaptive scheme to update the dictionary. Experiments show that our algorithm can deal with heavy occlusion, illumination variation, pose variations and cluttered background effectively.

The contributions of our tracking approach are summa-rized as follows:

• Model the target appearance as an extended template set with target templates, close-background templates and trivial templates, which adapts to partial appearance variations dynamically.
• Propose a sparse feature selection algorithm to search for the tracking result by learning a discriminative template dictionary.
• Present an adaptive update scheme which updates the dictionary by resampling the close-background templates during the tracking process.

2 Related work

Many tracking approaches have been proposed in the past decades, such as particle filters [5,27,32], boosting algorithms [4,17,24] and L1 trackers [1,3,11]. Generative and discriminative approaches are the two major categories used in tracking. The generative approaches formulate tracking as either finding the candidates most similar to the target model [1,16,29,30] or searching for the regions with the highest likelihood [15,18,19,22,31]. Incremental subspace models for searching are also proposed in [28,33]. The discriminative approaches formulate tracking as a classification problem that distinguishes the target from the background. Many classifiers and their variations have been proposed, such as the naive Bayesian classifier [12], boosting classifiers [4,17,24], support vector machines [25,26] and P-N learning classifiers [6,14]. Generally, discriminative models perform better when the training set is large, while generative models achieve higher generalization with limited data.

Sparse representation-based tracking algorithms have made much progress in recent years [1,3,7,8,11,13]. They often model a sample y using a sparse linear approximation over a template set. The approaches of [1,3,11] find the tracking result with minimal reconstruction error by template matching. In [1], Mei et al. use an L1 minimization algorithm, casting the tracking problem as finding the most likely patch with a sparse representation and handling partial occlusion with trivial templates:

y ≈ Tz + e = [T, I][z; e] = Dc, (1)

where T and I are the target templates and trivial templates, respectively, z is the target coefficient vector, and e is the trivial coefficient vector which indicates occlusions or noise. Mei et al. enforce nonnegativity constraints on the coefficients and propose extending the trivial templates by including negative trivial templates as well. Equation (1) can be solved via ℓ1 minimization:

arg min_c ‖y − Dc‖₂² + λ‖c‖₁. (2)
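Equation (2) is a standard ℓ1-regularized least-squares (Lasso) problem. As a minimal illustration of this step only, the sketch below solves it with a generic off-the-shelf Lasso solver rather than the dedicated L1-minimization routine used in [1]; the dictionary D and patch y are random stand-ins, and the alpha rescaling accounts for scikit-learn's objective normalization:

import numpy as np
from sklearn.linear_model import Lasso

n, h = 64, 100                      # feature dimension, number of templates
rng = np.random.default_rng(0)
D = rng.standard_normal((n, h))     # columns: target + trivial templates
D /= np.linalg.norm(D, axis=0)      # unit-norm template columns
y = rng.standard_normal(n)          # a candidate image patch (vectorized)

# sklearn's Lasso minimizes (1/(2n)) * ||y - Dc||_2^2 + alpha * ||c||_1,
# so alpha = lambda / (2n) matches Eq. (2) up to a constant factor.
lam = 0.1
solver = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
solver.fit(D, y)
c = solver.coef_                    # sparse coefficient vector [z; e]
residual = np.linalg.norm(y - D @ c)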

The assumption of [1] is that each candidate image patch is sparsely represented by a set of target and trivial templates; hence, it can be used to handle partial occlusion. Each template in T has a weight in the template update scheme. If the tracking result y is not similar to the current template set T, it will replace the least important template in T.

In [7], Wang et al. use principal component analysis (PCA) basis vectors U to represent the target templates T, and solve for the coefficients by

arg min_{z,e} ‖y − Uz − e‖₂² + λ‖e‖₁, (3)

where e should be sparse while z is not, since PCA basis vectors are orthogonal rather than coherent. They exploit the trivial coefficients for occlusion detection, and compute the occlusion ratio η to apply one of three kinds of operations: full, partial or no update.

The L1 tracker [1], the L1-based trackers [3,11] and the SRPCA tracker [7] all use two types of templates to model the target appearance: target templates T and trivial templates I. Since the target templates cannot adapt to the appearance variations of the object, some object information in the trivial residuals is not fully used, and the target templates must be updated once the object appearance changes. Although the approach of [11] reduces the number of samples needed in L1 minimization, it still has to solve many ℓ1-norm-related minimizations, which is very time consuming when updating the appearance templates. We present an effective representation that approximates partial appearance changes by the close-background templates, and develop a discriminative sparse feature selection algorithm for tracking.

3 Appearance representation and template dictionary learning

3.1 Target appearance representation

In the sparse representation framework, the object appearance is modeled by the template set, which is assembled into a holistic and over-complete initial dictionary for sparse learning. In fact, only the target of the first frame is given in tracking applications. Hence, the samples for the target templates are not sufficient, especially when the feature space is large. Many sparse tracking approaches use additional large trivial templates [1,3,7,11] or local sparse representations [8,13] to ensure an over-complete dictionary, but these often increase the computational expense.

In our approach, as noted in Sect. 1, we incorporate the close-background templates into Eq. (1) to approximate the partial appearance variations. The appearance space combines three types of templates: target templates T, close-background templates B and trivial templates I. When the template set is learned into a dictionary, a template lies in the linear subspace spanned by the basis vectors of the dictionary corresponding to T or B. We denote by RT and RB the linear vector subspaces corresponding to T and B, respectively.

For a sample in a subsequent frame, we can model it with a sparse linear approximation in two cases. First, when no target variation occurs, a candidate sample y lies in the linear subspace RT, y ≈ Tz + e, where z is the target coefficient and e is the trivial coefficient which represents occlusions or noise. Similarly, a non-candidate sample lies in the linear subspace RB, y′ ≈ Bv + e, where v is the variation coefficient which approximates the partial variations. Second, if partial variations occur, all samples lie in the linear subspace RB. Hence, a sample y can be represented as:

y ≈ Tz + Bv + e = [T, B, I][z; v; e] = Dγ, (4)


Fig. 2 The sampling process of templates. The red boxes are positive samples for the target templates T, which are sampled densely around the given target center within a radius. The blue boxes are negative samples for the close-background templates B, which are sampled randomly in an annulus around the given target

where γ is the sparse code of sample y over the dictionary D; it contains the target coefficient z, the variation coefficient v and the trivial coefficient e.

The sampling process of templates is shown in Fig. 2. The target templates are sampled at frame t0. Let l0(c) denote the center location of the target. We densely crop out samples T = {t : ‖l0(t) − l0(c)‖ < r0} ∈ R^{n×p} around the target center within a radius r0, where each t is represented by an n-dimensional feature vector and p is the number of target samples. The close-background templates are sampled at frame t0 and are resampled when the dictionary is updated. We randomly crop out samples B = {b : r1 < ‖l0(b) − l0(c)‖ < r2} ∈ R^{n×q} within an annulus around the target center, where each b is represented by an n-dimensional feature vector, r0 < r1 < r2, and q is the number of close-background samples. I is the n-order identity matrix, I ∈ R^{n×n}. The initial template dictionary D ∈ R^{n×h} (h = p + q + n) is the concatenation of the matrices T, B and I, and is used to learn a discriminative dictionary in the next section.
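The following is a minimal sketch of this sampling and assembly step. The feature-extraction helper crop_patch(frame, loc), which returns the n-dimensional feature vector of the patch centered at loc, is a hypothetical placeholder, and the unit-norm column normalization is a common sparse-coding convention rather than a detail stated above:

import numpy as np

def sample_templates(frame, center, r0, r1, r2, p, q, crop_patch):
    # Assemble the initial template dictionary D = [T, B, I] (Sect. 3.1).
    cx, cy = center

    # Target templates T: up to p samples densely cropped within radius r0.
    T = []
    for dx in range(-r0, r0 + 1):
        for dy in range(-r0, r0 + 1):
            if dx * dx + dy * dy < r0 * r0 and len(T) < p:
                T.append(crop_patch(frame, (cx + dx, cy + dy)))
    T = np.stack(T, axis=1)                          # n x p

    # Close-background templates B: q samples randomly cropped in the
    # annulus r1 < ||l0(b) - l0(c)|| < r2.
    rng = np.random.default_rng(0)
    B = []
    while len(B) < q:
        radius = rng.uniform(r1, r2)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        loc = (cx + radius * np.cos(angle), cy + radius * np.sin(angle))
        B.append(crop_patch(frame, loc))
    B = np.stack(B, axis=1)                          # n x q

    n = T.shape[0]
    I = np.eye(n)                                    # trivial templates, n x n
    D = np.hstack([T, B, I])                         # n x h, h = p + q + n
    return D / np.linalg.norm(D, axis=0)             # unit-norm columns

With r0 = 4, as in Sect. 5, the dense grid inside the radius yields about 45 target templates, matching the setting reported in the experiments.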

3.2 Discriminative dictionary learning

It is important for a sparse representation-based tracking approach to learn an effective dictionary. In most previous tracking approaches, the dictionary is the normalized template set with no learning process; hence the sparse codes do not have effective sparsity and discrimination for tracking. Recently, supervised dictionary learning techniques for sparse coding have been proposed in image analysis [34] and image classification [35], which unify dictionary learning and classifier training into a mixed reconstructive and discriminative formulation.

In this section, we apply the supervised dictionary learning method LC-KSVD [2] to the tracking application. It learns a discriminative dictionary with the supervised classification information of the first frame. More specifically, the LC-KSVD method explicitly incorporates a sparse coding error criterion and a classification performance criterion into the objective function. The objective function in Eq. (2) is extended as:

arg min_{D,W,A,γ} ‖Y − Dγ‖₂² + α‖G − Aγ‖₂² + β‖S − Wγ‖₂² + λ‖γ‖₁. (5)

The second term in Eq. (5) represents the sparse coding error, which enforces that the sparse codes γ approximate the discriminative sparse codes G, where A is a linear transformation matrix that transforms the original sparse codes γ to be most discriminative in the sparse feature space. G ∈ R^{h×h} contains the discriminative sparse codes of the initial template set for classification. In our algorithm, we label the templates T as positive samples and the templates B, I as negative samples in the first frame. Thereby, G is defined as:

G = [g₁ 0; 0 g₂], g₁ ∈ R^{p×p}, g₂ ∈ R^{(q+n)×(q+n)},

where g₁ and g₂ are all-ones matrices. The third term represents the classification error. S = [s₁, . . . , s_h] ∈ R^{2×h} is the label matrix corresponding to the initial template set, and W ∈ R^{2×h} contains the classifier parameters, where 2 is the number of classes. α and β are scalars controlling the relative contributions of the corresponding terms. To find the optimal solution for all parameters simultaneously, Eq. (5) is rewritten as:

arg min_{Dnew,γ} ‖Ynew − Dnewγ‖₂² + λ‖γ‖₁, (6)

where Y is changed to Ynew = (Yᵀ, √α Gᵀ, √β Sᵀ)ᵀ and D is changed to Dnew = (Dᵀ, √α Aᵀ, √β Wᵀ)ᵀ. Equation (6) can be optimized exactly by the K-SVD algorithm. The dictionary Dnew learned in this way generates discriminative sparse codes γ: samples of the same class have similar sparse codes, which can be used directly by a linear classifier W.
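A sketch of this stacking, under the shapes given above (Y is n×h, G is h×h, S is 2×h); the ksvd call in the usage comment is a hypothetical stand-in for any K-SVD implementation, and the √α, √β rescaling and re-normalization of the recovered blocks are omitted for brevity:

import numpy as np

def build_lcksvd_system(Y, G, S, alpha, beta):
    # Stack the three terms as in Eq. (6): Ynew = (Y^T, sqrt(a)G^T, sqrt(b)S^T)^T.
    return np.vstack([Y, np.sqrt(alpha) * G, np.sqrt(beta) * S])

def split_dictionary(D_new, n, h, num_classes=2):
    # Split the learned stacked dictionary Dnew = (D^T, sqrt(a)A^T, sqrt(b)W^T)^T
    # back into the template dictionary D, transform A and classifier W.
    D = D_new[:n, :]
    A = D_new[n:n + h, :]
    W = D_new[n + h:n + h + num_classes, :]
    return D, A, W

# Usage sketch (alpha = 4, beta = 2 per Sect. 5):
# Y_new = build_lcksvd_system(Y, G, S, alpha=4.0, beta=2.0)
# D_new, Gamma = ksvd(Y_new, num_atoms=h, sparsity=5)   # hypothetical K-SVD call
# D, A, W = split_dictionary(D_new, n, h)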

4 Tracking via discriminative sparse feature selection

4.1 The sparsity of features

In the current frame t, we sample a set Y around the target center of the previous frame, Y = {y : ‖l0(y) − l0(c)‖ < r2} ∈ R^{n×m}, where r2 is the search window size which determines the scope of the search for the tracking result. Then we compute the sparse codes Γ = [γ1, . . . , γm] over the discriminative template dictionary. The sparse codes have large sparsity and are regarded as the sparse features of the samples in our approach. We model each sparse feature γi = [zi, vi, ei]ᵀ (i = 1, . . . , m) as an additive combination of two independent components: the prime coefficient set [zi, vi]ᵀ and the trivial coefficient set [ei]ᵀ.


Fig. 3 The sparse features of a good sample y1 (red box) and a bad sample y2 (blue box). a The image of frame #300. b The template dictionary D learned by LC-KSVD. c, d The sparse features of the good and bad samples: c_1^{j1} and c_2^{j2} are the prime coefficients, where j1 and j2 are the row numbers of the prime coefficients

The prime coefficient set has only one non-zero element c_i^j, which is also the maximum value in γi, where j is the row number of the coefficient. c_i^j indicates that the sample yi has the strongest linear correlation with the column basis vector dj in D; we call dj the prime basis vector of sample yi. On the other hand, the trivial coefficients [ei]ᵀ have some residual items far smaller than c_i^j, which represent occlusions or noises. The larger the value of ei, the more serious the occlusion or noise.

Intuitively, a good target candidate can be efficiently represented by the target templates with small trivial coefficients. Figure 3 shows the sparse features of good and bad candidates. Let the basis vector of the initially given target in D be denoted as d0. The prime basis vector d_{j1} of the good candidate y1 corresponds to a target template tightly clustered around d0 and has some trivial residuals e1. For the bad target candidate y2, d_{j2} corresponds to a close-background template and has much larger residuals than e1. When occlusions occur, a good target candidate is represented by the close-background templates with small trivial residuals, while a bad candidate often has large trivial residuals.

Generally, the prime basis vector dj of a good candidate has the maximum similarity with d0 and small trivial residuals. When noticeable variations occur, dj is more likely to have the maximum similarity with the prime basis vector dp of the target in the previous frame. This suggests that searching for a good candidate should be a comprehensive measure over d0 and dp. We use the Euclidean distance to measure the similarity between two basis vectors. In the space E^k spanned by the basis vectors d1, . . . , dh, the basis vectors of the target templates cluster tightly around d0 due to their dense sampling (Fig. 4, step 3b, green circle points), while the basis vectors of the close-background templates are scattered (Fig. 4, step 3b, green star points). Hence, the problem of finding the best candidate is transformed into searching for the sample whose prime basis vector has the maximum similarity with those of the initial target and the target of the previous frame, under the constraint of the reconstruction error.

4.2 Sparse feature selection algorithm for tracking

In sparse representation-based approaches, the sample with the minimal reconstruction error is not always the best tracking result. We propose a sparse feature selection algorithm which exploits the discrimination of the sparse features to construct a maximum similarity and minimal reconstruction error criterion.

Fig. 4 The procedure of our sparse feature selection (SFS) algorithm

Figure 4 shows the main procedure of the sparse feature selection algorithm, which is divided into four steps. In the k-dimensional Euclidean space E^k, the basis vector of the initial target is denoted as d0 and that of the target of the previous frame as dp. For each sample yi with prime coefficient c_i^j in the current frame, g(i) measures the similarity between d0 and dj, and s(i) measures the similarity between dp and dj (both are Gaussian functions of the corresponding distances; see Algorithm 1). The objective function for the search result, L(i), is given as

L(i) = cs(i) + ξ exp(−‖yi − Dγi‖₂² / (2σ²)), (7)

where cs(i) is the similarity, a weighted linear combination of g(i) and s(i): cs(i) = α1 g(i) + α2 s(i), with weights α1 and α2 controlling the relative contributions of the corresponding terms. The second term in Eq. (7) reflects the influence of the minimal reconstruction error, and ξ is a regularization parameter. Thus, the best tracking result is computed by optimizing the following criterion:

i = arg max_i L(i), i = 1, . . . , m. (8)

Our algorithm is simple and effective. In addition, we define a reconstruction error threshold Te to constrain the tracking result: if the reconstruction error of the tracking result is greater than Te, we replace the tracking result with the sample that has the minimal reconstruction error in the whole sample set. The pseudocode of the SFS algorithm is given as follows:

Algorithm 1: Sparse Feature Selection (SFS) Algorithm

Input: samples in the current frame Y = [y1, . . . , ym]; sparse features Γ = [γ1, . . . , γm]; dictionary D; classifier W
Output: i, the index of the best result sample

1: for i = 1 to m do
2:   j ← the row number of the non-zero element c_i^j
3:   g(i) ← exp(−‖dj − d0‖₂² / (2σ1²))
4:   s(i) ← exp(−‖dj − dp‖₂² / (2σ1²))
5:   L(i) ← α1 g(i) + α2 s(i) + ξ exp(−‖yi − Dγi‖₂² / (2σ²))
6: end for
7: i ← arg max_i L(i)
8: if ‖yi − Dγi‖₂ > Te then
9:   i ← arg min_i ‖yi − Dγi‖₂
10: end if
11: if W(yi) = 0 and dj ∈ RB and ‖yi − Dγi‖₂ > Te then
12:   B ← resample the close-background templates
13:   D ← relearn by Eq. (6)
14: end if
15: return i
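A compact Python rendering of Algorithm 1, offered as a sketch: num_prime = p + q marks the prime rows [z; v] of each sparse feature, the parameter defaults are placeholders, and the dictionary-update trigger of lines 11-14 is left to the scheme of Sect. 4.3:

import numpy as np

def sfs_track(Y, Gamma, D, d0, dp, num_prime,
              alpha1=0.7, alpha2=0.3, xi=1.0, sigma1=1.0, sigma=1.0, Te=1.0):
    # Y: n x m samples, Gamma: h x m sparse features, D: n x h dictionary,
    # d0/dp: prime basis vectors of the initial / previous-frame target.
    m = Y.shape[1]
    L = np.empty(m)
    rec = np.empty(m)
    for i in range(m):
        # Prime coefficient: the maximum entry among the prime rows [z; v].
        j = int(np.argmax(np.abs(Gamma[:num_prime, i])))
        dj = D[:, j]
        g = np.exp(-np.linalg.norm(dj - d0) ** 2 / (2.0 * sigma1 ** 2))
        s = np.exp(-np.linalg.norm(dj - dp) ** 2 / (2.0 * sigma1 ** 2))
        rec[i] = np.linalg.norm(Y[:, i] - D @ Gamma[:, i])
        L[i] = alpha1 * g + alpha2 * s + xi * np.exp(-rec[i] ** 2 / (2.0 * sigma ** 2))
    best = int(np.argmax(L))
    if rec[best] > Te:
        # Fall back to the sample with the minimal reconstruction error.
        best = int(np.argmin(rec))
    return best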

4.3 Adaptive update for dictionary

A fixed template dictionary is not sufficient to handle the variations, yet frequent updates accumulate tracking errors and lead to drift. In our approach, the variations of the target are approximated by the close-background templates sampled from the surroundings of the target. Hence, once the surroundings change greatly, the close-background templates can no longer approximate the variations and need to be updated. We adopt an update scheme which resamples the close-background templates and updates the dictionary adaptively.

In the first frame, a linear classifier W is trained by labeling the target templates T as positives and the close-background templates B as negatives. Since the linear classifier W does not adapt to nonlinear changes of the target, we combine the classification and the sparse decomposition of the tracking result to trigger the update event. In the current frame, let the tracking result be yi with sparse code γi, prime coefficient c_i^j and prime basis vector dj. If W(yi) = 0, dj ∈ RB and ‖yi − Dγi‖₂ > Te (the reconstruction error is greater than Te), a noticeable appearance variation is occurring along with a greatly changing background, and we need to update the dictionary.

The update is carried out in two steps. In the first step, we resample the close-background templates in the current frame. The target templates are kept fixed from the first frame to reduce the error introduced during the tracking process. To maintain the same dictionary dimension, the number of resampled close-background templates is the same as the initial number. In the second step, we update the dictionary D and the classifier W by relearning the dictionary with LC-KSVD. We compare the tracking results of no-update tracking and tracking with dictionary update in our sparse framework in Sect. 5.2.

5 Experiments and performance

We evaluate our tracking algorithm in MATLAB on an Intel(R) Core(TM) 3.09 GHz machine with 3.42 GB RAM. In the experiments, the target sample radius r0 is 4 pixels and the number of target templates is 45. The annular radius r1 = 8 pixels. The parameter of the Haar-like feature method is 150, and 150 trivial templates are used. In Eq. (6), the sparsity prior parameter is 5, the number of iterations is 25, and the weights α for the label constraint term and β for the classification error term are 4 and 2, respectively. The values of the search window size r2 for the different video sequences are shown in Fig. 8.

The weights α1 and α2 of the similarity control the relative importance of the target of the first frame and the target of the previous frame. α1 < α2 indicates that the candidate sample is selected with more relevance to the target of the previous frame; conversely, α1 > α2 indicates more relevance to the target of the first frame; and α1 = α2 indicates the average similarity of the candidate sample. Here, we show the difference of the three weight settings on the DavidIndoor sequence by the center location errors, i.e., the Euclidean distance between the center locations of the tracking result and the ground truth of each frame.

Fig. 5 The center location errors of three weight settings for similarity on the DavidIndoor sequence (ACLE: average center location errors)

In Fig. 5, the tracker handles the large variations occurring at frame #145 better when α1 < α2, but it struggles to find the target again when the target recovers at frame #200; meanwhile, once the tracking result of a previous frame has an error, the error accumulates and the tracker soon drifts. The setting α1 = α2 performs better than α1 < α2 in terms of average center location error (ACLE), but slightly worse than α1 > α2; the setting α1 > α2 gives the best overall results among the three cases. In our experiments, we found it hard to achieve the expected tracking performance with α1 < α2; therefore, a suitable solution is to select the setting with the best overall performance. We set α1 = 0.7 and α2 = 0.3 in our experiments.

5.1 Discriminative power analysis

In our experiments, we found that the sample with only the minimal reconstruction error is not always the best tracking result under intricate target variations. It is worthwhile to evaluate not only the reconstruction error but also the discriminative power of a sparse dictionary learning approach. In this section, we adopt the contrast measure of [13] to compare the discriminative power of LC-KSVD [2] and K-SVD [23]. The positives (target samples) are extracted around the target within a radius r0, and the same number of negatives (background samples) is extracted randomly outside the radius r0 of the target. The dictionary is trained using the samples in the first frame.

Let Y+ and Y− indicate the sets of positives and negatives, respectively; N is the number of positives. The reconstruction error is measured by

E(Y) = (1/N) Σ_{i=1}^{N} ‖yi − Dγi‖. (9)

The difference between E(Y+) and E(Y−) reflects the discriminative power of the learned dictionary: the larger the difference |E(Y+) − E(Y−)|, the stronger the discriminative power. In Fig. 6a, the dictionary learned by K-SVD has a smaller reconstruction error than that learned by LC-KSVD. However, Fig. 6b shows that the dictionary learned by LC-KSVD has much higher discriminative power than K-SVD. That is, the LC-KSVD method has a relatively larger reconstruction error but stronger discriminative power, which helps to generate the discriminative sparse codes for subsequent tracking. The discriminative power achieves relatively large values when the number of positives is between 25 and 70; it decreases as more positives are added, because the added samples contain trivial and background content and contribute to the reconstruction error.
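A sketch of this contrast measure: sparse-code both sample sets over a learned dictionary and compare the mean reconstruction errors of Eq. (9). Here sparse_code(D, y) is a placeholder for any ℓ1 coder, such as the Lasso call sketched in Sect. 2:

import numpy as np

def mean_reconstruction_error(Y, D, sparse_code):
    # E(Y) of Eq. (9): average ||y_i - D gamma_i|| over the sample set Y (n x N).
    errs = [np.linalg.norm(Y[:, i] - D @ sparse_code(D, Y[:, i]))
            for i in range(Y.shape[1])]
    return float(np.mean(errs))

def discriminative_power(Y_pos, Y_neg, D, sparse_code):
    # |E(Y+) - E(Y-)|: a larger gap means stronger discriminative power.
    return abs(mean_reconstruction_error(Y_pos, D, sparse_code)
               - mean_reconstruction_error(Y_neg, D, sparse_code))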

5.2 Comparative analysis of update scheme

We compare the center location errors of the no-update tracking approach and the approach with dictionary update in the proposed framework on the Singer1 and DavidOutdoor sequences in Fig. 7. Both sequences have greatly changing backgrounds. In Fig. 7a, compared with the no-update approach, the center location errors of the updated approach are slightly higher in some frames, but drop considerably from frame #165 to #300, meaning the centers of the updated tracking results are more accurate. In Fig. 7b, the no-update tracker loses the target and drifts away at frame #165, while the updated tracker keeps following the target with lower center location errors. The ACLE of the updated tracking results is lower on both sequences. This suggests that the dictionary update scheme helps to find the best tracking target when the background changes.

Fig. 6 The comparison of reconstruction error and discriminative power between the LC-KSVD and K-SVD methods on the Lemming sequence

Fig. 7 The center location errors of the no-update tracking approach and the tracking approach with dictionary update in the proposed framework on the Singer1 and DavidOutdoor sequences

5.3 Qualitative evaluation

In this work, we evaluate our method against 8 state-of-the-art approaches on 12 challenging sequences which are publicly available from the Tracker Benchmark v1.0 [21]. We compare with the Frag tracker [16], the OAB tracker [17], the MIL tracker [4], the multi-feature decomposition-based VTD tracker [9], the P-N learning-based TLD tracker [6], the local sparse appearance model-based LSK tracker [13], the L1-APG tracker [3] and the random projection-based compressive CT tracker [10]. The video sequences cover challenging factors including occlusions (e.g., FaceOcclu1, FaceOcclu2), illumination and scale variations (e.g., Singer1, Shaking and DavidIndoor), pose variations (e.g., Basketball, Singer2, Bolt, DavidOutdoor and Crossing), and background clutter and fast motion (e.g., Lemming and Deer).

Heavy occlusion

Figure 8a, b shows the tracking results of the different algorithms on the occlusion sequences, which contain noticeable occlusions. In the FaceOcclu1 sequence, the proposed, L1-APG, Frag and VTD trackers perform better; the proposed and L1-APG trackers handle occlusion using sparse representation with trivial templates. In the FaceOcclu2 sequence, VTD performs best when in-plane rotation occurs, and the MIL, CT and proposed trackers also perform well. The L1-APG tracker performs poorly because it only takes new image observations for updates without factoring out the occlusion.

Illumination and scale change

Figure 8c-e shows the results on challenging sequences with significant illumination change, scale change and partial appearance variations. In the Singer1 sequence, illumination and scale change are the main challenges. The TLD, OAB and VTD trackers perform best. The proposed, MIL and Frag trackers do not adapt to scale or in-plane rotation and drift to some extent. Although L1-APG tracks the object center effectively, it cannot find the right extent of the object in many cases. In the Shaking sequence, drastic illumination change and pose change make tracking difficult. The VTD, proposed and MIL trackers perform well. The L1-APG tracker drifts away at frame #300, whereas the other methods drift away at frame #59 when the drastic illumination variation occurs. The TLD tracker is able to track the object in some frames and can re-locate the object by its detection mechanism when drift occurs. The proposed tracker uses generalized Haar-like features for object representation so as to be robust to illumination variation; likewise, the VTD tracker uses multiple features to model the object appearance under illumination variation. The DavidIndoor sequence contains illumination and appearance variations caused by scale, pose and camera motion. The TLD, CT and LSK trackers perform better. The L1-APG tracker does not perform well because its small set of target templates cannot represent the variations effectively (e.g., #144). The proposed tracker can track the object but drifts in some frames, as it does not account for scale change.

Fig. 8 Qualitative evaluation results on 12 challenging sequences. In our approach, r2 is the search window size which determines the scope of searching samples in the subsequent frame

Pose variations

Figure 8f-j shows the tracking results on challenging sequences with drastic pose variations, motion blur and scale variation. The proposed tracker performs well on these five sequences, which can be attributed to the fact that the pose changes of the object are well approximated by the extended template subspace.


Table 1 The average overlap rate (higher is better) on 12 challenging sequences

Video clip     Frag [16]   OAB [17]   MIL [4]   VTD [9]   TLD [6]   LSK [13]   L1-APG [3]   CT [10]   Ours
Occlusion1     0.68        0.18       0.54      0.70      0.54      0.50       0.76         0.69      0.74
Occlusion2     0.49        0.52       0.65      0.69      0.55      0.60       0.42         0.55      0.71
Singer1        0.23        0.69       0.34      0.45      0.73      0.50       0.25         0.23      0.36
Shaking        0.11        0.02       0.59      0.75      0.13      0.03       0.24         0.04      0.65
DavidIndoor    0.09        0.35       0.38      0.50      0.70      0.56       0.24         0.49      0.44
Basketball     0.65        0.05       0.23      0.03      0.06      0.02       0.06         0.24      0.78
Singer2        0.19        0.04       0.04      0.04      0.03      0.03       0.02         0.03      0.68
Bolt           0.02        0.02       0.01      0.02      0.01      0.30       0.02         0.02      0.52
DavidOutdoor   0.49        0.29       0.50      0.35      0.08      0.12       0.23         0.24      0.65
Crossing       0.29        0.65       0.74      0.31      0.43      0.04       0.63         0.58      0.70
Lemming        0.60        0.68       0.70      0.46      0.67      0.59       0.32         0.68      0.71
Deer           0.08        0.62       0.37      0.06      0.58      0.12       0.05         0.03      0.43

The bold data are the best values in the comparison

Fig. 9 The center location errors versus frame number on the 12 challenging sequences


Table 2 The average center location errors (lower is better) on 12 challenging sequences

Video clip     Frag [16]   OAB [17]   MIL [4]   VTD [9]   TLD [6]   LSK [13]   L1-APG [3]   CT [10]   Ours
FaceOcclu1     19.58       89.35      35.19     17.94     29.24     33.00      14.65        21.99     17.13
FaceOcclu2     39.83       27.61      17.06     10.72     17.96     15.22      12.62        22.57     11.66
Singer1        56.03       15.58      22.34     10.36     5.69      18.11      5.79         33.61     15.82
Shaking        178.08      217.93     14.38     6.71      198.45    110.12     36.33        162.87    11.16
DavidIndoor    99.62       27.46      21.22     15.39     7.32      10.03      54.49        10.12     14.42
Basketball     11.76       141.02     106.34    177.21    166.89    146.13     171.40       89.42     5.73
Singer2        97.11       188.13     168.81    184.25    219.11    187.04     197.38       181.99    12.43
Bolt           333.81      322.98     387.03    382.84    222.03    111.47     397.82       372.36    16.93
DavidOutdoor   60.78       90.24      33.44     68.08     196.83    227.13     142.69       103.84    13.45
Crossing       38.81       5.09       2.64      41.62     23.38     88.69      2.34         8.34      5.33
Lemming        16.55       10.65      9.67      77.68     10.01     20.11      138.74       11.43     10.13
Deer           98.93       14.80      57.38     212.68    59.09     176.84     213.14       243.52    27.07

The bold data are the best values in the comparison

For the Basketball, Singer2 and Bolt sequences, it is hard to handle the drastic pose variations. The OAB, TLD and VTD trackers do not perform well when the target has noticeable pose variation (e.g., Basketball #64 and Bolt #14). In the DavidOutdoor sequence, pose variations and partial occlusions are the challenges; the proposed tracker performs better when the partial occlusions occur, while the others drift away. In the Crossing sequence, the L1-APG, MIL, OAB, CT and proposed trackers perform better than the others. The LSK tracker performs poorly because its local appearance model cannot represent the target appropriately.

Cluttered background and fast motion

In Fig. 8k, l, motion blur caused by fast motion of the target in a cluttered background is the main challenge. In addition, the Lemming sequence contains object occlusion and scale change with fast motion. The L1-APG and VTD trackers perform poorly when fast motion occurs (e.g., #236), while the others perform well. In the Deer sequence, the OAB tracker performs well, and TLD loses the target and re-locates it after some frames.

Discussion

Our tracker can handle occlusions, pose variations, illumination change and cluttered background effectively, especially when pose variations occur. Occlusions are represented by the trivial templates, and partial pose variations are approximated by the close-background templates with some trivial residuals. However, the proposed tracker cannot handle scale change, so it sometimes accumulates error and drifts in some frames (e.g., Fig. 8c, Singer1 #295 and Fig. 8e, DavidIndoor #153).

5.4 Quantitative evaluation

Quantitative comparison on test video sequences requires computing the difference between the predicted results and the ground truth. Several evaluation frameworks and benchmarks have been proposed. Pang et al. [20] proposed a PageRank-like approach to objectively evaluate various visual trackers. Wu et al. [21] built a large benchmark on visual tracking to evaluate the performance of trackers. We use the average overlap rate, center location errors, average center location errors and precision plots for a more comprehensive quantitative comparison.

Given the tracking result R of each frame and the corresponding ground truth G, the overlap rate is defined as olr = area(R ∩ G) / area(R ∪ G). An object is regarded as successfully tracked when the overlap rate is above 0.5. Table 1 summarizes the average overlap rate of the compared approaches on the challenging sequences. The overlap rate indicates the stability of each approach, as it takes the size and pose of the target object into account, but it does not precisely compare approaches that do and do not return an estimated scale.
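The overlap rate is the standard intersection-over-union of axis-aligned boxes; a minimal sketch, with boxes given as (x, y, w, h) tuples:

def overlap_rate(r, g):
    # olr = area(R intersect G) / area(R union G) for boxes (x, y, w, h).
    x1 = max(r[0], g[0])
    y1 = max(r[1], g[1])
    x2 = min(r[0] + r[2], g[0] + g[2])
    y2 = min(r[1] + r[3], g[1] + g[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = r[2] * r[3] + g[2] * g[3] - inter
    return inter / union if union > 0 else 0.0

# A frame counts as successfully tracked when overlap_rate(...) > 0.5.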

In addition, the center location errors are widely used to summarize the overall performance. As mentioned in [4,21], when a tracker loses the target in some frames, the average error value may not measure the tracking performance correctly. Here, we use half of the diagonal image size (0.5 · √(iw · ih), where iw and ih are the image width and height, respectively) in such cases to calculate the average CLE. Figure 9 shows the CLE of each approach on the 12 sequences; the smaller the center location errors, the more accurate the tracking. Table 2 summarizes the results in terms of the ACLE of each approach on the 12 sequences, where the bold data are the best values in the comparison.


Fig. 10 The precision plots by center location error evaluation. The location error threshold is 20 pixels, which roughly corresponds to at least a 50% overlap between the tracker and the ground truth [4]

We also include the precision plot [4,21], which shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth. We chose the threshold 20 for each tracker to represent the tracking precision. Figure 10 shows the precision plots of the tracked object locations. The results show that our approach performs better on video sequences with occlusions, illumination change, pose variations and cluttered background, especially on some challenging sequences (e.g., Shaking, Basketball, Bolt). The partial pose variations can be approximated by the close-background templates in the proposed approach whether the target is occluded or not, thereby further improving the tracking results in terms of both overlap rate and center location error. However, our approach cannot handle scale change and thus has relatively lower overlap rates (e.g., on the Singer1 and DavidIndoor sequences). We also note that our approach has lower precision on fast-moving targets (e.g., Deer).
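A minimal sketch of the two center-location metrics used here, with per-frame centers given as (x, y) pairs:

import numpy as np

def center_location_errors(pred_centers, gt_centers):
    # Per-frame Euclidean distance between predicted and ground-truth centers.
    return np.linalg.norm(np.asarray(pred_centers, dtype=float)
                          - np.asarray(gt_centers, dtype=float), axis=1)

def precision_at(errors, threshold=20.0):
    # Fraction of frames with CLE below the threshold (20 px in Fig. 10).
    return float(np.mean(np.asarray(errors) <= threshold))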

6 Conclusion

In this paper, we propose a novel tracking approach based on learning a compact and discriminative template dictionary. We incorporate close-background templates into the template set, along with target templates and trivial templates, to approximate the partial appearance variations. Each target candidate in a new frame is sparsely represented over the template set. In addition, we develop a sparse feature selection algorithm which exploits the discrimination of the sparse features to construct a maximum similarity and minimal reconstruction error criterion. Furthermore, we adopt an update scheme that adaptively updates the dictionary by resampling the close-background templates.

The proposed approach finds the tracking result more accurately and robustly. We compare the quantitative performance with eight state-of-the-art approaches on various challenging video sequences. The results show that our approach can handle occlusions, pose variations, illumination change and cluttered background effectively. In the future, we plan to develop a more efficient algorithm to handle scale change of the target object. We also plan to integrate multiple features to describe objects better and to utilize more effective classifiers for object tracking.

Acknowledgments This research is supported by the NSFC-Guangdong Joint Fund (No. U1135005, U1201252), the National Natural Science Foundation of China (No. 61320106008), and the Science and Technology Planning Project of Guangdong Province (No. 2012B010900009, 2012B010900089).

References

1. Mei, X., Ling, H.: Robust visual tracking using L1 minimization. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1436-1443 (2009)

2. Jiang, Z., Lin, Z., Davis, L.S.: Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1697-1704 (2011)

3. Bao, C., Wu, Y., Ling, H., Ji, H.: Real time robust L1 tracker using accelerated proximal gradient approach. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1830-1837 (2012)

4. Babenko, B., Yang, M., Belongie, S.: Visual tracking with online multiple instance learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 983-990 (2009)

5. Wu, H., Li, G., Luo, X.: Weighted attentional blocks for probabilistic object tracking. Vis. Comput. 30(2), 229-243 (2014)

6. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409-1422 (2012)

7. Wang, D., Lu, H.C., Yang, M.H.: Online object tracking with sparse prototypes. IEEE Trans. Image Process. 22(1), 314-325 (2013)

8. Jia, X., Lu, H.C., Yang, M.H.: Visual tracking via adaptive structural local sparse appearance model. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1822-1829 (2012)

9. Kwon, J., Lee, K.: Visual tracking decomposition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1269-1276 (2010)

10. Zhang, K., Zhang, L., Yang, M.: Real-time compressive tracking. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 864-877 (2012)

11. Mei, X., Ling, H., Wu, Y., Blasch, E., Bai, L.: Minimum error bounded efficient L1 tracker with occlusion detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1257-1264 (2011)

12. Quan, W., Chen, J.X., Yu, N.Y.: Robust object tracking using enhanced random ferns. Vis. Comput. 30(4), 351-358 (2014)

13. Liu, B., Huang, J., Yang, L., Kulikowski, C.: Robust tracking using local sparse appearance model and K-selection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1313-1320 (2011)

14. Kalal, Z., Matas, J., Mikolajczyk, K.: P-N learning: bootstrapping binary classifiers by structural constraints. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 49-56 (2010)

15. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 564-575 (2003)

16. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 798-805 (2006)

17. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proceedings of British Machine Vision Conference (BMVC), pp. 47-56 (2006)

18. Black, M., Jepson, A.: EigenTracking: robust matching and tracking of articulated objects using a view-based representation. Int. J. Comput. Vis. 26(1), 63-84 (1998)

19. Matthews, L., Ishikawa, T., Baker, S.: The template update problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 810-815 (2004)

20. Pang, Y., Ling, H.: Finding the best from the second bests—inhibiting subjective bias in evaluation of visual tracking algorithms. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 2784-2791 (2013)

21. Wu, Y., Lim, J., Yang, M.: Online object tracking: a benchmark. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2411-2418 (2013)

22. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Trans. Image Process. 13(11), 1491-1506 (2004)

23. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311-4322 (2006)

24. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised on-line boosting for robust tracking. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 234-247 (2008)

25. Avidan, S.: Support vector tracking. IEEE Trans. Pattern Anal. Mach. Intell. 26(8), 1064-1072 (2004)

26. Tang, F., Brennan, S., Zhao, Q., Tao, H.: Co-tracking using semi-supervised support vector machines. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1-8 (2007)

27. Perez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 661-675 (2002)

28. Hu, W., Li, X., Zhang, X., Shi, X., Maybank, S.J., Zhang, Z.: Incremental tensor subspace learning and its applications to foreground segmentation and tracking. Int. J. Comput. Vis. 91(3), 303-327 (2011)

29. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on Lie algebra. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 728-735 (2006)

30. Xue, M., Zhou, S.K., Porikli, F.: Probabilistic visual tracking via robust template matching and incremental subspace update. In: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), pp. 1818-1821 (2007)

31. Matthews, I., Baker, S.: Active appearance models revisited. Int. J. Comput. Vis. 60(2), 135-164 (2004)

32. Li, Y., Ai, H., Yamashita, T., Lao, S., Kawade, M.: Tracking in low frame rate video: a cascade particle filter with discriminative observers of different life spans. IEEE Trans. Pattern Anal. Mach. Intell. 30(10), 1728-1740 (2008)

33. Ross, D., Lim, J., Lin, R.-S., Yang, M.-H.: Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77(1), 125-141 (2008)

34. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Discriminative learned dictionaries for local image analysis. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8 (2008)

35. Zhang, Q., Li, B.: Discriminative K-SVD for dictionary learning in face recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2691-2698 (2010)


Jin Zhan is a Ph.D. candidate of the National Engineering Research Center of Digital Life, School of Information Science and Technology, Sun Yat-sen University. She is also a teacher at Guangdong Polytechnic Normal University. She received her master's degree from the School of Software, Sun Yat-sen University, in 2004. Her research interests include computer vision, video processing, machine learning and digital home applications. Contact her at [email protected].

Zhuo Su is a Ph.D. candidate of the National Engineering Research Center of Digital Life, School of Information Science and Technology, Sun Yat-sen University. He received his master's and bachelor's degrees in software engineering from the School of Software, Sun Yat-sen University, in 2010 and 2008. His research interests include image processing and analysis, computer vision, and computer graphics. Contact him at [email protected].

Hefeng Wu received his B.S. and Ph.D. degrees from Sun Yat-sen University in 2008 and 2013, respectively. He is currently a Postdoctoral Research Scholar in the School of Information Science and Technology, Sun Yat-sen University, and is also a member of the National Engineering Research Center of Digital Life. His research interests include image/video analysis, computer vision, and machine learning. Contact him at [email protected].

Xiaonan Luo is a professor of the School of Information Science and Technology, Sun Yat-sen University. He is the director of the National Engineering Research Center of Digital Life and the director of the Digital Home Standards Committee on Interactive Applications of the China Electronics Standardization Association. He won the National Science Fund for Distinguished Young Scholars granted by the National Natural Science Foundation of China. His research interests include digital home technology, mobile computing, computer graphics & CAD, and 3D CAD.
