
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 3, MARCH 2014 361

Multihuman Tracking Based on a Spatial–Temporal Appearance Match

Yuan Shen and Zhenjiang Miao, Member, IEEE

Abstract: In this paper, we focus on the improvement of appearance representation for multihuman tracking. Many previous methods extracted low-level appearance features, such as color histogram and texture, even combined with spatial information, for each frame. These methods ignore the temporal distribution of features. The features of each frame may not be stable due to illumination, human pose variation, and image noise. In order to improve this, we propose a novel appearance representation called the spatial–temporal appearance model, based on the statistical distribution of a Gaussian mixture model (GMM). It represents the appearance of a tracklet as a whole, with dynamic spatial and temporal information. The spatial information is the dynamic subregions. The temporal information is the dynamic duration time of each subregion. Each subregion is modeled as a weighted Gaussian distribution of the GMM. The online expectation-maximization (online EM) algorithm is used to estimate the parameters of the GMM. Then, we propose a tracklet association method using Bayesian prediction and the Jensen–Shannon divergence. The Bayesian prediction is used to predict the locations of targets. The Jensen–Shannon divergence is used to compute the distance of the spatial–temporal appearance distribution between two tracklets. Finally, we test our approach on four challenging datasets (TRECVID, CAVIAR, ETH, and EPFL Terrace) and achieve good results.

Index Terms: Jensen–Shannon divergence, multihuman tracking, online EM, spatial–temporal appearance.

    I. Introduction

MULTIHUMAN tracking in complex environments has become more and more important in the field of computer vision research. It has many applications, such as video-based surveillance and human–computer interaction. Its aim is to locate targets, retrieve their trajectories, and maintain their identities through a video sequence. The main challenging problem is the frequent occlusion of targets in crowded scenes.

Manuscript received November 7, 2012; revised February 23, 2013, May 26, 2013, and July 12, 2013; accepted August 2, 2013. Date of publication August 29, 2013; date of current version March 4, 2014. This work was supported in part by NSFC 61273274 and NSFB 4123104, in part by the 973 Program 2011CB302203, in part by the National Key Technology Research and Development Program of China under Grant 2012BAH01F03, in part by the Ph.D. Programs Foundation of Ministry of Education of China under Grant 20100009110004, and in part by the Tsinghua-Tencent Joint Laboratory for IIT. This paper was recommended by Associate Editor C. Shan.

The authors are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2013.2280073

In order to solve the challenging problem of human tracking, the classical tracking methods mainly follow the framework based on the particle filter. However, it is difficult to track targets through long full occlusions in crowded scenes, since there are no observations to guide the trackers. In recent years, due to the improvements of human detection performance, tracking by global association has become more and more popular. This scheme has a general framework to track targets. It links detection responses in consecutive frames to build tracklets, which are short tracks for further analysis. Then, an association algorithm is used to associate tracklets for the final tracking results. By considering the information from future frames, some detection errors, such as missed detections and false alarms, can be corrected, and long full occlusions can also be handled.

Most of the global association methods fuse several features as the affinity measurement [1]–[3], such as appearance, motion, position, and size. They usually use filter-based methods to extract motion, position, and size, but still use low-level image features, such as color histogram and texture, to represent the entire human appearance. Some appearance features with spatial information have been proposed to improve low-level image features [4], [5]. Though spatial information can improve low-level image features under partial occlusions, the state-of-the-art methods ignore an important case: full occlusions. The existing methods always use the latest appearance model, learned by online update, to track targets, and gradually discard the earlier appearance. When a target is fully occluded for a long time and then reappears, it is difficult to estimate whether its appearance is more similar to the latest or to the earlier appearance model. If the target appearance is more similar to the earlier appearance model, a tracker that measures the similarity before and after the occlusion with the latest appearance model would drift. A simple remedy is to collect the latest and earlier appearance features of a target into a set. With the increase of frame steps, however, there would be a large number of feature samples, and it is time consuming to search for the most similar appearance features in such a large sample space. In order to record the latest and earlier appearance features, and to maintain their spatial and temporal information, including their spatial layout and temporal order, we need to explore a new appearance representation.

Based on the above motivation, in this paper, we propose a new appearance model called spatial–temporal appearance in the field of multihuman tracking and use this new appearance model to track multiple targets.



We still use the tracking framework of global association. For each tracklet, we perform an auto-clustering method based on the online expectation-maximization (online EM) algorithm to cluster the appearance features in space and time. We adopt the RGB color space to represent human appearance. Pixels with similar colors are clustered into the same class in space and time. We apply the Gaussian mixture model (GMM) to represent the classes of the tracklet. Each class is a subregion of the appearance and is modeled by a Gaussian distribution that represents the color distribution, dynamic spatial layout, and duration time of the subregion. The appearance of each tracklet is represented as a spatial–temporal statistical distribution. Based on this distribution of appearance, we can obtain a stable appearance feature that is not disturbed by pose, illumination variation, image noise, and so on.

To the best of our knowledge, the main contribution of this paper is the novel appearance representation called spatial–temporal appearance. It not only records the dynamic spatial layout of the appearance, but also maintains the dynamic duration time of each subregion of the appearance, covering both the latest and the earlier frames of the whole tracklet. This appearance model can provide more information for tracklet association than that of previous methods. In order to associate tracklets for the final tracking results using this spatial–temporal appearance model, we propose a tracklet association method using Bayesian prediction with a fuzzy search range, and use the Jensen–Shannon divergence to compute the similarity of the spatial–temporal appearance.

The rest of the paper is organized as follows. Related work is discussed in Section II. The overview of our approach is given in Section III. The spatial–temporal appearance and tracklet association are presented in Section IV. Section V shows some implementation details. The experimental results and discussions are shown in Section VI. Some conclusions are given in Section VII.

    II. Related Work

Object tracking has been a hot research field in computer vision for many years, and many methods have been proposed. The early works are multihypothesis tracking (MHT) [6] and joint probabilistic data association filters (JPDAF) [7]. MHT enumerated all possible hypotheses of the target and selected the most likely hypothesis as its optimal solution. With the increase of the number of targets and time steps, the original MHT method encounters difficulties in the computational cost of the hypotheses. The JPDAF method maintained a joint probability among tracking targets in each frame. When new targets enter the field of camera view or old targets leave the view, the joint probability needs to be recomputed.

In recent years, the particle filter [8] has been a widely used framework due to its robust performance, and many improved methods have been proposed. Some methods aim at the combination of the particle filter and detection results. Okuma et al. [9] combined the particle filter with Ada-boost detection results to track an unknown number of objects. Li et al. [10] used multiple detectors to form a cascade particle filter to enhance the computational speed. The order in which the detectors were applied was determined based on their computational costs: the faster, the earlier. Breitenstein et al. [11] proposed the continuous confidence of pedestrian detectors, and used it as a graded observation model to guide particle filter trackers. Yang et al. [12] used detection responses to update trackers and extracted multicue features to track targets, including a color model, an elliptical head model, and bags of local features. Other methods focus on the improvement of sampling efficiency. Shan et al. [13] and Cai et al. [14] embedded the mean shift [15] algorithm into the particle filter to improve the sampling efficiency of particles to track hands and multiple persons, respectively. Khan et al. [16] improved the mean shift embedded method by using multimode anisotropic mean shift. The particle filter-based tracking methods are suitable for online applications, since their results are only based on the past frames. These methods do not consider the information of future frames. When targets are fully occluded for a long time, these approaches may yield identity switches or trajectory fragments, since there are no observations to guide the trackers.

In contrast to these methods, which only consider the past information, many global data association approaches have been proposed. Global data association considers not only past frames but also future frames. These methods track targets and deal with occlusions by finding the best matches before and after occlusions. They build tracklets based on the detection responses in consecutive frames and perform association algorithms on these tracklets for the final tracking results. Some researchers call them tracking by tracklet association.

Huang et al. [1] presented a hierarchical association approach. They built reliable tracklets based on object position, size, and color histogram of appearance, and used the Hungarian algorithm to associate tracklets based on these features. Finally, they built an entry and exit map to specify the initialization/termination of each tracklet in the scene to enhance the performance of data association. Xing et al. [2] used the particle filter to refine tracklets and used the Hungarian algorithm to associate tracklets based on color histogram of appearance, size, and motion of targets. Henriques et al. [17] added merge and split measurements of targets to improve the Hungarian association. Wu et al. [18] used network flow to associate tracklets based only on motion features. Brendel et al. [19] formulated the network flow-based tracklet association as the maximum weight independent set problem, and applied linear programming to solve it. The methods in [20], [21], and [22] detected body parts in tracklets to extract local appearance features and applied the Viterbi algorithm, a greedy algorithm, and network flow to associate tracklets, respectively. Some researchers applied machine learning algorithms to solve the tracking problem. Li et al. [23] extracted multiple features to build a feature pool, including color histogram, tracklet length, motion, and so on, and presented a HybridBoost algorithm to learn the affinity models between two tracklets. The method in [3] added pairwise features to improve the feature pool of [23] and presented CRF-based tracklet affinity models. Kuo et al. [24], [25] learned an Ada-boost appearance model to distinguish targets. The tracklet association of [24] was based on the Ada-boost appearance model.


The method of [25] improved the association of [24] by adding motion and time gap features, and the association was solved by the Hungarian algorithm. Yang et al. [26] improved the Ada-boost appearance model of [24] and [25] using multiple instance learning and proposed a nonlinear motion pattern-based tracklet association, which was also solved by the Hungarian algorithm.

All of these tracklet association-based methods mainly focus on performance improvements using different association algorithms. They neglect the importance of feature representation, especially appearance features. They always use filter-based methods, such as the Kalman filter [27], to extract features of motion, position, etc. For appearance features, they only extract low-level features from the entire human, such as color histogram and texture. These appearance features may not work well in the case of partial occlusions, illumination variations, and so on.

In order to improve the appearance representation, some new methods have been proposed by combining spatial information. The methods in [4], [28], and [29] divided the entire tracking region of each frame into a fixed number of subregions. The tracking of the entire object is converted into estimating the similarity of each subregion. Kalal et al. [30] built an ensemble classifier to track targets based on the P-N learning method. They generated pixel comparisons offline at random, and these stayed fixed at runtime. These pixel comparisons recorded the pixel locations and feature distances. Besides fixed spatial information, some methods with dynamic spatial information have been proposed. Fan et al. [31] proposed a dynamic subregion method, called attentional regions (AR), to track targets. Local ARs were searched based on gradient and identified based on a branch-and-bound procedure to determine the target location. Low-similarity ARs would be removed and replaced by new ARs. Birchfield et al. [32] proposed a spatiogram appearance model to track targets based on mean shift. The spatiogram contained the spatial means and covariances for each color histogram bin. Wang et al. [5] proposed a spatial–color mixture of Gaussians (SMOG) appearance model for particle filters. The appearance model was represented as a fixed number of Gaussians at runtime. Spatial information and color distribution were computed for each model of SMOG.

Most of the proposed methods use a fixed spatial layout [4], [28]–[30]. This cannot satisfy the dynamic requirements, especially for nonrigid targets with pose changes. Though some methods with dynamic spatial information [5], [31], [32] improve on the fixed spatial layout, these methods still have their own weaknesses. They always use the latest appearance data of objects to update the appearance model. With the increase of frame steps, the model will gradually forget the earlier appearance. When a target is fully occluded for a long time, the appearance models of these methods stop updating during the occlusion, since there are no observations to update them. Due to the complexity of real scenes, the target appearance may have some variations when it reappears after the occlusion. This may be caused by illumination variations, or even by changes of camera perspective due to the target movements. In this case, it is difficult for these methods to estimate the appearance similarity of the target using the latest appearance model. If the target appearance is more similar to the earlier appearance model, the similarity measurement of the latest model would fail.

Our spatial–temporal appearance model improves the appearance representations mentioned above and promotes the tracking performance. Our appearance model not only provides the dynamic spatial layout of the appearance of each target, including the dynamic number and locations of subregions, but also provides the dynamic duration time of each subregion. The temporal distribution of the appearance records not only the latest appearance model, but also the earlier appearance model. We can dynamically select the most similar appearance model to associate tracklets for the final tracking results.

Our association method uses Bayesian motion prediction with a fuzzy search range to guide the appearance association of tracklets. Compared with MHT, our method only predicts the most likely motion direction instead of all possible paths, which require more computational cost. In order to compensate for the imprecise predicted position, we add the fuzzy search strategy.

    III. Overview of Our Approach

We adopt the framework of tracking by tracklet association in our approach. This framework can be mainly formalized as (1). $L$ is the set of association results, $L = \{l_1, \ldots, l_N\}$. Each element of $L$ represents whether two tracklets can be associated. $S$ is the set of tracklets, $S = \{TR_1, \ldots, TR_M\}$. $f(\cdot)$ is a cost function to associate tracklets. Equation (1) represents that the association results are the best matches of tracklets subject to the nonoverlap restriction. The nonoverlap restriction means that two tracklets cannot have overlapping duration time, and two association results cannot share the same tracklet:

$$L^{*} = \arg\max_{L} f(L \mid S)$$
$$\text{subject to} \quad TR_i \cap TR_j = \emptyset, \;\; \forall i, j \le M; \qquad l_p \cap l_q = \emptyset, \;\; \forall p, q \le N. \tag{1}$$

In previous methods, such as [1]–[3] and [23]–[26], each tracklet $TR_i$ is represented by a sequence of detection responses $TR_i = \{r_i^{t_s}, \ldots, r_i^{t_e}\}$, where $r_i^{t_s}$ and $r_i^{t_e}$ denote the responses at the start and end frames of tracklet $TR_i$. The features of each tracklet are extracted from each detection response independently. For example, each response in frame $t$ is represented as $r_i^t = \{a_i^t, p_i^t, s_i^t, v_i^t\}$, where $a_i^t$ is the appearance, $p_i^t$ is the position, $s_i^t$ is the size, and $v_i^t$ is the velocity.

In this paper, we improve the feature extraction of appearance by building a spatial–temporal appearance model for each tracklet instead of extracting the appearance from each response in each frame independently. Therefore, in our method, the complete representation of $TR_i$ is $TR_i = \{G_i, \{r_i^{t_s}, \ldots, r_i^{t_e}\}\}$, where $G_i$ is the spatial–temporal appearance model of the whole tracklet $TR_i$. Then, $G_i$ is used to associate tracklets.

The flow chart of our approach is shown in Fig. 1. First, reliable tracklets are built from the detection responses. Then, there are two main components. One is the spatial–temporal appearance model that is built from each tracklet. We extract the spatial–temporal appearance based on color features. Similar colors are clustered in space and time based on the online EM algorithm.


Fig. 1. Overview of our approach. In the framework of tracking by tracklet association, a new appearance representation called the spatial–temporal appearance model is proposed and used to associate tracklets.

The classes of the tracklet are formulated as a GMM. Each class corresponds to a Gaussian distribution. The appearance of each tracklet is modeled as a whole rather than independently for each frame. The other component is the tracklet association. We predict the locations of targets in the current frame to obtain the motion cost between two tracklets, and apply a Gaussian selection algorithm to the GMMs of the spatial–temporal appearance to find the GMM subsets with the minimal appearance distance between two tracklets. The Jensen–Shannon divergence (JSD) is used to compute the appearance cost between two tracklets. The spatial–temporal appearance cost and the motion cost are used to associate tracklets for the final tracking results. The spatial–temporal appearance model and the tracklet association are described in Sections IV-A and IV-B, respectively.

IV. Spatial–Temporal Appearance and Tracklet Association

A. Spatial–Temporal Appearance

Given the set of reliable tracklets, we start to extract the spatial–temporal appearance for each tracklet. In order to achieve this goal, we apply a GMM to this task. Each Gaussian distribution represents the spatial–temporal color distribution of pixels that belong to the same class. Each Gaussian distribution corresponds to a class. The weight of each Gaussian model represents whether the Gaussian distribution is important or not: the greater the weight, the higher the importance. Here, we apply the online EM algorithm to estimate the parameters of the GMM rather than offline EM, since the association of tracklets runs from previous frames until the current frame. With the increase of frame steps, tracklets that are not terminated in the current frame may grow in future frames. The online EM can handle this case, as data are added in each future frame to update the parameters of the GMM. If we used offline EM to estimate the parameters of the GMM, with the addition of new data from future frames, the parameters of the GMM would have to be recomputed completely using the new sample space, including both old and new samples. This would waste a lot of time due to recomputing the distribution over all samples. That means tracklets could not compute the appearance distribution using offline EM until they terminated completely before the current frame. For this reason, the online EM is more appropriate. It can achieve reasonable statistical results as new data are added in the current frame. In order to show how we use online EM to estimate the parameters of the GMM in our paper, we build Algorithm 1.

Algorithm 1: Learning GMM appearance model
Input: Tracklet TR = {r^t}; initialize t at the beginning of TR; K = 0
1:  repeat
2:    for each pixel x_{r_i^t} of r^t do
3:      if K = 0 then
4:        Initialize a new Gaussian distribution; K ← K + 1
5:      end
6:      E-step: for k = 1 to K do
7:        ρ_{ik} = N(x_{r_i^t} | μ_k, Σ_k)
8:      end
9:      if max_k(ρ_{ik}) < θ_1 then
10:       Initialize a new Gaussian distribution; K ← K + 1; break
11:     end
12:     M-step: for k = 1 to K do
13:       ρ̂_{ik} = ρ_{ik} / Σ_{m=1}^{K} ρ_{im};  C_k ← C_k + ρ̂_{ik}
14:       μ_k^{new} = (1 − ρ̂_{ik}/C_k) μ_k^{old} + (ρ̂_{ik}/C_k) x_{r_i^t}
15:       Σ_k^{new} = (1 − ρ̂_{ik}/C_k) Σ_k^{old} + (ρ̂_{ik}/C_k)(x_{r_i^t} − μ_k^{new})(x_{r_i^t} − μ_k^{new})^T
16:       ω_k = C_k / Σ_{m=1}^{K} C_m
17:     end
18:   end
19:   for a = 1 to K do
20:     for b = 1 to K do
21:       if JSD(N_a || N_b) < ε then
22:         μ_{ab} = [C_a/(C_a + C_b)] μ_a + [C_b/(C_a + C_b)] μ_b
23:         Σ_{ab} = [C_a/(C_a + C_b)] Σ_a + [C_b/(C_a + C_b)] Σ_b
24:         ω_{ab} = (C_a + C_b) / Σ_{k=1}^{K} C_k
25:         T_{ab} = max(T_a^{end}, T_b^{end}) − min(T_a^{start}, T_b^{start})
26:       end
27:     end
28:   end
29:   Compute the new end frame T_k^{end} for each Gaussian N_k
30:   T_k = T_k^{end} − T_k^{start}
31:   t ← t + 1
32: until r^t is the tail of TR
Output: GMM appearance distribution with K Gaussian distributions

First, we need to define some important variables before showing the algorithm for building the spatial–temporal appearance. Suppose $G$ is the GMM appearance distribution for a tracklet $TR$. $N_k$ is one of the Gaussian distributions in the GMM, $G = \{N_k\}$, where $k$ is the index of each Gaussian distribution. For each Gaussian distribution $N_k$, we define $N_k = \{\omega_k, \mu_k, \Sigma_k, T_k\}$. $\omega_k$ is the weight of the Gaussian distribution $N_k$, $\mu_k$ is the mean, and $\Sigma_k$ is the covariance matrix.


Finally, $T_k$ is the duration time of the Gaussian distribution $N_k$. The mean $\mu_k$ and covariance $\Sigma_k$ each contain five parameters: the positions $x$ and $y$, which are normalized by the width and height of the human detection bounding box, and the color channels $R$, $G$, and $B$, which are normalized by the value range of the RGB color space. They are independent distributions. This is shown in (2), where $\mathrm{diag}\{\cdot\}$ denotes a diagonal matrix:

$$\mu_k = \{\mu_{kx}, \mu_{ky}, \mu_{kR}, \mu_{kG}, \mu_{kB}\}^{T}$$
$$\Sigma_k = \mathrm{diag}\{\sigma_{kx}^{2}, \sigma_{ky}^{2}, \sigma_{kR}^{2}, \sigma_{kG}^{2}, \sigma_{kB}^{2}\}. \tag{2}$$

Here, we must further explain the definition of the variables. For the position distribution, we use $\{\mu_{kx}, \mu_{ky}, \sigma_{kx}, \sigma_{ky}\}$ to model it. Since there are some displacements of the location of the same class in each frame as a class crosses several frames, it is difficult to use a fixed position range to describe it. For this reason, we use a Gaussian distribution to model the position of the class. In addition, the position of each class is the relative position in the detection bounding box, rather than the absolute position in the image.

The duration time of each Gaussian is used to constrain whether Gaussians can form a subset to compute the similarity with another subset. Only Gaussians with overlapping duration time can form a subset. We do not use a Gaussian distribution to model the duration, since the variance cannot reflect the real duration time of each Gaussian. For example, suppose the duration time of Gaussian $a$ is from frame 1 to 10 and the duration time of another Gaussian $b$ is from frame 6 to 30. These two Gaussians have an overlapping time of five frames. If we used a Gaussian to model the time dimension, the time distribution of Gaussian $a$ would be $\mu_{at} = 5.5$, $\sigma_{at} = 2.87$, and the time distribution of Gaussian $b$ would be $\mu_{bt} = 18$, $\sigma_{bt} = 7.2$. It would be difficult to estimate whether two Gaussians overlap or not using the time variance.
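To make the overlap test concrete, a duration can simply be kept as a frame interval [T^start, T^end]; the tiny sketch below (our illustration, not from the paper) checks whether two such intervals overlap.

```python
def durations_overlap(start_a, end_a, start_b, end_b):
    """Check whether two duration intervals [start, end] (in frames) overlap.

    Illustrative helper only: Gaussian a spanning frames 1-10 and Gaussian b
    spanning frames 6-30 overlap, regardless of their time variances.
    """
    return max(start_a, start_b) <= min(end_a, end_b)

# Example from the text: frames 1-10 vs. frames 6-30 overlap by five frames.
assert durations_overlap(1, 10, 6, 30)
assert not durations_overlap(1, 10, 20, 30)
```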

The algorithm for building the spatial–temporal appearance is shown in Algorithm 1. The input of this algorithm is a tracklet $TR$, where $r^t$ is the detection response of $TR$ in frame $t$. We initialize the frame index $t$ at the beginning of tracklet $TR$. The number of Gaussian distributions $K$ is initialized to 0 before the algorithm runs. For each tracklet $TR$, we compute the parameters of the GMM distribution using the online EM algorithm based on the detection response of each frame until the tail of tracklet $TR$. For each detection response $r^t$ of tracklet $TR$, we compute the similarity of each pixel of response $r^t$ with each Gaussian distribution and select the maximal similarity. If the maximal similarity is less than the threshold $\theta_1$, we initialize a new Gaussian distribution in the GMM. Otherwise, all Gaussian distributions are updated based on the online EM algorithm. Lines 6 to 11 are the E-step of online EM, and lines 12 to 17 are the M-step. In line 13, the similarity $\rho_{ik}$ of each Gaussian distribution is normalized to $\hat{\rho}_{ik}$ and summed into $C_k$ to form the final update factor $\hat{\rho}_{ik}/C_k$. The component $\hat{\rho}_{ik}$ of this update factor updates the Gaussians in proportion to their estimated posterior probability in each frame. The component $C_k$ guarantees the stability of the GMM parameters when new samples are added, due to the accumulation of a large number of samples. After the online EM in each frame, we need to check whether two Gaussian distributions should be merged. Here, we use the JSD [33] to estimate the distance between two Gaussian distributions. The definition of the JSD distance will be described in (6) and (7) of Section IV-B. When the distance between two Gaussian distributions $a$ and $b$ is less than $\varepsilon$, which is a very small number, we merge them based on the proportion of $C_a$ and $C_b$. This is shown in lines 22 to 24. The duration time of Gaussian distributions $a$ and $b$ is also merged, based on line 25. Finally, in lines 29 and 30, we update the duration time of each Gaussian distribution, and repeat the above algorithm for the next frame $t + 1$ until the tail of tracklet $TR$. Based on this algorithm, the schematic diagram of the spatial–temporal appearance is shown in Fig. 2.

Fig. 2. Illustration of the spatial–temporal appearance model. Each ellipsoid represents a Gaussian distribution of the appearance model and the duration time of the Gaussian distribution.
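As an illustration of the update in Algorithm 1, the following Python sketch (our own, not the authors' code) runs the per-pixel online EM step on the 5-D feature [x, y, R, G, B] of (2) with diagonal covariances. The initial variance, the default value of θ₁, and the rule that only the best-matching component extends its duration are assumptions made for the example, and the merging step of lines 19 to 28 is omitted.

```python
import numpy as np

class OnlineGMMAppearance:
    """Minimal sketch of the per-pixel online EM update of Algorithm 1.

    Each component keeps a weight proxy C_k, a 5-D mean/variance over the
    normalized feature [x, y, R, G, B], and a duration interval in frames.
    """

    def __init__(self, theta1=0.1):
        self.theta1 = theta1              # threshold for spawning a new Gaussian
        self.mu, self.var, self.C, self.t0, self.t1 = [], [], [], [], []

    def _pdf(self, k, x):
        # Diagonal-covariance Gaussian density N(x | mu_k, Sigma_k).
        d = x - self.mu[k]
        return np.exp(-0.5 * np.sum(d * d / self.var[k])) / \
               np.sqrt((2 * np.pi) ** len(x) * np.prod(self.var[k]))

    def update(self, x, t):
        """E/M step for one pixel feature x observed in frame t."""
        if not self.mu or max(self._pdf(k, x) for k in range(len(self.mu))) < self.theta1:
            # Spawn a new component (lines 3-4 / 9-10); 0.01 is an assumed init variance.
            self.mu.append(np.array(x, dtype=float))
            self.var.append(np.full(len(x), 0.01))
            self.C.append(1.0)
            self.t0.append(t); self.t1.append(t)
            return
        rho = np.array([self._pdf(k, x) for k in range(len(self.mu))])
        rho_hat = rho / rho.sum()                       # normalization of line 13
        for k in range(len(self.mu)):                   # M-step (lines 13-16)
            self.C[k] += rho_hat[k]
            lr = rho_hat[k] / self.C[k]                 # update factor rho_hat / C_k
            self.mu[k] = (1 - lr) * self.mu[k] + lr * x
            d = x - self.mu[k]
            self.var[k] = (1 - lr) * self.var[k] + lr * d * d
        self.t1[int(np.argmax(rho_hat))] = t            # extend duration of best match

    def weights(self):
        C = np.array(self.C)
        return C / C.sum()                              # omega_k = C_k / sum_m C_m

# Usage sketch: feed every pixel feature of every response of a tracklet.
# model = OnlineGMMAppearance(theta1=0.1)
# for t, response in enumerate(tracklet_pixels):   # response: list of [x, y, R, G, B]
#     for feat in response:
#         model.update(np.asarray(feat, dtype=float), t)
```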

    B. Tracklet Association

After building the spatial–temporal appearance model for each tracklet, we start to associate the tracklets for the final tracking results. The main idea of our tracklet association is based on the motion prediction of Bayesian methods, since the motion of targets in the current frame can be predicted based on the motion of several previous frames. In order to implement this strategy, a popular prediction tool in computer vision research, the Kalman filter [27], is typically used to predict the location of a target. The Kalman filter is a linear system with Gaussian noise based on a Markov chain; it predicts the target location based only on the information of the last frame. By repeating the Kalman filter frame by frame, we can predict the location of the target. When a target is occluded, the reliable tracklet of the target is terminated. Then, based on the latest Kalman state of this tracklet, we can predict the approximate location of the target in future frames while the target is occluded. However, the linear motion prediction may be imprecise over long frame gaps due to nonlinear motion [34]. We will propose a strategy to alleviate this in the following part.
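As a sketch of the motion prediction described above, the snippet below rolls a Kalman prediction forward over an occlusion gap; the constant-velocity state [px, py, vx, vy] and the noise level q are our assumptions for illustration, not values given in the paper.

```python
import numpy as np

def predict_location(x, P, n_frames, q=1e-2):
    """Roll a constant-velocity Kalman prediction forward n_frames.

    Illustrative sketch only: the paper uses a Kalman filter for motion
    prediction, but the state model and noise level here are our choices.
    """
    F = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # one-frame transition
    Q = q * np.eye(4)                            # process noise
    for _ in range(n_frames):                    # prediction only: no observations
        x = F @ x                                # are available during the gap
        P = F @ P @ F.T + Q
    return x[:2], P                              # predicted (px, py) and covariance

# Example: last Kalman state of a terminated tracklet, predicted 10 frames ahead.
x0 = np.array([120.0, 80.0, 2.0, -0.5])          # position and per-frame velocity
loc, P = predict_location(x0, np.eye(4), n_frames=10)
```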

When we obtain the predicted location of the occluded target, in the normal case, we always search for the target around the predicted location to track it continually. However, the range "around the predicted location" is difficult to define precisely by a fixed boundary. We propose a fuzzy search method based on an exponential distribution instead of a fixed boundary.


The exponential distribution can be modeled by $\beta^{\,d(L_m, L_n)}$, where $\beta$ is the base of the exponential distribution and $d(\cdot)$ is the normalized Euclidean distance between the predicted location $L_m$ and the target location $L_n$ of another tracklet in the current frame. With the increase of the search radius, the location distance becomes larger and larger. That means the probability of finding the target becomes lower and lower as the search radius increases:

$$S_{m,n} = JSD(G_m \| G_n) \cdot \beta^{\,d(L_m, L_n)}. \tag{3}$$

In order to associate tracklets to form the final tracking results, we still need to use the spatial–temporal appearance model to measure the appearance similarity between tracklets. The association of tracklets can be represented as (3). $S_{m,n}$ represents the similarity distance between two tracklets. If $S_{m,n}$ is lower than that of any other tracklet pair, the two tracklets can be associated. This can be solved by the Hungarian algorithm [35] in our approach. In fact, for efficiency of implementation, the approach uses a time sliding window on the video sequence to compute the smallest distance $S_{m,n}$; it does not process the whole video together. $JSD(\cdot)$ is the Jensen–Shannon divergence used to compute the similarity between two spatial–temporal appearance models. $G_m$ and $G_n$ are two GMM distributions; we will describe the details of the JSD in the following part. The factor $\beta^{d(L_m, L_n)}$ is the exponential distribution of the fuzzy search method. In order to decide whether two tracklets can form a pair to compute the similarity, the nonoverlap restriction shown in (1) must be satisfied. This restriction is reasonable, since a person cannot belong to two tracks or appear in two places at the same time.
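The association step can be pictured as building a pairwise cost matrix from (3)/(5) under the nonoverlap restriction and solving it with the Hungarian algorithm. The sketch below is our illustration: jsd_gmm, predicted_position, the tracklet attributes, and max_cost are assumed placeholders rather than the paper's implementation, and the distance is left unnormalized.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(terminated, initialized, jsd_gmm, predicted_position, beta=100.0, max_cost=1e3):
    """Sketch of tracklet association via a pairwise cost matrix and the
    Hungarian algorithm. `terminated`/`initialized` are lists of tracklets;
    `jsd_gmm` returns the appearance distance of (4) between two GMMs and
    `predicted_position` returns the Kalman-predicted location of a tracklet."""
    cost = np.full((len(terminated), len(initialized)), max_cost)
    for i, tr_m in enumerate(terminated):
        for j, tr_n in enumerate(initialized):
            if tr_m.end_frame >= tr_n.start_frame:       # nonoverlap restriction of (1)
                continue
            d = np.linalg.norm(predicted_position(tr_m, tr_n.start_frame)
                               - tr_n.start_location)     # fuzzy search distance
            cost[i, j] = jsd_gmm(tr_m.gmm, tr_n.gmm) * beta ** d   # Eq. (3)/(5)
    rows, cols = linear_sum_assignment(cost)             # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```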

The distance between two GMM distributions can be computed by using their weights, means, and variances instead of comparing each sample pair. This can be solved by using the JSD or the Kullback–Leibler divergence (KLD) [36]. We employ the JSD to compute the similarity between two GMM distributions instead of the KLD, since the KLD is a nonsymmetric and unnormalized distance. The KLD is in the range of $[0, +\infty)$, and the distance $KLD(G_m \| G_n)$ is different from the distance $KLD(G_n \| G_m)$. However, the JSD is in the range of $[0, 1]$, and it is a symmetric distance. Therefore, the JSD is more suitable to evaluate the distance between two GMM distributions.

The similarity between two GMMs still cannot be computed directly, since different orderings of the Gaussians in the GMMs lead to different distances. We need a strategy to find the smallest distance between two GMMs. The smallest distance leads to the strictest comparison in the global tracklet association. Furthermore, the appearance of a person must appear as a whole: the subregions of the appearance must appear with overlapping duration time, otherwise the combination of subregions would be meaningless. In order to satisfy this requirement, which is the smallest distance between two GMMs under the restriction of overlapping duration time, we present Algorithm 2 to select Gaussian distributions from the two GMMs to compute the similarity. The selection in this algorithm is mainly based on the parameter $T_k$ of each Gaussian distribution.

The selection algorithm of Gaussian distributions is shown in Algorithm 2. We first input two GMM distributions ($G_m$ and $G_n$), which belong to two tracklets, respectively, and initialize two sets $A$ and $B$ to store the Gaussian distributions selected from $G_m$ and $G_n$, respectively. $G_m$ is a GMM distribution that belongs to a terminated tracklet. $G_n$ is a GMM distribution that belongs to a newly initialized tracklet.

Algorithm 2: Gaussian selection from GMM
Input: Two GMM distributions G_m and G_n; initialize two sets A = B = ∅
1:  for each N_{m,i} of G_m do
2:    for each N_{n,j} of G_n do
3:      d_{mi,nj} = JSD(N_{m,i} || N_{n,j})
4:    end
5:  end
6:  (i, j) = argmin_{i,j} d_{mi,nj}
7:  Add N_{m,i} to A; add N_{n,j} to B
8:  repeat
9:    Initialize two sets AT = ∅ and BT = ∅
10:   for each N_{m,i} of G_m with N_{m,i} ∉ A and T_{m,i} ∩ T_A ≠ ∅ do
11:     Add N_{m,i} to AT
12:     for each N_{n,j} of G_n with N_{n,j} ∉ B and T_{n,j} ∩ T_B ≠ ∅ do
13:       Add N_{n,j} to BT
14:       d_{mi,nj} = JSD(N_{m,i} || N_{n,j})
15:     end
16:   end
17:   if AT ≠ ∅ and BT ≠ ∅ then
18:     (i, j) = argmin_{i,j} d_{mi,nj}
19:     Add N_{m,i} to A; add N_{n,j} to B
20:   end
21:   if (AT ≠ ∅ and BT = ∅) or (AT = ∅ and BT ≠ ∅) then
22:     for each Gaussian N_{AT,p} (or N_{BT,q}) of AT (or BT) do
23:       Find the Gaussian N_B (or N_A) in set B (or A) with the minimal JSD distance
24:       Add N_{AT,p} (or N_{BT,q}) to A (or B)
25:       Repeat N_B (or N_A) in B (or A)
26:     end
27:   end
28: until A and B do not increase
29: Normalize the weights of the Gaussians in sets A and B, respectively
Output: The sets A and B, called GMM distributions G'_m and G'_n, respectively; their Gaussian numbers K'_m and K'_n are equal, so we call them K'

First, from lines 1 to 7, we compute the similarity of each pair of Gaussian distributions in $G_m$ and $G_n$, and store the minimal-distance pair $(N_{m,i}, N_{n,j})$ in the sets $A$ and $B$, respectively: $N_{m,i}$ is added to $A$, and $N_{n,j}$ is added to $B$. Then, from lines 8 to 28, we repeat the selection process until the sets $A$ and $B$ do not increase.

In lines 10 and 12, which have the same meaning, the loop conditions are changed compared with lines 1 and 2. For line 10, $N_{m,i}$ must not already belong to the set $A$ before its similarity is computed. $T_A$ represents the duration times of the Gaussian distributions in set $A$. $T_{m,i} \cap T_A \neq \emptyset$ means that the duration time $T_{m,i}$ of $N_{m,i}$ overlaps the duration time of each Gaussian distribution in set $A$; this is illustrated in Fig. 3(a). This condition ensures that the Gaussian distributions in set $A$ (or $B$) appear at the same time, since the integrated appearance of a person must appear as a whole. However, the Gaussian distributions of the original input $G_m$ (or $G_n$) may not appear simultaneously. As in Fig. 3(b), some Gaussian distributions do not appear simultaneously. From lines 17 to 27, we select the minimal-distance pair and store it.


Fig. 3. Selection of Gaussian distributions.

When $AT \neq \emptyset$ and $BT \neq \emptyset$, we can directly compute the minimal distance between Gaussian distributions and store them in $A$ and $B$; this is shown in lines 17 to 19. In line 21, if one of these two sets is empty, we follow lines 22 to 26 to find a match. This is very important: if a Gaussian distribution in $AT$ cannot find a match in $BT$, it will be discarded, which leads to missing features. We therefore repeat some Gaussian distributions in set $B$ to match them. Finally, in line 29, we normalize the weights of the Gaussian distributions in sets $A$ and $B$, respectively.

After the selection process, the two sets $A$ and $B$ are output. The Gaussian distributions in set $A$ are called the GMM distribution $G'_m$, and the Gaussian distributions in set $B$ are called the GMM distribution $G'_n$. The numbers of Gaussian distributions of $G'_m$ and $G'_n$ are $K'_m$ and $K'_n$, respectively. Since $K'_m = K'_n$, we call them $K'$.

Algorithm 2 is thus a dynamic selection algorithm of Gaussian distributions for the minimal GMM distance. The dynamic selection may select the GMM appearance that appeared the earliest, or the latest, as long as the appearance distance between the two tracklets is minimal.

After the selection of Algorithm 2, the distance between the two GMM distributions can be computed by (4), where $\omega_{m,k}$ and $\omega_{n,k}$ are the weights of the Gaussian distributions $N_{m,k}$ and $N_{n,k}$. Equation (3) is then modified into (5):

$$JSD(G'_m \| G'_n) = \sum_{k=1}^{K'} \frac{\omega_{m,k} + \omega_{n,k}}{\sum_{k=1}^{K'} (\omega_{m,k} + \omega_{n,k})} \, JSD(N_{m,k} \| N_{n,k}) \tag{4}$$

$$S_{m,n} = JSD(G'_m \| G'_n) \cdot \beta^{\,d(L_m, L_n)}. \tag{5}$$

Finally, we need to compute the JSD distance between two single Gaussian distributions. For ease of understanding, we consider two single Gaussian distributions $N_1$ and $N_2$, where $N_1$ corresponds to $N_{m,k}$ and $N_2$ corresponds to $N_{n,k}$. The definition of the JSD in line 21 of Algorithm 1 is the same as the description in this part, with $N_1$ and $N_2$ corresponding to $N_a$ and $N_b$ in Algorithm 1, respectively:

$$JSD(N_1 \| N_2) = \frac{1}{2}\left[ KLD(N_1 \| \bar{N}) + KLD(N_2 \| \bar{N}) \right] \tag{6}$$

$$\bar{N} = \frac{1}{2}(N_1 + N_2). \tag{7}$$

The general form of the JSD is shown in (6) and (7). $N_1 = \{\omega_1, \mu_1, \Sigma_1\}$ and $N_2 = \{\omega_2, \mu_2, \Sigma_2\}$ are the two Gaussian distributions for which we compute the JSD. $KLD(\cdot)$ is the Kullback–Leibler divergence. $\bar{N} = \{\bar{\omega}, \bar{\mu}, \bar{\Sigma}\}$ is the mixture distribution of $N_1$ and $N_2$. In [37], $\bar{N}$ is computed by (8) and (9) without Gaussian weights:

$$\bar{\mu} = \frac{1}{2}(\mu_1 + \mu_2) \tag{8}$$

$$\bar{\Sigma} = \frac{1}{2}\left( \Sigma_1 + \Sigma_2 + \mu_1\mu_1^{T} + \mu_2\mu_2^{T} \right) - \bar{\mu}\bar{\mu}^{T}. \tag{9}$$

Since our spatial–temporal appearance model is based on a GMM distribution, the weight of each Gaussian distribution is still important. We modify (8) and (9) with the weights of the Gaussian distributions, as shown in (10), (11), and (12). The KLD is shown in (13) and (14). Here, we compute the divergence $KLD(N_1 \| \bar{N})$ as an example; $KLD(N_2 \| \bar{N})$ is computed in the same way. $P$ is the normalized Gaussian distribution ($\sum_x P(x) = 1$) and $\dim$ is the dimension of the Gaussian distribution:

$$\bar{\omega} = \frac{1}{2}(\omega_1 + \omega_2) \tag{10}$$

$$\bar{\mu} = \frac{\omega_1\mu_1 + \omega_2\mu_2}{\omega_1 + \omega_2} \tag{11}$$

$$\bar{\Sigma} = \frac{\omega_1\left(\Sigma_1 + \mu_1\mu_1^{T}\right) + \omega_2\left(\Sigma_2 + \mu_2\mu_2^{T}\right)}{\omega_1 + \omega_2} - \bar{\mu}\bar{\mu}^{T} \tag{12}$$

$$\begin{aligned} KLD(N_1 \| \bar{N}) &= \sum_x N_1(x) \ln\frac{N_1(x)}{\bar{N}(x)} = \sum_x \omega_1 P_1(x) \ln\frac{\omega_1 P_1(x)}{\bar{\omega}\bar{P}(x)} \\ &= \omega_1 \ln\frac{\omega_1}{\bar{\omega}} \sum_x P_1(x) + \omega_1 \sum_x P_1(x) \ln\frac{P_1(x)}{\bar{P}(x)} \\ &= \omega_1 \ln\frac{\omega_1}{\bar{\omega}} + \omega_1 \sum_x P_1(x) \ln\frac{P_1(x)}{\bar{P}(x)}. \end{aligned} \tag{13}$$

Replacing $P$ with the Gaussian distribution function, we obtain

$$KLD(N_1 \| \bar{N}) = \omega_1 \ln\frac{\omega_1}{\bar{\omega}} + \omega_1 \cdot \frac{1}{2}\left[ \ln\frac{|\bar{\Sigma}|}{|\Sigma_1|} + \mathrm{tr}\left(\bar{\Sigma}^{-1}\Sigma_1\right) - \dim + \left(\bar{\mu} - \mu_1\right)^{T} \bar{\Sigma}^{-1} \left(\bar{\mu} - \mu_1\right) \right]. \tag{14}$$
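The weighted-Gaussian JSD of (6) and (10)-(14), and the GMM-level combination of (4) over the K' matched pairs, can be written compactly; the sketch below is our reading of these formulas in NumPy (function names and array conventions are ours), not the authors' implementation.

```python
import numpy as np

def kld_weighted(w1, mu1, S1, wbar, mubar, Sbar):
    """KLD(N1 || N_bar) for weighted Gaussians, following Eq. (13)-(14)."""
    dim = mu1.shape[0]
    d = mubar - mu1
    Sbar_inv = np.linalg.inv(Sbar)
    gauss_term = 0.5 * (np.log(np.linalg.det(Sbar) / np.linalg.det(S1))
                        + np.trace(Sbar_inv @ S1) - dim + d @ Sbar_inv @ d)
    return w1 * np.log(w1 / wbar) + w1 * gauss_term

def jsd_gaussians(w1, mu1, S1, w2, mu2, S2):
    """JSD between two weighted Gaussians via the merged distribution of Eq. (10)-(12)."""
    wbar = 0.5 * (w1 + w2)                                            # Eq. (10)
    mubar = (w1 * mu1 + w2 * mu2) / (w1 + w2)                         # Eq. (11)
    Sbar = ((w1 * (S1 + np.outer(mu1, mu1)) + w2 * (S2 + np.outer(mu2, mu2)))
            / (w1 + w2) - np.outer(mubar, mubar))                     # Eq. (12)
    return 0.5 * (kld_weighted(w1, mu1, S1, wbar, mubar, Sbar)
                  + kld_weighted(w2, mu2, S2, wbar, mubar, Sbar))     # Eq. (6)

def jsd_gmm(weights_m, mus_m, Ss_m, weights_n, mus_n, Ss_n):
    """GMM-level distance of Eq. (4) over K' matched Gaussian pairs."""
    pair_w = np.array(weights_m) + np.array(weights_n)
    pair_jsd = np.array([jsd_gaussians(weights_m[k], mus_m[k], Ss_m[k],
                                       weights_n[k], mus_n[k], Ss_n[k])
                         for k in range(len(weights_m))])
    return float(np.sum(pair_w / pair_w.sum() * pair_jsd))
```

Covariances are passed as full matrices here; for the diagonal covariances of (2), diagonal matrices can be supplied directly.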

    V. Implementation Details

    A. Building Reliable Tracklets

We borrow the idea of [1] to build reliable tracklets, for the same reasons as in [3] and [23]–[26], since it is a simple and conservative method. This approach considers that the changes of the target are very small in two consecutive frames, including the position displacement, the changes of appearance, and so on. The affinity is formalized as

$$r^{t*} = \arg\min_{r_i^t:\; d\left(L_{r_i^t},\, L_{r^{t-1}}\right) \le \Omega} B\left(r_i^t, r^{t-1}\right) \tag{15}$$


where $L_{r_i^t}$ represents the center location of the detection response $r_i^t$ in frame $t$, and $L_{r^{t-1}}$ represents the center location of the observation $r^{t-1}$. $d(\cdot)$ is the Euclidean distance. $B(\cdot)$ is the Hellinger distance of the color histograms between two observations, defined as $\sqrt{1 - BC(r_i^t, r^{t-1})}$, where $BC(\cdot)$ is the Bhattacharyya coefficient. $\Omega$ represents the range of the neighborhood position. In (15), we compute the appearance distance between detection responses in the neighborhood of $r^{t-1}$, and link the response with the minimal distance, $r^{t*}$, to the corresponding tracklet. This strategy is conservative and biased toward linking only reliable associations between any two consecutive frames. In order to prevent unsafe associations of responses, a boundary value of the affinity, $\theta_2$, is defined in [1]. In other words, two responses are linked if and only if their affinity distance is lower than the threshold $\theta_2$ and significantly lower than the distance of any other pair. In our method, the boundary value is $\theta_2 = 0.3$, since the Hellinger distance is in the range of $[0, 1]$.
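For reference, the appearance term of (15) reduces to a few lines; the helper below is a small sketch of the Hellinger distance computed from the Bhattacharyya coefficient of two normalized color histograms (the histogram binning is left to the caller).

```python
import numpy as np

def hellinger_distance(hist_a, hist_b):
    """Hellinger distance between two color histograms, computed from the
    Bhattacharyya coefficient as sqrt(1 - BC); result lies in [0, 1]."""
    hist_a = hist_a / hist_a.sum()
    hist_b = hist_b / hist_b.sum()
    bc = np.sum(np.sqrt(hist_a * hist_b))        # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

# Two responses are linked only if this distance is below theta_2 = 0.3
# and clearly below the distance of any competing pair.
```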

Since the human detector only gives the approximate size of the detection responses, the size may be larger or smaller than the size of the ground truth target. In order to extract accurate features for the tracklet association, tracklet refinement is needed. Let $s^t$ be the size of an observation $r^t$ in a tracklet $TR$, where $t$ is the frame index. The refinement can be formalized as

$$s^{t} = \frac{1}{\tau}\sum_{i=1}^{\tau} s^{t-i} \tag{16}$$

where $\tau$ is the length of the time sliding window in the refinement filter and is set to five. The size of the detection response $r^t$ is refined based on the average size over the previous frames. By repeating this equation until the tail of the tracklet, the size of each detection response in the tracklet can be smoothed.
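Read directly, (16) is a causal moving average over the previous τ = 5 sizes; the helper below is a minimal sketch of that smoothing (the handling of the first frames, which lack a full window, is our choice).

```python
import numpy as np

def smooth_sizes(sizes, tau=5):
    """Smooth detection-response sizes along a tracklet with a causal moving
    average over the previous tau frames, following Eq. (16). The first tau
    frames, which lack a full window, are left unchanged in this sketch."""
    sizes = np.asarray(sizes, dtype=float)
    smoothed = sizes.copy()
    for t in range(tau, len(sizes)):
        smoothed[t] = sizes[t - tau:t].mean()    # average of the tau previous sizes
    return smoothed

# Example: widths (or heights) of a tracklet's bounding boxes, tau = 5.
print(smooth_sizes([30, 31, 29, 35, 30, 42, 31], tau=5))
```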

    B. Entry and Exit Model

In order to use the Hungarian algorithm to compute the optimal association results of (3), the entry/exit model of the scene must be defined to specify the initialization/termination of each tracklet in the camera view. The entry/exit model includes two parts in our approach: one is the boundary of the camera view, and the other is the entry/exit zones within the camera view. The boundary of the camera view easily specifies the initialization/termination of a tracklet: when a target enters the image across the boundary of the camera view, its tracklet is initialized, and when a target leaves the image across the boundary of the camera view, its tracklet is terminated. The entry/exit zones within the camera view need to be learned.

We use our previous work [38] to finish this task. That work is a trajectory analysis method that can also learn entry/exit zones. In [38], each trajectory is represented as a sequence of key points and start/end points. The features of each key point are its coordinates, turning angle (TA), and turning angle direction (TAD). For the start/end points, the features are only the coordinates. Key points are used to learn the classification of trajectories; the start/end points are used to learn the entry/exit zones of scenes. In our approach, only the coordinate features of the start/end points of each trajectory are extracted, without key points, since we only use the method of [38] to learn the entry/exit zones and not for trajectory classification. The start/end points of trajectories are clustered using an unsupervised EM algorithm. Each class is modeled as a Gaussian distribution, represented as $\{\mu_x, \mu_y, \sigma_x^2, \sigma_y^2\}$, where $\mu_x$ and $\mu_y$ are the means of the coordinates $x$ and $y$, and $\sigma_x^2$ and $\sigma_y^2$ are their variances. Finally, the classes of the start/end points of the trajectories are the entry/exit zones.
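As a sketch of this clustering step, the snippet below fits a Gaussian mixture to trajectory start/end points with scikit-learn's GaussianMixture, used here as a stand-in for the unsupervised EM clustering of [38]; the number of zones and the sample points are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_entry_exit_zones(endpoints, n_zones=4):
    """Cluster trajectory start/end points into entry/exit zones with an
    EM-fitted Gaussian mixture. Each learned zone is a Gaussian described
    by {mu_x, mu_y, sigma_x^2, sigma_y^2} (diagonal covariance)."""
    gmm = GaussianMixture(n_components=n_zones, covariance_type='diag')
    gmm.fit(np.asarray(endpoints, dtype=float))
    return list(zip(gmm.means_, gmm.covariances_))

# endpoints: (x, y) image coordinates of trajectory start and end points.
zones = learn_entry_exit_zones([(10, 200), (12, 210), (630, 40), (620, 55)], n_zones=2)
```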

    VI. Experiments

    A. Dataset and Test Method

1) Dataset: We use four challenging datasets to test our approach: TRECVID 2008 [39], the CAVIAR dataset [40], the ETH Mobile Platform dataset [41], and the Terrace sequence of the EPFL dataset [42], [43]. All of these datasets include many mutual occlusions in crowded scenes.

Many state-of-the-art human detectors can be adopted to detect human responses, such as [44]–[46]. For the TRECVID, ETH, and CAVIAR datasets, to compare fairly with state-of-the-art tracking approaches, we adopt the human detection results used in [25], since these detection results are also used in [1], [3], [23], [24], and [26]. For the Terrace sequence, we use the method of [44].

The TRECVID 2008 dataset consists of hundreds of hours of video from five fixed cameras covering different fields of view. We follow the data setting of [23], since that paper first used this dataset; its setting is also used in [1], [3], and [24]–[26]. The CAVIAR dataset is captured in a shopping center corridor by two fixed cameras from two different viewpoints and contains 26 video sequences. We follow the setting of [25], which selected the 20 videos using only the corridor view; this setting is also used in previous works, such as [1], [23], [24], and [26]. The ETH dataset is captured by a stereo pair of forward-looking moving cameras in a busy street scene. Due to the low position of the cameras, full occlusions also often happen in these videos. We follow the dataset setting proposed in [25] and its ground truth. In [25], only the left view of the videos was used, without stereo depth maps. The EPFL Terrace sequence uses four cameras with overlapping camera views. The objects appear for a very long time, and their appearance varies more often. Since our method only uses a monocular camera, we only use the data of camera 0 to test our approach. The details of these datasets are shown in Table I.

2) Test Method and Evaluation Metrics: We conduct four experiments to evaluate the effectiveness of our approach. The first experiment tests how the parameter value affects the tracking performance. The second experiment compares our approach with several state-of-the-art methods on the TRECVID, ETH, and CAVIAR datasets. The comparison methods include [1]–[3] and [23]–[26], since these methods are the latest results and reflect the best performance of tracking by tracklet association. The third experiment tests the robustness of our approach using the Terrace sequence of the EPFL dataset, since objects appear for hundreds or even thousands of frames and the appearance of an object varies more often in these sequences. The fourth experiment measures the computational cost of our approach.


    TABLE I

    Dataset

    TABLE II

    Discussion of Parameter Values

In our experiments, to compare fairly with state-of-the-art approaches, we adopt the metrics used in [23] to evaluate the tracking performance, since all of the comparison methods follow these metrics. The metrics in [23] are an improved version of the original metrics in [47]; in [23], track fragments and ID switches are defined more strictly but more clearly than in the original definitions of [47]. The metrics are as follows.

1) Ground truth (GT): The number of trajectories in the ground truth.
2) Mostly tracked trajectories (MT): The percentage of trajectories that are successfully tracked for more than 80%, divided by GT.
3) Partially tracked trajectories (PT): The percentage of trajectories that are tracked between 20% and 80%, divided by GT.
4) Mostly lost trajectories (ML): The percentage of trajectories that are tracked for less than 20%, divided by GT.
5) Fragments (Frag): The total number of times that a trajectory in GT is interrupted.
6) ID switches (IDS): The total number of times that a tracked trajectory changes its matched GT identity.

Since multiobject tracking can be viewed as a method that recovers missed detections and removes false alarms from the raw detection responses, metrics for detection evaluation are also provided.

1) Recall: The number of correctly matched detections divided by the total number of detections in the ground truth.
2) Precision: The number of correctly matched detections divided by the number of output detections.
3) False alarms per frame (FAF): The number of false alarms per frame.

Higher values indicate better performance for MT, recall, and precision; lower values indicate better performance for PT, ML, Frag, IDS, and FAF.

    B. Performance

In the first experiment, we test how the parameter value affects the tracking performance on the TRECVID, ETH, and CAVIAR datasets. We only use the tracking metrics MT, PT, ML, Frag, and IDS to evaluate this experiment, since the human detection metrics, such as Recall, Precision, and FAF, cannot reflect the changes of tracking performance directly. The results are shown in Table II, with the best results in bold face. With the increase of the parameter $\beta$, the tracking performance goes up. When $\beta = 100$, we obtain the best performance on the three datasets. If the parameter increases further, such as $\beta = 150$, the performance on the three datasets does not increase, and even decreases. This is because the parameter $\beta$ affects the search range of the tracklet association. With the increase of $\beta$, the value of the exponential distribution increases faster and faster. If $\beta$ is too large, a small distance will lead to a very large value of $\beta^{d(L_m, L_n)}$. This large value will dominate the appearance features and decrease the tracking performance. From this experiment, $\beta = 100$ is appropriate, and we use $\beta = 100$ in the subsequent experiments. For the other parameters, $\theta_1 = 0.1$ in Algorithm 1, and $\theta_2 = 0.3$ and $\tau = 5$ in Section V-A.

In the second experiment, our method is compared with state-of-the-art methods based on the complete metrics on three challenging datasets: TRECVID, CAVIAR, and ETH. We show the test results in Tables III–V. From the three tables, the metrics recall, precision, and FAF are very close to those of previous methods. These metrics indicate the misses and false alarms of the detections.


    TABLE III

    Comparison of Tracking Results on TRECVID Dataset

    TABLE IV

    Comparison of Tracking Results on CAVIAR Dataset. *The Number of Frag and IDS Followed the Metrics in [47]

    TABLE V

    Comparison of Tracking Results on ETH Dataset

Even though some of them outperform previous methods, the enhancement is very limited. The first reason is that we use the same detection results as previous methods. The second reason is that our tracklet building method is similar to previous approaches, which are based on the method in [1]. Furthermore, we do not propose new methods to recover detections in this paper. Our method focuses on the performance improvement of tracklet association for the tracking metrics based on the spatial–temporal appearance model. Therefore, if the performance of human detectors or the methods of detection recovery cannot be improved, the enhancement of human detection performance is limited when relying only on the refinement of tracking algorithms. The other metrics are analyzed as follows.

Table III shows the tracking results on the TRECVID dataset. Our performance outperforms the previous methods: the MT is higher, and the PT and ML are lower than those of previous methods. For the metrics Frag and IDS, our method produces far fewer than previous methods. In particular, though the methods of [25] and [26] had already reduced Frag and IDS considerably, our method still outperforms them. For the CAVIAR and ETH datasets, though the enhancement of performance is not as large as on the TRECVID dataset, we also obtain better performance than previous methods.

For the CAVIAR dataset in Table IV, the performance enhancements of our approach are in MT, PT, and IDS compared with previous methods. The Frag metric is very close to that of [26] and outperforms the other state-of-the-art methods. Some fragments (Frag) and ID switches (IDS) are corrected by our method, so we reduce the numbers of Frag and IDS. The reduction of Frag and IDS causes some partially tracked (PT) trajectories to become mostly tracked (MT) trajectories. For this reason, the MT metric is higher than that of previous methods, and the PT is lower.

Finally, the performance on the ETH dataset is shown in Table V. From the results, the performance enhancement of our method is limited. Due to the view of the forward-looking moving cameras, many people are difficult to detect, such as persons who appear too small to yield stable detection responses. For this reason, almost 40% of the trajectories are partially tracked (PT) or mostly lost (ML) in both our method and [25]. Without detection responses, it is difficult to track these PT trajectories to form mostly tracked (MT) trajectories. Nevertheless, on the ETH dataset, the performance of our approach is still superior to the method of [25], with higher MT and lower PT and Frag.

From the three tables, we have analyzed the enhancement of tracking performance in terms of the metrics. The essential reason for the enhancement, however, is the spatial–temporal appearance model. Previous methods usually extract low-level image features, such as color histograms and texture, to represent appearance. These features may not be reliable under partial occlusions, illumination changes, and human pose variation.


    Fig. 4. Snapshots of TRECVID dataset. (a) Frame 879. (b) Frame 906. (c) Frame 918. (d) Frame 951. (e) Frame 973.

    Fig. 5. Snapshots of CAVIAR dataset. (a) Frame 767. (b) Frame 854. (c) Frame 903. (d) Frame 958. (e) Frame 1049.

    Fig. 6. Snapshots of ETH dataset. (a) Frame 88. (b) Frame 97. (c) Frame 124. (d) Frame 130. (e) Frame 142.

    TABLE VI

    Performance on Terrace Dataset

These features are unreliable because they describe only the appearance within a single frame and do not exploit the appearance information of sequential frames to refine the feature. Our spatial–temporal appearance model solves this problem: it is built on a statistical distribution and includes dynamic spatial and temporal information of appearance. The dynamic spatial information is the dynamic number and layout of the appearance subregions; the temporal information is the duration of each subregion, which means that each subregion spans several frames. We use GMMs to implement this spatial–temporal appearance model, with each Gaussian distribution representing one subregion. Therefore, our appearance model not only provides stable appearance features based on a statistical distribution, but also provides the spatial and temporal information of appearance. Based on this model, we obtain better performance than state-of-the-art methods. Figs. 4–6 show some snapshots of our tracking results; colored arrows mark the occlusions of targets and the association results.
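As a rough illustration of this model (not the exact implementation: the paper uses online EM and its own feature design, whereas this sketch uses batch EM from scikit-learn, a simple [x, y, t, R, G, B] pixel feature, and a Monte Carlo JSD estimate, with all names chosen by us), a tracklet's appearance can be summarized by a GMM fitted to samples pooled over its frames, and two tracklets can then be compared by their JSD, which has no closed form for GMMs:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_appearance_gmm(samples, n_components=5, seed=0):
        # samples: (N, 6) array of [x, y, t, R, G, B] pixel features pooled over
        # all frames of one tracklet; each Gaussian plays the role of one
        # spatial-temporal subregion with its own duration in t.
        return GaussianMixture(n_components=n_components,
                               covariance_type='diag',
                               random_state=seed).fit(samples)

    def jsd_monte_carlo(gmm_p, gmm_q, n_samples=2000):
        # Monte Carlo estimate of the Jensen-Shannon divergence between the
        # two fitted mixtures (the JSD of two GMMs has no closed form).
        xp, _ = gmm_p.sample(n_samples)
        xq, _ = gmm_q.sample(n_samples)

        def kl_to_mixture(x, gmm_a, gmm_b):
            log_a = gmm_a.score_samples(x)
            log_b = gmm_b.score_samples(x)
            log_m = np.logaddexp(log_a, log_b) - np.log(2.0)  # log density of M = (A + B) / 2
            return np.mean(log_a - log_m)                      # KL(A || M) estimated under A

        return 0.5 * (kl_to_mixture(xp, gmm_p, gmm_q) + kl_to_mixture(xq, gmm_q, gmm_p))

    # Usage with random stand-in data for two tracklets:
    rng = np.random.RandomState(0)
    model_a = fit_appearance_gmm(rng.rand(500, 6))
    model_b = fit_appearance_gmm(rng.rand(500, 6) + 0.5)
    print(jsd_monte_carlo(model_a, model_b))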

In the third experiment, we test the robustness of our approach on the Terrace sequence. The objects appear for hundreds or even thousands of frames and encounter several or even dozens of occlusions. The appearance of each object also varies more frequently, so the tracklet association becomes more complicated. Since our approach is a monocular tracking method that uses the videos of camera zero, the results cannot be compared with [42] and [43], which used multicamera video sequences. Our monocular results and snapshots are shown in Table VI and Fig. 7.

From the results, our method can track most of the trajectories correctly using a monocular camera. Some fragments and ID switches still exist, especially when the scene is crowded, with up to nine persons present at the same time. In such cases the human detector often misses detection responses and even gives false positive responses, so some reliable tracklets cannot be built successfully by the conservative method.
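For context, the conservative tracklet building mentioned above, which follows [1], links two detections in adjacent frames only when their affinity is both high and unambiguously better than all competing pairs; the sketch below is our own simplified rendering of that dual-threshold idea, with illustrative thresholds rather than the values used in the paper:

    # Simplified dual-threshold (conservative) linking: detection i in frame t is
    # linked to detection j in frame t+1 only if their affinity is high enough and
    # clearly exceeds every competing affinity in the same row and column.
    def conservative_links(affinity, high=0.8, margin=0.2):
        links = []
        n_rows = len(affinity)
        n_cols = len(affinity[0]) if n_rows else 0
        for i in range(n_rows):
            for j in range(n_cols):
                a = affinity[i][j]
                competitors = [affinity[i][k] for k in range(n_cols) if k != j] + \
                              [affinity[k][j] for k in range(n_rows) if k != i]
                best_other = max(competitors, default=0.0)
                if a >= high and a - best_other >= margin:
                    links.append((i, j))
        return links

    # Two detections per frame; only the unambiguous pair (0, 0) is linked,
    # while the ambiguous second pair is left for later tracklet association.
    print(conservative_links([[0.9, 0.3], [0.4, 0.55]]))  # -> [(0, 0)]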

In the fourth experiment, the computational cost of our approach is evaluated on the four datasets using an Intel Core i7 quad-core 2.0 GHz CPU with 4 GB of memory. The result is shown in Table VII. It covers only the cost of tracking and does not include the cost of human detection. The results list the total number of tracklets we build, the number of final tracks, and the average number of frames processed per second.


Fig. 7. Snapshots of EPFL Terrace dataset. (a) Frame 1590. (b) Frame 1629. (c) Frame 1663. (d) Frame 1693. (e) Frame 1716. (f) Frame 1760. (g) Frame 1854.

    TABLE VII

    Computational Cost of Our Approach on Each Dataset

For the CAVIAR and Terrace datasets, the frame rate is acceptable. For the TRECVID and ETH datasets, the frame rate is low. The main reason is that the frame resolution of TRECVID and ETH is four times that of CAVIAR and Terrace; in addition, these sequences contain more targets, so more data must be processed.
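The reported frame rates can be reproduced with a straightforward timing loop like the following generic sketch (tracker_step is a stand-in for the tracking routine and is not part of the paper's code):

    import time

    def average_fps(frames, tracker_step):
        # Time only the tracking computation (human detection is excluded) and
        # report the number of frames processed per second of wall-clock time.
        start = time.perf_counter()
        for frame in frames:
            tracker_step(frame)
        elapsed = time.perf_counter() - start
        return len(frames) / elapsed if elapsed > 0 else float('inf')

    # Example with a dummy per-frame tracking step:
    print(average_fps(list(range(100)), lambda frame: sum(i * i for i in range(1000))))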

    VII. Conclusion

In this paper, we propose a novel appearance representation called the spatial–temporal appearance model, which is based on the distribution of a GMM. We use this model to represent the appearance of a tracklet as a whole, with dynamic subregions and a dynamic duration for each subregion. Furthermore, since the spatial–temporal appearance model is built on the statistical distribution of the GMM, it mitigates the effects of illumination, pose variation, and image noise and yields stable appearance features. Finally, we associate tracklets using Bayesian prediction and JSD to obtain the final tracking results. Our approach is tested on four challenging datasets, and the experimental results show that it achieves good performance.

    References

[1] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses," in Proc. Eur. Conf. Comput. Vision, Oct. 2008, pp. 788–801.
[2] J. Xing, H. Ai, and S. Lao, "Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 1200–1207.
[3] B. Yang, C. Huang, and R. Nevatia, "Learning affinities and dependencies for multi-target tracking using a CRF model," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1233–1240.
[4] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 798–805.
[5] H. Wang, D. Suter, K. Schindler, and C. Shen, "Adaptive object tracking based on an effective appearance filter," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1661–1667, Sep. 2007.
[6] D. Reid, "An algorithm for tracking multiple targets," IEEE Trans. Autom. Control, vol. 24, no. 6, pp. 843–854, Dec. 1979.
[7] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe, "Sonar tracking of multiple targets using joint probabilistic data association," IEEE J. Ocean. Eng., vol. 8, no. 3, pp. 173–184, Jul. 1983.
[8] M. Isard and A. Blake, "Condensation-conditional density propagation for visual tracking," Int. J. Comput. Vision, vol. 29, no. 1, pp. 5–28, 1998.
[9] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe, "A boosted particle filter: Multitarget detection and tracking," in Proc. Eur. Conf. Comput. Vision, 2004, pp. 28–39.
[10] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, "Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2007, pp. 1–8.
[11] M. D. Breitenstein, F. Reichlin, B. Leibe, E. K. Meier, and L. V. Gool, "Robust tracking-by-detection using a detector confidence particle filter," in Proc. IEEE Int. Conf. Comput. Vision, Sep. 2009, pp. 1515–1522.
[12] M. Yang, F. Lv, W. Xu, and Y. Gong, "Detection driven adaptive multi-cue integration for multiple human tracking," in Proc. IEEE Int. Conf. Comput. Vision, Sep. 2009, pp. 1554–1561.
[13] C. Shan, T. Tan, and Y. Wei, "Real-time hand tracking using a mean shift embedded particle filter," Pattern Recognit., vol. 40, no. 7, pp. 1958–1970, 2007.
[14] Y. Cai, N. de Freitas, and J. J. Little, "Robust visual tracking for multiple targets," in Proc. Eur. Conf. Comput. Vision, 2006, pp. 107–118.
[15] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[16] Z. H. Khan, I. Y. Gu, and A. G. Backhouse, "Robust visual object tracking using multi-mode anisotropic mean shift and particle filters," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 1, pp. 74–87, Jan. 2011.


[17] J. F. Henriques, R. Caseiro, and J. Batista, "Globally optimal solution to multi-object tracking with merged measurements," in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 2470–2477.
[18] Z. Wu, T. H. Kunz, and M. Betke, "Efficient track linking methods for track graphs using network-flow and set-cover techniques," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1185–1192.
[19] W. Brendel, M. Amer, and S. Todorovic, "Multiobject tracking as maximum weight independent set," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1273–1280.
[20] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[21] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, "Part-based multiple-person tracking with partial occlusion handling," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1815–1821.
[22] H. Izadinia, I. Saleemi, W. Li, and M. Shah, "(MP)2T: Multiple people multiple parts tracker," in Proc. Eur. Conf. Comput. Vision, Oct. 2012, pp. 100–114.
[23] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: Hybrid-boosted multi-target tracker for crowded scene," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 2953–2960.
[24] C. Kuo, C. Huang, and R. Nevatia, "Multi-target tracking by on-line learned discriminative appearance models," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2010, pp. 685–692.
[25] C. Kuo and R. Nevatia, "How does person identity recognition help multi-person tracking?" in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1217–1224.
[26] B. Yang and R. Nevatia, "Multi-target tracking by online learning of non-linear motion patterns and robust appearance models," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1918–1925.
[27] R. E. Kalman, "A new approach to linear filtering and prediction problems," Trans. ASME J. Basic Eng., vol. 82, no. 1, pp. 35–45, 1960.
[28] S. Kwak, W. Nam, B. Han, and J. H. Han, "Learning occlusion with likelihoods for visual tracking," in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 1551–1558.
[29] J. Fan, X. Shen, and Y. Wu, "Scribble tracker: A matting-based approach for robust tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1633–1644, Aug. 2012.
[30] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[31] J. Fan, Y. Wu, and S. Dai, "Discriminative spatial attention for robust tracking," in Proc. Eur. Conf. Comput. Vision, Sep. 2010, pp. 480–493.
[32] S. T. Birchfield and S. Rangarajan, "Spatiograms versus histograms for region-based tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, Jun. 2005, pp. 1158–1163.
[33] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inf. Theory, vol. 37, no. 1, pp. 145–151, Jan. 1991.
[34] B. Yang and R. Nevatia, "Online learned discriminative part-based appearance models for multi-human tracking," in Proc. Eur. Conf. Comput. Vision, Oct. 2012, pp. 484–498.
[35] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logistics Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
[36] S. Kullback and R. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[37] A. Ulges, C. Lampert, D. Keysers, and T. Breuel, "Spatiogram based shot distances for video retrieval," in Proc. Online Text Retrieval Conf. Video Retrieval Eval., 2006, pp. 1–10.
[38] Y. Shen, Z. Miao, and J. Zhang, "Unsupervised online learning trajectory analysis based on weighted directed graph," in Proc. IEEE Int. Conf. Pattern Recognit., Nov. 2012, pp. 1306–1309.
[39] National Institute of Standards and Technology, TRECVID 2008 Evaluation for Surveillance Event Detection [Online]. Available: http://www.nist.gov/speech/tests/trecvid/2008/
[40] CAVIAR Dataset, EC Funded CAVIAR Project/IST 2001 37540, Jan. 2004 [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
[41] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, "A mobile vision system for robust multi-person tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[42] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multi-camera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 267–282, Feb. 2008.
[43] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806–1819, Sep. 2011.
[44] Q. Zhu, S. Avidan, M. C. Yeh, and K. T. Cheng, "Fast human detection using a cascade of histograms of oriented gradients," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 1491–1498.
[45] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2009, pp. 32–39.
[46] C. Huang and R. Nevatia, "High performance object detection by collaborative learning of joint ranking of granules features," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2010, pp. 41–48.
[47] B. Wu and R. Nevatia, "Tracking of multiple, partially occluded humans based on static body part detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 951–959.

Yuan Shen received the B.E. and M.E. degrees in 2008 and 2010, respectively, from Beijing Jiaotong University, Beijing, China, where he is currently working toward the Ph.D. degree.

His research interests include pattern recognition, machine learning, video surveillance, multiobject tracking, and trajectory analysis.

Zhenjiang Miao (M'11) received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, in 1990 and 1994, respectively.

From 1995 to 1998, he was a Post-Doctoral Fellow with the École Nationale Supérieure d'Électrotechnique, d'Électronique, d'Informatique, d'Hydraulique et des Télécommunications, Institut National Polytechnique de Toulouse, Toulouse, France, and a Researcher with the Institut National de la Recherche Agronomique, Sophia Antipolis, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, and Nortel Networks, Ottawa, ON, Canada. He joined Beijing Jiaotong University, Beijing, in 2004, where he is currently a Professor. His research interests include image and video processing, multimedia processing, and intelligent human–machine interaction.