IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 3, MARCH 2014
Multihuman Tracking Based on a Spatial-Temporal Appearance Match
Yuan Shen and Zhenjiang Miao, Member, IEEE
Abstract—In this paper, we focus on improving the appearance representation for multihuman tracking. Many previous methods extracted low-level appearance features, such as color histogram and texture, sometimes combined with spatial information, for each frame independently. These methods ignore the temporal distribution of features, and the features of a single frame may be unstable due to illumination, human pose variation, and image noise. In order to improve this, we propose a novel appearance representation called the spatial-temporal appearance model, based on the statistical distribution of a Gaussian mixture model (GMM). It represents the appearance of a tracklet as a whole, with dynamic spatial and temporal information: the spatial information is the set of dynamic subregions, and the temporal information is the dynamic duration time of each subregion. Each subregion is modeled as a weighted Gaussian distribution of the GMM, and the online expectation-maximization (online EM) algorithm is used to estimate the parameters of the GMM. Then, we propose a tracklet association method using Bayesian prediction and the Jensen-Shannon divergence: the Bayesian prediction is used to predict the locations of targets, and the Jensen-Shannon divergence is used to compute the distance between the spatial-temporal appearance distributions of two tracklets. Finally, we test our approach on four challenging datasets (TRECVID, CAVIAR, ETH, and EPFL Terrace) and achieve good results.
Index Terms—Jensen-Shannon divergence, multihuman tracking, online EM, spatial-temporal appearance.
I. Introduction
MULTIHUMAN tracking in complex environments has
become more and more important in the field of com-
puter vision research. It has many applications, such as video-
based surveillance and human-computer interaction. Its aim is
to locate targets, retrieve their trajectories, and maintain their
identities through a video sequence. The main challenging
problem is the frequent occlusions of targets in crowded
scenes.
Manuscript received November 7, 2012; revised February 23, 2013, May 26, 2013, and July 12, 2013; accepted August 2, 2013. Date of publication August 29, 2013; date of current version March 4, 2014. This work was supported in part by NSFC 61273274 and NSFB 4123104, in part by the 973 Program 2011CB302203, in part by the National Key Technology Research and Development Program of China under Grant 2012BAH01F03, in part by the Ph.D. Programs Foundation of Ministry of Education of China under Grant 20100009110004, and in part by the Tsinghua-Tencent Joint Laboratory for IIT. This paper was recommended by Associate Editor C. Shan.
The authors are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2013.2280073
In order to solve the challenging problem of human tracking,
the classical tracking methods mainly follow the framework
based on particle filter. However, it is difficult to track targets
with long-time full occlusions in crowded scenes since there
are no observations to guide the trackers. In recent years, due
to the improvements of human detection performance, tracking
by global association has become more and more popular.
This scheme has a general framework to track targets. It links
detection responses in consecutive frames to build tracklets,
which are short tracks for further analysis. Then, an association
algorithm is used to associate tracklets for final tracking results. Considering the information from future frames, some
detection errors, such as missed detections and false alarms,
can be corrected, and the long time full occlusions can also
be solved.
Most of the global association methods fuse several features
as the affinity measurement [1]-[3], such as appearance, mo-
tion, position, and size. They usually use filter-based methods
to extract motion, position, and size, but still use low-level
image features, such as color histogram and texture, to represent
the entire human appearance. Some appearance features with
spatial information have been proposed to improve on low-level
image features [4], [5]. Though spatial information can help
under partial occlusions, the state-of-the-art methods ignore an important case: full occlusions.
The existing methods always use the latest appearance model
learned by online update to track targets, and discard earlier
appearance gradually. When a target is fully occluded for a
long time and then reappears, it is difficult to estimate whether
its appearance is more similar to the latest or to an earlier
appearance model. If the target's appearance is more similar to
an earlier model, a tracker that uses only the latest appearance
model to measure the similarity before and after the occlusion
will drift. A simple way is to collect the
latest and earlier appearance features of a target into a set.
With the increase of frame steps, there would be a lot of
feature samples. It is time consuming to search the most
similar appearance features in such a large sample space.
In order to record the latest and earlier appearance features,
and maintain the spatial and temporal information of these
appearance features including their spatial layout and temporal
order, we need to explore a new appearance representation.
Based on the above motivation, in this paper, we propose a
new appearance model, called spatial-temporal appearance, in
the field of multihuman tracking, and use this new appearance
model to track multiple targets. We still use the tracking framework
of global association. For each tracklet, we perform an auto-
clustering method based on the online expectation-maximization
(online EM) algorithm to cluster the appearance features in
space and time. We adopt RGB color space to represent
human appearance. Pixels with similar colors are clustered
into the same class in space and time. We apply the Gaussian
mixture model (GMM) to represent the classes of the tracklet.
Each class is a subregion of appearance, modeled by a
Gaussian distribution that represents its color distribution, dynamic spatial layout, and time duration. The
appearance of each tracklet is represented as a spatial-temporal
statistical distribution. Based on this distribution of appearance,
we can obtain stable appearance features that are robust to
pose variation, illumination variation, image noise,
and so on.
To the best of our knowledge, the main contribution
in this paper is the novel appearance representation called
spatial-temporal appearance. It not only records the dynamic
spatial layout of appearance, but also maintains the dynamic
duration time of each subregion of appearance including the
latest and the earlier frames of the whole tracklet. This
appearance model can provide more information than that of
previous methods for tracklet association. In order to associate
tracklets for final tracking results using this spatial-temporal
appearance model, we propose a tracklet association method
using Bayesian prediction with a fuzzy search range, and use
the Jensen-Shannon divergence to compute the similarity of
spatial-temporal appearance.
The rest of the paper is organized as follows. Related work
is discussed in Section II. The overview of our approach is
given in Section III. The spatial-temporal appearance and
tracklet association are presented in Section IV. Section V
shows some implementation details. The experimental results
and discussions are shown in Section VI. Some conclusions
are given in Section VII.
II. Related Work
Object tracking has been an active research field in computer vision
for many years. Many methods have been proposed. The early
works are multihypothesis tracking (MHT) [6] and joint proba-
bilistic data association filters (JPDAF) [7]. MHT enumerated
all possible hypotheses of the target and selected the most
likely hypothesis as its optimal solution. As the number of
targets and time steps increases, the original MHT method
encounters rapidly growing computational cost for maintaining
hypotheses. The JPDAF method maintained a joint probability
among tracking targets in each frame. When new targets enter
the field of camera view or old targets leave the view, the joint
probability needs to be recomputed.
In recent years, the particle filter [8] is a widely used
framework due to its robust performance. Many improved
methods have been proposed. Some methods aim at the
combination of particle filter and detection results. Okuma
et al. [9] combined particle filter with the Ada-boost detection
results to track an unknown number of objects. Li et al. [10]
used multiple detectors to form a cascade particle filter to
enhance the computational speed. The order in which the
detectors were applied was determined based on their com-
putational costs: the faster, the earlier. Breitenstein et al. [11]
proposed the continuous confidence of pedestrian detectors,
and used it as a graded observation model to guide particle
filter trackers. Yang et al. [12] used detection responses to
update trackers and extracted multicue features to track targets
including color model, elliptical head model, and bags of
local features. Other methods focus on the improvements
of sampling efficiency. Shan et al. [13] and Cai et al. [14] embedded the mean shift [15] algorithm into particle filter to
improve sampling efficiency of particles to track hands and
multiple persons, respectively. Khan et al. [16] improved the
mean shift embedded method by using multimode anisotropic
mean shift. The particle filter-based tracking methods are
suitable for online applications since their results are only
based on the past frames. These methods do not consider the
information of future frames. When targets are fully occluded
for a long time, these approaches may yield identity switches
or trajectory fragments since there are no observations to guide
the trackers.
In contrast to these methods, which only consider the past
information, many global data association approaches have
been proposed. Global data association considers not only past
frames but also future frames. These methods track targets and
deal with occlusions by finding the best matches before and
after occlusions. They build tracklets based on the detection
responses in consecutive frames and perform association al-
gorithms on these tracklets for final tracking results. Some
researchers call them tracking by tracklet association.
Huang et al. [1] presented a hierarchical association ap-
proach. They built reliable tracklets based on object position,
size, and color histogram of appearance, and used Hungar-
ian algorithm to associate tracklets based on these features.
Finally, they built an entry and exit map to specify the initial-
ization/termination of each tracklet in the scene to enhance
the performance of data association. Xing et al. [2] used
particle filter to refine tracklets and used Hungarian algorithm
to associate tracklets based on color histogram of appearance,
size, and motion of targets. Henriques et al. [17] added merges
and splits measurement of targets to improve the Hungarian
association. Wu et al. [18] used network flow to associate
tracklets only based on motion features. Brendel et al. [19]
formulated the network flow-based tracklet association as
the maximum weight independent set problem, and applied
linear programming to solve it. The methods in [20], [21],
and [22] detected body-parts in tracklets to extract local
appearance features and applied Viterbi, greedy algorithm, and network flow to associate tracklets, respectively. Some
researchers applied machine learning algorithm to solve the
tracking problem. Li et al. [23] extracted multiple features to
build a feature pool, including color histogram, tracklet length,
motion and so on, and presented a HybridBoost algorithm to
learn the affinity models between two tracklets. The method
in [3] added pairwise features to improve the feature pool
of [23] and presented CRF-based tracklet affinity models.
Kuo et al. [24], [25] learned an Ada-boost appearance model
to distinguish targets. The tracklet association of [24] was
based on the Ada-boost appearance model. The method of
[25] improved the association of [24] by adding motion and
time-gap features, and the association was solved by the Hungarian algorithm.
Yang et al. [26] improved the Ada-boost appearance model of
[24] and [25] using multiple instance learning and proposed
nonlinear motion pattern-based tracklet association, which was
solved by Hungarian algorithm.
All of these tracklet association-based methods mainly
focus on the performance improvements using different as-
sociation algorithms. They neglect the importance of feature representation, especially in appearance features. They always
use filter-based methods, such as the Kalman filter [27],
to extract features of motion, position, etc. For appearance
features, they only extract low-level features from the entire
human, such as color histogram and texture. These appearance
features may not work well in the case of partial occlusions,
illumination variations and so on.
In order to improve appearance representation, some new
methods are proposed by combining spatial information. The
methods in [4], [28], and [29] divided the entire tracking
region of each frame into a fixed number of subregions. The
tracking of the entire object is converted into estimating the
similarity of each subregion. Kalal et al. [30] built an ensemble
classifier to track targets based on the P-N learning method.
Pixel comparisons were generated offline at random and
stayed fixed at runtime. These pixel comparisons recorded the
pixel locations and feature distance. Besides the fixed spatial
information, some methods with dynamic spatial information
are proposed. Fan et al. [31] proposed a dynamic subregion
method, called attentional regions (ARs), to track
targets. Local ARs were searched based on gradient and
identified based on branch-and-bound procedure to determine
the target location. Low similarity ARs would be removed
and replaced by new ARs. Birchfield et al. [32] proposed a
spatiogram appearance model to track targets based on mean
shift. The spatiogram contained the spatial means and covari-
ances for each color histogram bin. Wang et al. [5] proposed a
spatial-color mixture of Gaussians (SMOG) appearance model
for particle filters. The appearance model was represented as a
fixed number of Gaussians in runtime. Spatial information and
color distribution were computed for each model of SMOG.
Most of the proposed methods use a fixed spatial
layout [4], [28]-[30]. It cannot satisfy the dynamic
requirements especially in nonrigid targets, such as pose
changes. Though some methods with dynamic spatial infor-
mation [5], [31], [32] improve the fixed spatial layout, these
methods still have their own weaknesses. They always use
the latest appearance data of objects to update the appearance model. With the increase of frame steps, the model will
forget the earlier appearance gradually. When a target is fully
occluded for a long time, the appearance models of these
methods stop updating during occlusions since there are no
observations to update appearance models. Due to the complexity
of real scenes, the target appearance may have some variations
when it reappears after occlusions. This may be caused by
illumination variations, or even by changes of camera perspective
due to target movement. In this case, it is difficult for these
methods to estimate the appearance similarity of the target
using the latest appearance model. If the target appearance is
more similar to the earlier appearance model, the similarity
measurement of the latest model would fail.
Our spatial-temporal appearance model will improve the
appearance representations, which are mentioned above, and
promote the tracking performance. Our appearance model not
only provides the dynamic spatial layout of appearance of
each target including the dynamic number and locations of
subregions, but also provides the dynamic duration time of
each subregion. The temporal distribution of appearance not only records the latest appearance model, but also records
the earlier appearance model. We can dynamically select the
most similar appearance model to associate tracklets for final
tracking results.
Our association method uses the Bayesian motion prediction
with fuzzy search range to guide the appearance association
of tracklets. Compared with MHT, our method only predicts
the most likely motion direction instead of all possible paths,
which would require more computational cost. To compensate
for the imprecise predicted position, we add a fuzzy search
strategy.
III. Overview of Our Approach
We adopt the framework of tracking by tracklet association
in our approach. This framework can be mainly formalized as
(1). L is the set of association results L = {l_1, ..., l_N}; each
element of L represents whether two tracklets are associated. S is
the set of tracklets S = {TR_1, ..., TR_M}, and f(·) is a cost
function for associating tracklets. Equation (1) states that the
association results are the best matches of tracklets subject to
the nonoverlap restriction, which means that two tracklets cannot
have overlapping duration time and two association results cannot
share the same tracklet:

$$L^{*} = \arg\max_{L} f(L \mid S) \quad \text{subject to} \quad TR_i \cap TR_j = \emptyset,\ \forall i, j \le M; \qquad l_p \cap l_q = \emptyset,\ \forall p, q \le N. \tag{1}$$
In previous methods, such as [1]-[3] and [23]-[26], each
tracklet TR_i is represented by a sequence of detection responses
TR_i = {r_i^{t_s}, ..., r_i^{t_e}}, where t_s and t_e denote the start
and end frames of tracklet TR_i. The features of each tracklet
are extracted from each detection response independently. For
example, each response in frame t is represented as
r_i^t = {a_i^t, p_i^t, s_i^t, v_i^t}, where a_i^t is the appearance, p_i^t is
the position, s_i^t is the size, and v_i^t is the velocity.
In this paper, we improve the appearance feature extraction
by building a spatial-temporal appearance model for each
tracklet, instead of extracting appearance from each response
in each frame independently. Therefore, in our method, the
complete representation of TR_i is TR_i = {G_i, {r_i^{t_s}, ..., r_i^{t_e}}},
where G_i is the spatial-temporal appearance model of the
whole tracklet TR_i. Then, G_i is used to associate tracklets.
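To make this representation concrete, the following minimal Python sketch shows one way to organize the per-frame responses and the tracklet-level appearance model. The class and field names are our own illustration, not the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Response:
    """One detection response r_i^t in frame t."""
    frame: int
    position: Tuple[float, float]   # p_i^t: center (x, y) in image coordinates
    size: Tuple[float, float]       # s_i^t: (width, height) of the bounding box
    velocity: Tuple[float, float]   # v_i^t: displacement per frame

@dataclass
class Gaussian:
    """One subregion N_k = {omega_k, mu_k, Sigma_k, T_k} of the GMM."""
    weight: float                   # omega_k
    mean: List[float]               # mu_k: 5-dim (x, y, R, G, B), normalized
    var: List[float]                # diagonal of Sigma_k
    t_start: int                    # first frame where the subregion appears
    t_end: int                      # last frame where the subregion appears

@dataclass
class Tracklet:
    """TR_i = {G_i, {r_i^{t_s}, ..., r_i^{t_e}}}."""
    responses: List[Response] = field(default_factory=list)
    gmm: List[Gaussian] = field(default_factory=list)  # G_i, grown by online EM
```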
The flow chart of our approach is shown in Fig. 1. First,
reliable tracklets are built from the detection responses. Then,
there are two main components. One is the spatial-temporal
appearance model, which is built from each tracklet. We extract
the spatial-temporal appearance based on color features. Similar
colors are clustered in space and time based on the online EM
Fig. 1. Overview of our approach. In the framework of tracking by tracklet association, a new appearance representation called the spatial-temporal appearance model is proposed and used to associate tracklets.
algorithm. The classes of the tracklet are formulated as a
GMM. Each class corresponds to a Gaussian distribution.
The appearance of each tracklet is modeled as a whole
rather than independently for each frame. The other is the
tracklet association. We predict the locations of targets in the
current frame to obtain the motion cost between two tracklets,
and apply a Gaussian selection algorithm to the GMMs of the
spatial-temporal appearance to find the GMM subsets with
the minimal appearance distance between two tracklets. The
Jensen-Shannon divergence (JSD) is used to compute the appearance cost between two tracklets. The spatial-temporal appearance cost and the motion cost are used to associate tracklets for
final tracking results. The spatial-temporal appearance model
and the tracklet association are described in Sections IV-A
and IV-B, respectively.
IV. Spatial-Temporal Appearance and Tracklet Association
A. Spatial-Temporal Appearance
Given the set of reliable tracklets, we start to extract
the spatial-temporal appearance for each tracklet. In order to
achieve this goal, we apply a GMM to this task. Each Gaussian
distribution represents the spatial-temporal color distribution of
pixels which belong to the same class. Each Gaussian distri-
bution corresponds to a class. The weight of each Gaussian
model represents whether the Gaussian distribution is impor-
tant or not: the greater the weight, the higher the importance. Here, we
apply the online EM algorithm to estimate the parameters of
GMM rather than offline EM since the association of tracklets
proceeds from previous frames up to the current frame. With
the increase of frame steps, tracklets which are not terminated
in the current frame may grow in the future frames. The online
EM can satisfy this case when the data is being added for
each future frame to update the parameters of GMM. If we
use offline EM to estimate the parameters of GMM, with the addition of new data from future frames, the parameters of GMM
must be recomputed completely using the new sample space
including old samples and new samples. This would waste lots
of time due to recomputing the distribution of all samples. That
means tracklets cannot compute the appearance distribution
using offline EM until they are terminated completely before
the current frame. For this reason, the online EM is more
appropriate. It can achieve reasonable statistical results with
the new data being added in the current frame. In order to
show how we use online EM to estimate the parameters of
GMM in our paper, we build Algorithm 1.
Algorithm 1: Learning the GMM appearance model

Input: tracklet TR = {r_t}; initialize t at the beginning of TR; K = 0
 1: repeat
 2:   for each pixel x_{r_t}^i of r_t do
 3:     if K = 0 then
 4:       initialize a new Gaussian distribution; K ← K + 1
 5:     end
 6:     E-step: for k = 1 to K do
 7:       ρ_ik = N(x_{r_t}^i | μ_k, Σ_k)
 8:     end
 9:     if max_k(ρ_ik) < θ_1 then
10:       initialize a new Gaussian distribution; K ← K + 1; break
11:     end
12:     M-step: for k = 1 to K do
13:       α_ik = ρ_ik / sum_{m=1..K} ρ_im;  C_k ← C_k + α_ik
14:       μ_k^new = (1 − α_ik/C_k) μ_k^old + (α_ik/C_k) x_{r_t}^i
15:       Σ_k^new = (1 − α_ik/C_k) Σ_k^old + (α_ik/C_k)(x_{r_t}^i − μ_k^new)(x_{r_t}^i − μ_k^new)^T
16:       ω_k = C_k / sum_{m=1..K} C_m
17:     end
18:   end
19:   for a = 1 to K do
20:     for b = 1 to K do
21:       if JSD(N_a || N_b) < ε then
22:         μ_ab = (C_a/(C_a + C_b)) μ_a + (C_b/(C_a + C_b)) μ_b
23:         Σ_ab = (C_a/(C_a + C_b)) Σ_a + (C_b/(C_a + C_b)) Σ_b
24:         ω_ab = (C_a + C_b) / sum_{k=1..K} C_k
25:         T_ab = max(T_a^end, T_b^end) − min(T_a^start, T_b^start)
26:       end
27:     end
28:   end
29:   compute the new end frame T_k^end for each Gaussian N_k
30:   T_k = T_k^end − T_k^start
31:   t ← t + 1
32: until r_t is the tail of TR

Output: GMM appearance distribution with K Gaussian distributions
First, we need to define some important variables before
presenting the algorithm for building the spatial-temporal appearance.
Suppose G is the GMM appearance distribution for
a tracklet TR, and N_k is one of the Gaussian distributions in
the GMM: G = {N_k}, where k is the index of each Gaussian
distribution. For each Gaussian distribution N_k, we define
N_k = {ω_k, μ_k, Σ_k, T_k}, where ω_k is the weight of the Gaussian
distribution N_k, μ_k is the mean, and Σ_k is the covariance matrix.
Finally, T_k is the duration time of the Gaussian distribution
N_k. The mean μ_k and covariance Σ_k each comprise five
parameters: positions x and y, normalized by the width and
height of the human detection bounding box, and color channels
R, G, and B, normalized by the value range of the RGB color
space. They are independent distributions, as shown in (2),
where diag{·} denotes a diagonal matrix:

$$\mu_k = \{\mu_{kx}, \mu_{ky}, \mu_{kR}, \mu_{kG}, \mu_{kB}\}^{T}, \qquad \Sigma_k = \mathrm{diag}\{\sigma_{kx}^{2}, \sigma_{ky}^{2}, \sigma_{kR}^{2}, \sigma_{kG}^{2}, \sigma_{kB}^{2}\}. \tag{2}$$
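As a concrete illustration of this five-dimensional feature, the sketch below builds the normalized (x, y, R, G, B) vector for every pixel of a detection response; it is a minimal example under our own naming, assuming 8-bit RGB crops.

```python
import numpy as np

def pixel_features(patch: np.ndarray) -> np.ndarray:
    """Build the 5-dim feature x = (x, y, R, G, B) for each pixel of a
    detection response, as defined in (2).

    patch: H x W x 3 uint8 RGB crop of the detection bounding box.
    Returns an (H*W) x 5 array; positions are normalized by the box
    width/height, colors by the 8-bit value range.
    """
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.column_stack([
        xs.ravel() / w,                 # x, relative to the bounding box
        ys.ravel() / h,                 # y, relative to the bounding box
        patch[..., 0].ravel() / 255.0,  # R
        patch[..., 1].ravel() / 255.0,  # G
        patch[..., 2].ravel() / 255.0,  # B
    ])
```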
Here, we must further explain the definition of these variables. For
the position distribution, we use {μ_kx, μ_ky, σ_kx, σ_ky} to model
it. Since the location of the same class shifts somewhat across
the frames it spans, it is difficult to describe it with a fixed
position range. For this reason, we use a Gaussian distribution
to model the position of the class. In addition, the position of
each class is the relative position within the detection bounding
box, rather than the absolute position in the image.
The duration time of each Gaussian is used to constrain
whether Gaussians can form a subset for computing the
similarity with another subset: only Gaussians with overlapping
duration time can form a subset. We do not use a Gaussian
distribution to model the duration, since the variance cannot
reflect the real duration time of each Gaussian. For example,
suppose the duration time of Gaussian a is from frame 1 to 10,
and the duration time of another Gaussian b is from frame 6 to
30; these two Gaussians have an overlapping time of five frames.
If we used a Gaussian to model the time dimension, the time
distribution of Gaussian a would be μ_at = 5.5, σ_at = 2.87, and
that of Gaussian b would be μ_bt = 18, σ_bt = 7.2. It would be
difficult to estimate whether two Gaussians overlap using such
time variances.
The algorithm for building the spatial-temporal appearance is
shown in Algorithm 1. The input of this algorithm is a
tracklet TR, where r_t is the detection response of TR in frame t.
We initialize the frame index t at the beginning of tracklet TR,
and the number of Gaussian distributions K is initialized to 0
before the algorithm runs. For each tracklet TR, we compute the
parameters of the GMM distribution using the online EM algorithm,
based on the detection response of each frame, until the tail
of tracklet TR is reached. For each detection response r_t of
tracklet TR, we compute the similarity of each pixel of response
r_t with each Gaussian distribution and select the maximal
similarity. If the maximal similarity is less than a threshold θ_1,
we initialize a new Gaussian distribution in the GMM. Otherwise,
all Gaussian distributions are updated by the online EM
algorithm: lines 6 to 11 are the E-step, and lines 12 to 17 are
the M-step. In line 13, the similarity ρ_ik of each Gaussian
distribution is normalized to α_ik and accumulated into C_k to
form the final update factor α_ik/C_k. The component α_ik of this
update factor updates each Gaussian in proportion to its estimated
posterior probability in each frame. The component C_k guarantees
the stability of the GMM parameters as new samples are added,
owing to the accumulation of a large number of samples. After
the online EM step in each frame, we check whether two Gaussian
distributions should be merged. Here, we use the JSD [33] to
estimate the distance between two Gaussian distributions. The
Fig. 2. Illustration of the spatial-temporal appearance model. Each ellipsoid represents a Gaussian distribution of the appearance model and its duration time.
definition of the JSD distance will be given in (6) and (7)
of Section IV-B. When the distance between two Gaussian
distributions a and b is less than ε, which is a very small number,
we merge them in proportion to the sample counts C_a and C_b,
as shown in lines 22 to 24; the duration times of Gaussian
distributions a and b are likewise merged in line 25.
Finally, in lines 29 and 30, we update the duration time of
each Gaussian distribution, and repeat the above algorithm
for the next frame t + 1 until the tail of tracklet TR. Based
on this algorithm, the schematic diagram of the spatial-temporal
appearance model is shown in Fig. 2.
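For concreteness, a condensed Python sketch of the per-pixel online EM update of Algorithm 1 follows. It is our own paraphrase under stated assumptions (diagonal covariances, the Gaussian density as the similarity ρ, and θ_1 = 0.1 as reported in Section VI); the merge step of lines 19-28 is omitted for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

THETA1 = 0.1  # new-component threshold theta_1 (value from Section VI)

def online_em_update(gmm, x):
    """One online EM update (Algorithm 1, lines 2-17) for a pixel feature x.

    gmm: list of dicts with keys 'mu' (5,), 'var' (5,) diagonal, 'C', 'omega'.
    """
    if not gmm:
        gmm.append({'mu': x.copy(), 'var': np.full(5, 0.01), 'C': 1.0, 'omega': 1.0})
        return
    # E-step: similarity rho_k of x with each Gaussian (line 7)
    rho = np.array([multivariate_normal.pdf(x, g['mu'], np.diag(g['var']))
                    for g in gmm])
    if rho.max() < THETA1:
        # no component explains x: spawn a new one, skip the M-step (line 10)
        gmm.append({'mu': x.copy(), 'var': np.full(5, 0.01), 'C': 1.0, 'omega': 0.0})
    else:
        alpha = rho / rho.sum()          # normalized responsibilities (line 13)
        for g, a in zip(gmm, alpha):
            g['C'] += a
            lr = a / g['C']              # update factor alpha_k / C_k
            g['mu'] = (1 - lr) * g['mu'] + lr * x                    # line 14
            g['var'] = (1 - lr) * g['var'] + lr * (x - g['mu'])**2   # line 15
    total = sum(g['C'] for g in gmm)
    for g in gmm:
        g['omega'] = g['C'] / total      # re-normalize weights (line 16)
```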
B. Tracklet Association
After building the spatial-temporal appearance model for
each tracklet, we start to associate them for final tracking
results. The main idea of our tracklet association is based on
the motion prediction of Bayesian methods, since the motion
of targets in the current frame can be predicted based on the
motion of several previous frames. In order to implement this
strategy, a popular prediction tool in computer vision
research, the Kalman filter [27], is typically used to predict
the location of a target. The Kalman filter is a linear system
with Gaussian noise based on the Markov property; it predicts the
target location only from the information of the last frame. By
repeating the Kalman filter frame by frame, we can predict the
location of the target. When a target is occluded, the reliable
tracklet of the target is terminated. Then, based on the latest
Kalman state of this tracklet, we can predict the approximate
location of the target in the future frames when the target
is occluded. However, the linear motion prediction may be
imprecise over long frame gaps due to nonlinear motion [34].
We propose a strategy to alleviate this in the following part.
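As background, the constant-velocity Kalman prediction used to extrapolate a terminated tracklet can be sketched as follows. This is a generic textbook predictor, not code from the paper, and the state layout (x, y, vx, vy) is our assumption.

```python
import numpy as np

def kalman_predict(x, P, dt=1.0, q=1.0):
    """Predict step of a constant-velocity Kalman filter.

    x: state (x, y, vx, vy); P: 4x4 state covariance.
    Repeating this frame by frame extrapolates an occluded target's
    location from the tracklet's latest Kalman state.
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)  # linear motion model
    Q = q * np.eye(4)                           # process noise
    return F @ x, F @ P @ F.T + Q
```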
Once we obtain the predicted location of the occluded
target, in the normal case we search for the target around
the predicted location to continue tracking it. However, the
range around the predicted location is difficult to define
precisely with a fixed boundary. We propose a fuzzy search
method based on an exponential distribution instead of a fixed
boundary. The exponential distribution is modeled as α^{d(L_m, L_n)},
where α is the base of the exponential distribution and d(·)
is the normalized Euclidean distance between the predicted
location L_m and the target location L_n of another tracklet in the
current frame. With the increase of the search radius, the location
distance grows larger and larger, meaning that the probability of
finding the target becomes lower and lower as the search radius
increases:

$$S_{m,n} = \mathrm{JSD}(G_m \,\|\, G_n) \cdot \alpha^{d(L_m, L_n)}. \tag{3}$$
In order to associate tracklets to form the final tracking
results, we still need the spatial-temporal appearance model
to measure the appearance similarity between tracklets. The
association of tracklets can be represented as (3), where S_{m,n}
represents the similarity distance between two tracklets. If
S_{m,n} is lower than that of any other tracklet pair, the two
tracklets can be associated. This can be solved by the Hungarian
algorithm [35] in our approach. In fact, for efficiency of
implementation, the approach uses a time sliding window on
the video sequence to compute the smallest distance S_{m,n};
it does not process the whole video at once. JSD(·) is the
JSD used to compute the similarity between two spatial-temporal
appearance models, and G_m and G_n are two GMM distributions.
We will describe the details of the JSD in the following part. The
term α^{d(L_m, L_n)} is the exponential distribution of the fuzzy
search method. In order to decide whether two tracklets can
form a pair for computing the similarity, the nonoverlap restriction
shown in (1) must be enforced. This restriction is reasonable,
since a person cannot belong to two tracks or appear in two
places at the same time.
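The following sketch shows how (3) and the nonoverlap restriction could drive the assignment. It is an illustrative outline, not the authors' implementation: helper names such as jsd_gmm, predict_location, and dist are hypothetical stand-ins for the pieces described above, with α = 100 as reported in Section VI.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

ALPHA = 100.0  # base of the exponential fuzzy-search term (Section VI)
INF = 1e9      # marks forbidden associations

def associate(tracklets_done, tracklets_new, jsd_gmm, predict_location, dist):
    """Build the S_{m,n} cost matrix of (3) within a time window and
    solve it with the Hungarian algorithm."""
    S = np.full((len(tracklets_done), len(tracklets_new)), INF)
    for m, tm in enumerate(tracklets_done):
        Lm = predict_location(tm)                 # Bayesian motion prediction
        for n, tn in enumerate(tracklets_new):
            # nonoverlap restriction of (1): duration times must not overlap
            if tm.responses[-1].frame >= tn.responses[0].frame:
                continue
            Ln = tn.responses[0].position
            S[m, n] = jsd_gmm(tm.gmm, tn.gmm) * ALPHA ** dist(Lm, Ln)
    rows, cols = linear_sum_assignment(S)         # minimize total S_{m,n}
    return [(m, n) for m, n in zip(rows, cols) if S[m, n] < INF]
```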
The distance between two GMM distributions can be computed
from their weights, means, and variances instead of
comparing each sample pair. This can be done with the
JSD or the Kullback-Leibler divergence (KLD) [36]. We employ
the JSD instead of the KLD to compute the similarity between
two GMM distributions, since the KLD is a nonsymmetric and
unnormalized distance: the KLD lies in the range [0, +∞), and
the distance KLD(G_m || G_n) differs from the distance
KLD(G_n || G_m). The JSD, in contrast, lies in the range [0, 1]
and is symmetric. Therefore, the JSD is more suitable for
evaluating the distance between two GMM distributions.
Even so, the similarity between two GMMs cannot be computed
directly, since different orderings of the Gaussians in the GMMs
lead to different distances. We need a strategy to find the
smallest distance between two GMMs; the smallest distance
gives the strictest comparison in the global tracklet
association. Furthermore, the appearance of a person must
appear as a whole: the subregions of the appearance must occur
with overlapping duration time, otherwise the combination
of subregions would be meaningless. In order to satisfy the
above requirement, which is the smallest distance between
two GMMs under the restriction of overlapping duration
time, we present Algorithm 2 to select Gaussian distributions
from two GMMs to compute the similarity. The selection in
this algorithm is mainly based on the parameter T_k of each
Gaussian distribution.
The selection algorithm of Gaussian distribution is shown in
Algorithm 2. We first input two GMM distributions (G_m and
G_n), which belong to two tracklets, respectively, and initialize
two sets A and B to store the selected Gaussian distributions
Algorithm 2: Gaussian selection from GMM

Input: two GMM distributions G_m and G_n; initialize two sets A = B = ∅
 1: for each N_{m,i} of G_m do
 2:   for each N_{n,j} of G_n do
 3:     Δ_{mi,nj} = JSD(N_{m,i} || N_{n,j})
 4:   end
 5: end
 6: (i, j) = argmin_{i,j} Δ_{mi,nj}
 7: add N_{m,i} to A; add N_{n,j} to B
 8: repeat
 9:   initialize two sets AT = ∅ and BT = ∅
10:   for each N_{m,i} of G_m with N_{m,i} ∉ A and T_{m,i} ∩ T_A ≠ ∅ do
11:     add N_{m,i} to AT
12:     for each N_{n,j} of G_n with N_{n,j} ∉ B and T_{n,j} ∩ T_B ≠ ∅ do
13:       add N_{n,j} to BT
14:       Δ_{mi,nj} = JSD(N_{m,i} || N_{n,j})
15:     end
16:   end
17:   if AT ≠ ∅ and BT ≠ ∅ then
18:     (i, j) = argmin_{i,j} Δ_{mi,nj}
19:     add N_{m,i} to A; add N_{n,j} to B
20:   end
21:   if (AT ≠ ∅ and BT = ∅) or (AT = ∅ and BT ≠ ∅) then
22:     for each Gaussian N_{AT,p} (or N_{BT,q}) of AT (or BT) do
23:       find the Gaussian N_B (or N_A) in set B (or A) with the minimal JSD distance
24:       add N_{AT,p} (or N_{BT,q}) to A (or B)
25:       repeat N_B (or N_A) in B (or A)
26:     end
27:   end
28: until A and B do not increase
29: normalize the weights of the Gaussians in sets A and B, respectively

Output: the sets A and B, called GMM distributions G′_m and G′_n, respectively; the numbers of Gaussians in G′_m and G′_n are K′_m and K′_n, and since K′_m = K′_n we call them K′
from G_m and G_n, respectively. G_m is a GMM distribution that
belongs to a terminated tracklet, and G_n is a GMM distribution
that belongs to a newly initialized tracklet. First, in lines 1 to 7,
we compute the similarity of each pair of Gaussian
distributions from G_m and G_n, and store the minimal-distance
pair (N_{m,i}, N_{n,j}): N_{m,i} is added to A and N_{n,j} to B.
Then, in lines 8 to 28, we repeat the selection process until the
sets A and B stop growing.
Lines 10 and 12 have the same meaning; their loop conditions
differ from those of lines 1 and 2. For line 10, N_{m,i} must not
already belong to the set A before its similarity is computed.
T_A denotes the duration times of the Gaussian distributions in
set A, and T_{m,i} ∩ T_A ≠ ∅ requires that the duration time
T_{m,i} of N_{m,i} overlaps the duration time of each Gaussian
distribution in set A, as illustrated in Fig. 3(a). This condition
ensures that the Gaussian distributions in set A (or B) appear
at the same time, since the integrated appearance of a person
must appear as a whole. However, the Gaussian distributions of
the original input G_m (or G_n) may not all appear simultaneously,
as in Fig. 3(b). In lines 17 to 27, we select the minimal-distance
pair and store it. When
Fig. 3. Selection of Gaussian distributions.
AT ≠ ∅ and BT ≠ ∅, we can directly compute the minimal
distance of the Gaussian distributions and store them into A and
B, as shown in lines 17 to 19. In line 21, if one of these
two sets is empty, we follow lines 22 to 26 to find a match.
This is very important: if a Gaussian distribution in AT could
not find a match in BT, it would be discarded, leading to
missing features, so we repeat some Gaussian distributions in
set B to match them. Finally, in line 29, we normalize the
weights of the Gaussian distributions in sets A and B,
respectively. After the selection process, the two sets A and B
are output. The
Gaussian distributions in set A are called the GMM distribution
G′_m, and the Gaussian distributions in set B are called the GMM
distribution G′_n. The numbers of Gaussian distributions in G′_m
and G′_n are K′_m and K′_n, respectively; since K′_m = K′_n, we
call them K′.
Algorithm 2 shows the dynamic selection algorithm of
Gaussian distribution for the minimal GMM distance. The
dynamic selection may select the GMM appearance which
appeared the earliest, or the latest, as long as the appearance
distance between two tracklets is minimal.
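A small sketch of the duration-overlap test that gates this selection may help; the interval representation of T_k as a (start, end) pair is our own simplification.

```python
def overlaps(t_a, t_b):
    """True if two duration intervals T_a = (start, end) and
    T_b = (start, end), in frames, share at least one frame."""
    return max(t_a[0], t_b[0]) <= min(t_a[1], t_b[1])

# e.g. the example from Section IV-A: frames 1-10 vs. 6-30 overlap
assert overlaps((1, 10), (6, 30))
```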
After the selection of Algorithm 2, the distance between
two GMM distributions is computed by (4), where ω′_{m,k} and
ω′_{n,k} are the weights of the Gaussian distributions N′_{m,k}
and N′_{n,k}. Equation (3) is accordingly modified into (5):

$$\mathrm{JSD}(G'_m \,\|\, G'_n) = \sum_{k=1}^{K'} \frac{\omega'_{m,k} + \omega'_{n,k}}{\sum_{k=1}^{K'} \left(\omega'_{m,k} + \omega'_{n,k}\right)} \, \mathrm{JSD}(N'_{m,k} \,\|\, N'_{n,k}) \tag{4}$$

$$S_{m,n} = \mathrm{JSD}(G'_m \,\|\, G'_n) \cdot \alpha^{d(L_m, L_n)}. \tag{5}$$
Finally, we need to compute the JSD distance between two
single Gaussian distributions. For ease of exposition, assume two
single Gaussian distributions N_1 and N_2, where N_1 corresponds
to N′_{m,k} and N_2 corresponds to N′_{n,k}. The definition of the
JSD in line 21 of Algorithm 1 is the same as described here,
with N_1 and N_2 corresponding to N_a and N_b in Algorithm 1,
respectively:

$$\mathrm{JSD}(N_1 \,\|\, N_2) = \frac{1}{2}\left[\mathrm{KLD}(N_1 \,\|\, \bar{N}) + \mathrm{KLD}(N_2 \,\|\, \bar{N})\right] \tag{6}$$

$$\bar{N} = \frac{1}{2}\left(N_1 + N_2\right). \tag{7}$$
The general form of the JSD is shown in (6) and (7), where
N_1 = {ω_1, μ_1, Σ_1} and N_2 = {ω_2, μ_2, Σ_2} are the two
Gaussian distributions to compare, KLD(·) is the KLD, and
N̄ = {ω̄, μ̄, Σ̄} is the mixture distribution of N_1 and N_2. In [37],
N̄ is computed by (8) and (9) without Gaussian weights:

$$\bar{\mu} = \frac{1}{2}\left(\mu_1 + \mu_2\right) \tag{8}$$

$$\bar{\Sigma} = \frac{1}{2}\left(\Sigma_1 + \Sigma_2 + \mu_1\mu_1^{T} + \mu_2\mu_2^{T}\right) - \bar{\mu}\bar{\mu}^{T}. \tag{9}$$

Since our spatial-temporal appearance model is based on a
GMM distribution, the weight of each Gaussian distribution
remains important, so we modify (8) and (9) with the Gaussian
weights, as shown in (10), (11), and (12). The KLD is shown in
(13) and (14); we derive KLD(N_1 || N̄) as an example, and
KLD(N_2 || N̄) is obtained in the same way. P is the normalized
Gaussian distribution (Σ_x P(x) = 1) and dim is the dimension of
the Gaussian distribution:

$$\bar{\omega} = \frac{1}{2}\left(\omega_1 + \omega_2\right) \tag{10}$$

$$\bar{\mu} = \frac{\omega_1\mu_1 + \omega_2\mu_2}{\omega_1 + \omega_2} \tag{11}$$

$$\bar{\Sigma} = \frac{\omega_1\left(\Sigma_1 + \mu_1\mu_1^{T}\right) + \omega_2\left(\Sigma_2 + \mu_2\mu_2^{T}\right)}{\omega_1 + \omega_2} - \bar{\mu}\bar{\mu}^{T}. \tag{12}$$
$$\begin{aligned}
\mathrm{KLD}(N_1 \,\|\, \bar{N}) &= \sum_x N_1(x) \ln\frac{N_1(x)}{\bar{N}(x)} = \sum_x \omega_1 P_1(x) \ln\frac{\omega_1 P_1(x)}{\bar{\omega}\bar{P}(x)} \\
&= \omega_1 \ln\frac{\omega_1}{\bar{\omega}} \sum_x P_1(x) + \omega_1 \sum_x P_1(x) \ln\frac{P_1(x)}{\bar{P}(x)} \\
&= \omega_1 \ln\frac{\omega_1}{\bar{\omega}} + \omega_1 \sum_x P_1(x) \ln\frac{P_1(x)}{\bar{P}(x)}.
\end{aligned} \tag{13}$$
Replacing P with the Gaussian distribution function, we obtain

$$\mathrm{KLD}(N_1 \,\|\, \bar{N}) = \omega_1 \ln\frac{\omega_1}{\bar{\omega}} + \frac{\omega_1}{2}\left[\ln\frac{|\bar{\Sigma}|}{|\Sigma_1|} + \mathrm{tr}\left(\bar{\Sigma}^{-1}\Sigma_1\right) - \dim + \left(\mu_1 - \bar{\mu}\right)^{T}\bar{\Sigma}^{-1}\left(\mu_1 - \bar{\mu}\right)\right]. \tag{14}$$
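To make the computation concrete, here is a small Python sketch of (10)-(14); it is our reading of the equations, using the weighted moment-matched mixture of (11)-(12), not the authors' implementation.

```python
import numpy as np

def kld_to_mixture(w1, mu1, S1, wbar, mubar, Sbar):
    """Closed-form KLD(N1 || Nbar) of (14) for a weighted Gaussian."""
    dim = mu1.shape[0]
    diff = mu1 - mubar
    Sbar_inv = np.linalg.inv(Sbar)
    quad = diff @ Sbar_inv @ diff
    logdet = np.log(np.linalg.det(Sbar) / np.linalg.det(S1))
    return w1 * np.log(w1 / wbar) + 0.5 * w1 * (
        logdet + np.trace(Sbar_inv @ S1) - dim + quad)

def jsd_gaussians(w1, mu1, S1, w2, mu2, S2):
    """JSD(N1 || N2) of (6)-(7) with the weighted mixture of (10)-(12)."""
    wbar = 0.5 * (w1 + w2)                                           # (10)
    mubar = (w1 * mu1 + w2 * mu2) / (w1 + w2)                        # (11)
    Sbar = ((w1 * (S1 + np.outer(mu1, mu1))
             + w2 * (S2 + np.outer(mu2, mu2))) / (w1 + w2)
            - np.outer(mubar, mubar))                                # (12)
    return 0.5 * (kld_to_mixture(w1, mu1, S1, wbar, mubar, Sbar)
                  + kld_to_mixture(w2, mu2, S2, wbar, mubar, Sbar))  # (6)
```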
V. Implementation Details
A. Building Reliable Tracklets
We borrow the idea of [1] to build reliable tracklets, for the
same reasons as in [3] and [23]-[26], since it is a simple
and conservative method. This approach considers that the
changes of the target are very small in two consecutive frames
including position displacement, the changes of appearance
and so on. The affinity is formalized as

$$r_t^{*} = \arg\min_{r_t^i:\; d(L_{r_t^i},\, L_{r_{t-1}}) < \lambda} B\left(r_t^i, r_{t-1}\right) \tag{15}$$

where L_{r_t^i} is the center location of the detection response
r_t^i in frame t, L_{r_{t-1}} is the center location of the observation
r_{t-1}, and d(·) is the Euclidean distance. B(·) is the Hellinger
distance of the color histograms between two observations,
defined as √(1 − BC(r_t^i, r_{t-1})), where BC(·) is the
Bhattacharyya coefficient. λ represents the range of the
neighborhood position. In (15), we compute the appearance
distance between detection responses in the neighborhood of
r_{t-1}, and link the response r_t^* of minimal distance to the
corresponding tracklet. This strategy is conservative and biased
toward linking only reliable associations between any two
consecutive frames. In order to prevent unsafe associations of
responses, a boundary value θ_2 of the affinity is defined in [1];
in other words, two responses are linked if and only if their
affinity distance is lower than the threshold θ_2 and significantly
lower than the distance of any other pair. In our method, the
boundary value is θ_2 = 0.3, since the Hellinger distance lies in
the range [0, 1].
Since the human detector only gives the approximate size of
detection responses, the size may be larger or smaller than the
size of the ground-truth target. In order to extract accurate
features for the tracklet association, tracklet refinement is
needed. Let s_t be the size of an observation r_t in a tracklet
TR, where t is the frame index. The refinement can be
formalized as

$$s_t = \frac{1}{\tau} \sum_{i=1}^{\tau} s_{t-i} \tag{16}$$

where τ is a time sliding window in the refinement filter, set
to five. The size of the detection response r_t is refined based on
the average size of the previous frames. By repeating this
equation until the tail of the tracklet, the size of each detection
response in the tracklet can be smoothed.
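A minimal sketch of this smoothing, under our reading of (16) as a causal moving average with τ = 5:

```python
def smooth_sizes(sizes, tau=5):
    """Refine detection sizes along a tracklet per (16): each size is
    replaced by the average of the previous tau (refined) sizes."""
    refined = list(sizes[:tau])           # keep the first tau sizes as-is
    for t in range(tau, len(sizes)):
        window = refined[t - tau:t]
        refined.append(sum(window) / tau)
    return refined

# e.g. widths of a tracklet's boxes
print(smooth_sizes([30, 32, 29, 31, 30, 45, 30]))  # the 45 outlier is replaced by the local average
```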
B. Entry and Exit Model
In order to use the Hungarian algorithm to compute the optimal
association results of (3), the entry/exit model of scenes must
be defined to specify the initialization/termination of each
tracklet in the camera view. The entry/exit model includes
two parts in our approach: one is the boundary of the camera
view, and the other is the entry/exit zones within the camera view.
The boundary of the camera view makes it easy to specify the
initialization/termination of a tracklet. When a target enters the
image from the boundary of the camera view, its tracklet is
initialized. When a target leaves the image from the boundary
of the camera view, its tracklet is terminated. The entry/exit
zones in the camera view need to be learned.
We use our previous work [38] to accomplish this task. This work is a trajectory analysis method that can also learn the
entry/exit zones. In [38], each trajectory is represented as
a sequence of key points and start/end points. The features
of each key point are coordinates, turning angle (TA), and
turning angle direction (TAD). For the start/end points, the
features are only coordinates. Key points are used to learn
the classification of trajectories. The start/end points are used
to learn the entry/exit zones of scenes. In our approach, only
coordinate features of start/end points of each trajectory are
extracted without key points, since we only use the method of
[38] to learn the entry/exit zones without the task of trajectory
classification. Start/end points of trajectories are clustered
using an unsupervised EM algorithm. Each class is modeled as
a Gaussian distribution and represented as {μ_x, μ_y, σ_x^2, σ_y^2},
where μ_x and μ_y are the means of coordinates x and y, and
σ_x^2 and σ_y^2 are their variances. Finally, the classes of
start/end points of trajectories are the entry/exit zones.
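This clustering step could be sketched with an off-the-shelf GMM, as below; the number of zones and the use of scikit-learn are our assumptions for illustration, not details from [38].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_entry_exit_zones(endpoints: np.ndarray, n_zones: int = 4):
    """Cluster trajectory start/end points (N x 2 array of (x, y)) with
    EM; each Gaussian class {mu_x, mu_y, sigma_x^2, sigma_y^2} is one
    entry/exit zone."""
    gmm = GaussianMixture(n_components=n_zones, covariance_type='diag')
    gmm.fit(endpoints)
    # means_: zone centers; covariances_: per-axis variances (diagonal)
    return list(zip(gmm.means_, gmm.covariances_))
```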
VI. Experiments
A. Dataset and Test Method
1) Dataset: We use four challenging datasets to test our
approach. These four datasets are the TRECVID 2008 [39],
the CAVIAR dataset [40], the ETH Mobile Platform dataset
[41], and the Terrace sequence of EPFL dataset [42], [43]. All
of these datasets include many mutual occlusions in crowded
scenes.
Many state-of-the-art human detectors can be adopted to
detect human responses, such as [44]-[46]. For the
TRECVID, ETH, and CAVIAR datasets, to compare with state-
of-the-art tracking approaches fairly, we adopt the human
detection results used in [25], since these detection results are also used in [1], [3], [23], [24], and [26]. For the
Terrace sequence, we use the method of [44].
The TRECVID 2008 dataset comprises hundreds of hours of video
from five fixed cameras covering different fields of view. We follow
the data setting of the paper [23] since they first used this
dataset. Their setting is also used in [1], [3], and [24]-[26]. The
CAVIAR dataset is captured in a shopping center corridor by
two fixed cameras from two different viewpoints and contains
26 video sequences. We follow the setting of [25], which
selected 20 videos using only the corridor view. Their
setting is also used in previous works, such as [1], [23], [24],
and [26]. The ETH dataset is captured by a stereo pair of
forward-looking moving cameras in a busy street scene. Due
to the lower position of the cameras, full occlusions also
often happen in these videos. We follow the dataset setting
proposed in [25] and its ground truth. In [25], only the left view
of the videos was used, without stereo depth maps. The EPFL Terrace
sequence uses four cameras with an overlapping camera view.
Objects remain visible for a very long time, and their appearance
varies more often. Since our method only uses a monocular
camera, we only use the data of camera 0 to test our approach.
The details of these datasets are shown in Table I.
2) Test Method and Evaluation Metrics: We conduct four
experiments to evaluate the effectiveness of our approach.
The first experiment tests how the value of the parameter α affects the tracking performance. The second experiment compares
our approach with several state-of-the-art methods on the
TRECVID, ETH, and CAVIAR datasets. The comparison
methods include [1]-[3] and [23]-[26], since these methods are
the latest results and can reflect the best performance of
tracking by tracklet association. The third experiment tests the
robustness of our approach using the Terrace sequence of the EPFL
dataset, since the objects appear for hundreds or even thousands of
frames and the appearance of objects varies more often in
these sequences. The fourth experiment evaluates the computational
cost of our approach.
TABLE I
Dataset
TABLE II
Discussion of Parameter Values
In our experiments, to compare with state-of-the-art ap-
proaches fairly, we adopt the metrics used in [23] to eval-
uate the tracking performance since all of these comparison
methods follow the metrics.
The metrics in [23] are an improved version of the original
metrics in [47]. In [23], the track fragments and ID
switches are defined more strictly, but also more clearly, than the
original definitions in [47]. The metrics are as follows.
1) Ground truth (GT): The number of trajectories in the
ground truth.
2) Mostly tracked trajectories (MT): The percentage of
trajectories that are successfully tracked for more than
80% divided by GT.
3) Partially tracked trajectories (PT): The percentage of
trajectories that are tracked between 20% and 80%
divided by GT.
4) Mostly lost trajectories (ML): The percentage of trajec-
tories that are tracked for less than 20% divided by GT.
5) Fragments (Frag): The total number of times that a
trajectory in GT is interrupted.
6) ID switches (IDS): The total number of times that a
tracked trajectory changes its matched GT identity.
Since multiobject tracking can be viewed as a method that
is able to recover missed detections and remove false alarms
from the raw detection responses, the metrics for detection
evaluation are provided.
1) Recall: The number of correctly matched detections
divided by the total number of detections in ground truth.
2) Precision: The number of correctly matched detections
divided by the number of output detections.
3) False Alarms per Frame (FAF): The number of false
alarms per frame.
Higher values indicate better performance for MT, recall,
and precision; lower values indicate better performance for
PT, ML, Frag, IDS, and FAF.
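To illustrate the trajectory-level metrics, here is a small sketch that classifies ground-truth trajectories into MT/PT/ML by their tracked fraction; the input format is our own illustrative choice.

```python
def classify_trajectories(tracked_fractions):
    """Classify each ground-truth trajectory by the fraction of its
    frames that were successfully tracked: MT > 0.8, ML < 0.2,
    PT otherwise. Returns percentages relative to GT."""
    gt = len(tracked_fractions)
    mt = sum(f > 0.8 for f in tracked_fractions)
    ml = sum(f < 0.2 for f in tracked_fractions)
    pt = gt - mt - ml
    return {'MT': 100 * mt / gt, 'PT': 100 * pt / gt, 'ML': 100 * ml / gt}

print(classify_trajectories([0.95, 0.5, 0.1, 0.85]))
# {'MT': 50.0, 'PT': 25.0, 'ML': 25.0}
```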
B. Performance
In the first experiment, we test how the value of the parameter α
affects the tracking performance on the TRECVID, ETH,
and CAVIAR datasets. We only use the tracking metrics MT,
PT, ML, Frag, and IDS to evaluate this experiment, since the
human detection metrics Recall, Precision, and FAF
cannot reflect changes of tracking performance directly.
The results are shown in Table II, with the best results in
bold face. With the increase of the parameter α, the
tracking performance goes up. When α = 100, we obtain
the best performance on all three datasets. If the parameter α
increases further, such as α = 150, the performance on the
three datasets does not increase and may even decrease. This is
because the parameter α affects the search range of the tracklet
association: with the increase of α, the value of the exponential
term grows faster and faster, and if α is too large, even a small
distance leads to a very large value of α^{d(L_m, L_n)}. This large
value dominates the appearance features and decreases the tracking
performance. From this experiment, α = 100 is appropriate, and
we use α = 100 in the subsequent experiments. For the other
parameters, θ_1 = 0.1 (Algorithm 1), and θ_2 = 0.3 and τ = 5
(Section V-A).
In the second experiment, our method is compared with
state-of-the-art methods based on the complete metrics on
three challenging datasets: TRECVID, CAVIAR, and ETH. We
show the test results in Tables III-V. In all three tables, the
metrics recall, precision, and FAF are very close to previous
methods. These metrics indicate the misses and false alarms
of detections. Even though some of them outperform previous
TABLE III
Comparison of Tracking Results on TRECVID Dataset
TABLE IV
Comparison of Tracking Results on CAVIAR Dataset. *The numbers of Frag and IDS follow the metrics in [47]
TABLE V
Comparison of Tracking Results on ETH Dataset
methods, the enhancement is very limited. The first reason is
that we use the same detection results as previous methods.
The second reason is that our tracklet building method is simi-
lar to previous approaches that are based on the method in [1].
Furthermore, we do not propose new methods to recover de-
tections in this paper. Our method focuses on the performance
improvement of tracklet association for tracking metrics based
on the spatial-temporal appearance model. Therefore, if the
performance of human detectors or of detection-recovery
methods is not improved, the enhancement of human detection
achievable by refining the tracking algorithm alone is limited.
For the other metrics, we analyze them as
follows.
Table III shows the tracking results on the TRECVID dataset.
Our approach outperforms the previous methods: the MT is
higher, and the PT and ML are lower. For the metrics Frag and
IDS, our method scores much lower than previous methods. In
particular, though the methods of [25] and [26] have already
reduced Frag and IDS considerably, our method still outperforms
them. For the CAVIAR and ETH datasets, though the performance
enhancement is not as large as on the TRECVID dataset, we also
obtain better performance than previous methods.
For the CAVIAR dataset in Table IV, our approach improves
MT, PT, and IDS compared with previous methods. The Frag
metric is very close to [26] and outperforms the other
state-of-the-art methods. Some fragments (Frag) and ID switches
(IDS) are corrected by our method, reducing the numbers of Frag
and IDS. This reduction causes some partially tracked (PT)
trajectories to become mostly tracked (MT) trajectories; for this
reason, the MT metric is higher than previous methods, and the
PT is lower.
Finally, the performance on the ETH dataset is shown in
Table V. From the results, the performance enhancement of
our method is limited. Due to the view of the forward-looking
moving cameras, many people are difficult to detect, such as
very small persons without stable detection responses. For this
reason, almost 40% of the trajectories are partially tracked (PT)
or mostly lost (ML) in both our method and [25]. Without
detection responses, it is difficult to track these PT trajectories
into mostly tracked (MT) trajectories. Nevertheless, on the ETH
dataset, the performance of our approach is still superior to
the method of [25], with a higher MT and lower PT and Frag.
Across the three tables, we have analyzed the enhancement of
tracking performance based on the metrics. The essential
reason for the enhancement is the spatial-temporal appearance
model. Previous methods always extract low-level image
features to represent appearance, such as color histogram,
texture, and so on. These features may not be reliable under
partial occlusions, illumination changes, and human pose variation,
Fig. 4. Snapshots of TRECVID dataset. (a) Frame 879. (b) Frame 906. (c) Frame 918. (d) Frame 951. (e) Frame 973.
Fig. 5. Snapshots of CAVIAR dataset. (a) Frame 767. (b) Frame 854. (c) Frame 903. (d) Frame 958. (e) Frame 1049.
Fig. 6. Snapshots of ETH dataset. (a) Frame 88. (b) Frame 97. (c) Frame 124. (d) Frame 130. (e) Frame 142.
TABLE VI
Performance on Terrace Dataset
since they only capture the appearance within a single frame and
do not use appearance information from sequential frames to
refine the appearance feature. Our spatial-temporal appearance
model solves this problem based on statistical distribution and
includes dynamic spatial and temporal information of appear-
ance. The dynamic spatial information is the dynamic number
and layout of appearance subregions. The temporal informa-
tion is the duration time of each subregion. That means each
subregion crosses several frames. We use GMMs to imple-
ment this spatial-temporal appearance model. Each Gaussian
distribution represents a subregion. Therefore, our appearance
model not only provides stable appearance features based on
statistical distribution, but also provides spatial and temporal
information of appearance. Based on our method, we obtain
better performance compared with state-of-the-art methods.
Figs. 4-6 show some snapshots of our tracking results.
Color arrows show the occlusions of targets and association
results.
In the third experiment, we test the robustness of our
approach on the Terrace sequence. The objects appear for hundreds
or even thousands of frames, and encounter several or even dozens
of occlusions. The appearance of objects also varies more often,
which makes the tracklet association more complicated. Since
our approach is a monocular tracking method using videos of
camera zero, the results cannot be compared with [42] and
[43], which used multicamera video sequences. Our monocular
results and snapshots are shown in Table VI and Fig. 7.
From the results, our method can track most of the trajectories
correctly using a monocular camera. Some fragments and
ID switches still exist, especially when the scene is very crowded,
with up to nine persons present at the same time. The human
detector often misses detection responses, and even gives false
positive responses. Therefore, some reliable tracklets cannot
be built successfully by the conservative method.
In the fourth experiment, the computational cost of our
approach is evaluated on the four datasets. We use an Intel
Core i7 quad-core 2.0-GHz CPU with 4 GB of memory to test
the computational cost. The results are shown in Table VII; they
cover only the computational cost of tracking and do not include
the computational cost of human detection. The results show the
number of total tracklets that we build, the number of final
tracks, and the average frames per second. For the CAVIAR and
Fig. 7. Snapshots of EPFL Terrace dataset. (a) Frame 1590. (b) Frame 1629. (c) Frame 1663. (d) Frame 1693. (e) Frame 1716. (f) Frame 1760. (g) Frame 1854.
TABLE VII
Computational Cost of Our Approach on Each Dataset
Terrace datasets, the frame rate is acceptable. For the TRECVID
and ETH datasets, the frame rate is low. The main reason is
that the frame resolution of TRECVID and ETH is four times
that of the CAVIAR and Terrace datasets; in addition, they
contain more targets, so the TRECVID and ETH datasets require
more computation.
VII. Conclusion
In this paper, we propose a novel appearance representation method called the spatial–temporal appearance model, based on the distribution of a GMM. We use this appearance model to represent the appearance of a tracklet as a whole, with dynamic subregions and a dynamic duration for each subregion. Furthermore, since the spatial–temporal appearance model is built on the statistical distribution of a GMM, it is robust to illumination changes, pose variation, and image noise, and thus obtains stable appearance features. Finally, we associate tracklets using Bayesian prediction and JSD to obtain the final tracking results. Our approach is tested on four challenging datasets, and the experimental results show that it achieves good results.
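As a concrete illustration of the association distance, the following Monte Carlo estimate of the JSD between two GMM appearance models is a minimal sketch of our own: the JSD between GMMs has no closed form, so we sample from each model, and the paper's exact estimator may differ.

```python
import numpy as np

def js_divergence(gmm_p, gmm_q, n_samples=5000):
    """Monte Carlo Jensen-Shannon divergence between two fitted
    sklearn GaussianMixture models (e.g., from fit_appearance_model).

    JSD(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = (P + Q) / 2.
    """
    def kl_to_mixture(gmm_a):
        # KL(A || M) estimated as E_{x ~ A}[log a(x) - log m(x)]
        x, _ = gmm_a.sample(n_samples)
        log_a = gmm_a.score_samples(x)          # log density under A
        log_m = np.logaddexp(gmm_p.score_samples(x),
                             gmm_q.score_samples(x)) - np.log(2.0)
        return float(np.mean(log_a - log_m))

    return 0.5 * kl_to_mixture(gmm_p) + 0.5 * kl_to_mixture(gmm_q)
```

Two tracklets would then be associated when this distance is small and the Bayesian location prediction is consistent, for instance by solving the pairwise cost matrix with the Hungarian algorithm [35].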
References
[1] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses," in Proc. Eur. Conf. Comput. Vision, Oct. 2008, pp. 788–801.
[2] J. Xing, H. Ai, and S. Lao, "Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 1200–1207.
[3] B. Yang, C. Huang, and R. Nevatia, "Learning affinities and dependencies for multi-target tracking using a CRF model," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1233–1240.
[4] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 798–805.
[5] H. Wang, D. Suter, K. Schindler, and C. Shen, "Adaptive object tracking based on an effective appearance filter," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1661–1667, Sep. 2007.
[6] D. Reid, "An algorithm for tracking multiple targets," IEEE Trans. Autom. Control, vol. 24, no. 6, pp. 843–854, Dec. 1979.
[7] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe, "Sonar tracking of multiple targets using joint probabilistic data association," IEEE J. Ocean. Eng., vol. 8, no. 3, pp. 173–184, Jul. 1983.
[8] M. Isard and A. Blake, "Condensation-conditional density propagation for visual tracking," Int. J. Comput. Vision, vol. 29, no. 1, pp. 5–28, 1998.
[9] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe, "A boosted particle filter: Multitarget detection and tracking," in Proc. Eur. Conf. Comput. Vision, 2004, pp. 28–39.
[10] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, "Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2007, pp. 1–8.
[11] M. D. Breitenstein, F. Reichlin, B. Leibe, E. K. Meier, and L. V. Gool, "Robust tracking-by-detection using a detector confidence particle filter," in Proc. IEEE Int. Conf. Comput. Vision, Sep. 2009, pp. 1515–1522.
[12] M. Yang, F. Lv, W. Xu, and Y. Gong, "Detection driven adaptive multi-cue integration for multiple human tracking," in Proc. IEEE Int. Conf. Comput. Vision, Sep. 2009, pp. 1554–1561.
[13] C. Shan, T. Tan, and Y. Wei, "Real-time hand tracking using a mean shift embedded particle filter," Pattern Recognit., vol. 40, no. 7, pp. 1958–1970, 2007.
[14] Y. Cai, N. de Freitas, and J. J. Little, "Robust visual tracking for multiple targets," in Proc. Eur. Conf. Comput. Vision, 2006, pp. 107–118.
[15] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[16] Z. H. Khan, I. Y. Gu, and A. G. Backhouse, "Robust visual object tracking using multi-mode anisotropic mean shift and particle filters," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 1, pp. 74–87, Jan. 2011.
[17] J. F. Henriques, R. Caseiro, and J. Batista, "Globally optimal solution to multi-object tracking with merged measurements," in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 2470–2477.
[18] Z. Wu, T. H. Kunz, and M. Betke, "Efficient track linking methods for track graphs using network-flow and set-cover techniques," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1185–1192.
[19] W. Brendel, M. Amer, and S. Todorovic, "Multiobject tracking as maximum weight independent set," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1273–1280.
[20] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[21] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, "Part-based multiple-person tracking with partial occlusion handling," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1815–1821.
[22] H. Izadinia, I. Saleemi, W. Li, and M. Shah, "(MP)2T: Multiple people multiple parts tracker," in Proc. Eur. Conf. Comput. Vision, Oct. 2012, pp. 100–114.
[23] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: Hybrid-boosted multi-target tracker for crowded scene," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 2953–2960.
[24] C. Kuo, C. Huang, and R. Nevatia, "Multi-target tracking by on-line learned discriminative appearance models," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2010, pp. 685–692.
[25] C. Kuo and R. Nevatia, "How does person identity recognition help multi-person tracking?" in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1217–1224.
[26] B. Yang and R. Nevatia, "Multi-target tracking by online learning of non-linear motion patterns and robust appearance models," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1918–1925.
[27] R. E. Kalman, "A new approach to linear filtering and prediction problems," Trans. ASME J. Basic Eng., vol. 82, no. 1, pp. 35–45, 1960.
[28] S. Kwak, W. Nam, B. Han, and J. H. Han, "Learning occlusion with likelihoods for visual tracking," in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 1551–1558.
[29] J. Fan, X. Shen, and Y. Wu, "Scribble tracker: A matting-based approach for robust tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1633–1644, Aug. 2012.
[30] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[31] J. Fan, Y. Wu, and S. Dai, "Discriminative spatial attention for robust tracking," in Proc. Eur. Conf. Comput. Vision, Sep. 2010, pp. 480–493.
[32] S. T. Birchfield and S. Rangarajan, "Spatiograms versus histograms for region-based tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, Jun. 2005, pp. 1158–1163.
[33] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inf. Theory, vol. 37, no. 1, pp. 145–151, Jan. 1991.
[34] B. Yang and R. Nevatia, "Online learned discriminative part-based appearance models for multi-human tracking," in Proc. Eur. Conf. Comput. Vision, Oct. 2012, pp. 484–498.
[35] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logistics Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
[36] S. Kullback and R. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[37] A. Ulges, C. Lampert, D. Keysers, and T. Breuel, "Spatiogram based shot distances for video retrieval," in Proc. Online Text Retrieval Conf. Video Retrieval Eval., 2006, pp. 1–10.
[38] Y. Shen, Z. Miao, and J. Zhang, "Unsupervised online learning trajectory analysis based on weighted directed graph," in Proc. IEEE Int. Conf. Pattern Recognit., Nov. 2012, pp. 1306–1309.
[39] National Institute of Standards and Technology, TRECVID 2008 Evaluation for Surveillance Event Detection [Online]. Available: http://www.nist.gov/speech/tests/trecvid/2008/
[40] CAVIAR Dataset, EC Funded CAVIAR Project/IST 2001 37540, Jan. 2004 [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
[41] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, "A mobile vision system for robust multi-person tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[42] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multi-camera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 267–282, Feb. 2008.
[43] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806–1819, Sep. 2011.
[44] Q. Zhu, S. Avidan, M. C. Yeh, and K. T. Cheng, "Fast human detection using a cascade of histograms of oriented gradients," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 1491–1498.
[45] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2009, pp. 32–39.
[46] C. Huang and R. Nevatia, "High performance object detection by collaborative learning of joint ranking of granules features," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2010, pp. 41–48.
[47] B. Wu and R. Nevatia, "Tracking of multiple, partially occluded humans based on static body part detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 951–959.
Yuan Shen received the B.E. and M.E. degrees in 2008 and 2010, respectively, from Beijing Jiaotong University, Beijing, China, where he is currently working toward the Ph.D. degree.
His research interests include pattern recognition, machine learning, video surveillance, multiobject tracking, and trajectory analysis.
Zhenjiang Miao (M'11) received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, in 1990 and 1994, respectively.
From 1995 to 1998, he was a Post-Doctoral Fellow with the École Nationale Supérieure d'Électrotechnique, d'Électronique, d'Informatique, d'Hydraulique et des Télécommunications, Institut National Polytechnique de Toulouse, Toulouse, France, and was a Researcher with the Institut National de la Recherche Agronomique, Sophia Antipolis, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, and Nortel Networks, Ottawa, ON, Canada. He joined Beijing Jiaotong University, Beijing, in 2004, where he is currently a Professor. His research interests include image and video processing, multimedia processing, and intelligent human–machine interaction.