
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 3, MARCH 2014 361

Multihuman Tracking Based on a Spatial–Temporal Appearance Match

Yuan Shen and Zhenjiang Miao, Member, IEEE

Abstract: In this paper, we focus on the improvement of appearance representation for multihuman tracking. Many previous methods extracted low-level appearance features, such as color histogram and texture, even combined with spatial information, for each frame. These methods ignore the temporal distribution of features. The features of each frame may not be stable due to illumination, human pose variation, and image noise. In order to improve this, we propose a novel appearance representation called the spatial–temporal appearance model, based on the statistical distribution of a Gaussian mixture model (GMM). It represents the appearance of a tracklet as a whole, with dynamic spatial and temporal information. The spatial information is the dynamic subregions. The temporal information is the dynamic duration time of each subregion. Each subregion is modeled as a weighted Gaussian distribution of the GMM. The online expectation-maximization (online EM) algorithm is used to estimate the parameters of the GMM. Then, we propose a tracklet association method using Bayesian prediction and the Jensen–Shannon divergence. The Bayesian prediction is used to predict the locations of targets. The Jensen–Shannon divergence is used to compute the distance of the spatial–temporal appearance distribution between two tracklets. Finally, we test our approach on four challenging datasets (TRECVID, CAVIAR, ETH, and EPFL Terrace) and achieve good results.

Index Terms: Jensen–Shannon divergence, multihuman tracking, online EM, spatial–temporal appearance.

    I. Introduction

MULTIHUMAN tracking in complex environments has become more and more important in the field of computer vision research. It has many applications, such as video-based surveillance and human–computer interaction. Its aim is to locate targets, retrieve their trajectories, and maintain their identities through a video sequence. The main challenging problem is the frequent occlusion of targets in crowded scenes.

Manuscript received November 7, 2012; revised February 23, 2013, May 26, 2013, and July 12, 2013; accepted August 2, 2013. Date of publication August 29, 2013; date of current version March 4, 2014. This work was supported in part by NSFC 61273274 and NSFB 4123104, in part by the 973 Program 2011CB302203, in part by the National Key Technology Research and Development Program of China under Grant 2012BAH01F03, in part by the Ph.D. Programs Foundation of Ministry of Education of China under Grant 20100009110004, and in part by the Tsinghua-Tencent Joint Laboratory for IIT. This paper was recommended by Associate Editor C. Shan.

The authors are with the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2013.2280073

In order to solve the challenging problem of human tracking, the classical tracking methods mainly follow the framework based on the particle filter. However, it is difficult to track targets through long full occlusions in crowded scenes, since there are no observations to guide the trackers. In recent years, due to the improvements of human detection performance, tracking by global association has become more and more popular. This scheme has a general framework to track targets. It links detection responses in consecutive frames to build tracklets, which are short tracks for further analysis. Then, an association algorithm is used to associate tracklets for the final tracking results. By considering the information from future frames, some detection errors, such as missed detections and false alarms, can be corrected, and long full occlusions can also be handled.

Most of the global association methods fuse several features as the affinity measurement [1]–[3], such as appearance, motion, position, and size. They usually use filter-based methods to extract motion, position, and size, but still use low-level image features, such as color histogram and texture, to represent the entire human appearance. Some appearance features with spatial information have been proposed to improve low-level image features [4], [5]. Though spatial information can improve low-level image features under partial occlusions, the state-of-the-art methods ignore an important case: full occlusions. The existing methods always use the latest appearance model, learned by online update, to track targets, and gradually discard the earlier appearance. When a target is fully occluded for a long time and then reappears, it is difficult to estimate whether its appearance is more similar to the latest or to the earlier appearance model. If the target appearance is more similar to the earlier appearance model, a tracker that measures the similarity before and after the occlusion with the latest appearance model would drift. A simple remedy is to collect the latest and earlier appearance features of a target into a set. With the increase of frame steps, however, there would be a large number of feature samples, and it is time consuming to search for the most similar appearance features in such a large sample space. In order to record the latest and earlier appearance features, and to maintain their spatial and temporal information, including their spatial layout and temporal order, we need to explore a new appearance representation.

Based on the above motivation, in this paper, we propose a new appearance model called spatial–temporal appearance in the field of multihuman tracking and use this new appearance model to track multiple targets.



We still use the tracking framework of global association. For each tracklet, we perform an auto-clustering method based on the online expectation-maximization (online EM) algorithm to cluster the appearance features in space and time. We adopt the RGB color space to represent human appearance. Pixels with similar colors are clustered into the same class in space and time. We apply the Gaussian mixture model (GMM) to represent the classes of the tracklet. Each class is a subregion of the appearance and is modeled by a Gaussian distribution that represents the color distribution, dynamic spatial layout, and duration time of the subregion. The appearance of each tracklet is represented as a spatial–temporal statistical distribution. Based on this distribution of appearance, we can obtain a stable appearance feature that is not disturbed by pose, illumination variation, image noise, and so on.

To the best of our knowledge, the main contribution of this paper is the novel appearance representation called spatial–temporal appearance. It not only records the dynamic spatial layout of the appearance, but also maintains the dynamic duration time of each subregion of the appearance, covering both the latest and the earlier frames of the whole tracklet. This appearance model can provide more information for tracklet association than that of previous methods. In order to associate tracklets for the final tracking results using this spatial–temporal appearance model, we propose a tracklet association method using Bayesian prediction with a fuzzy search range, and use the Jensen–Shannon divergence to compute the similarity of the spatial–temporal appearance.

The rest of the paper is organized as follows. Related work is discussed in Section II. The overview of our approach is given in Section III. The spatial–temporal appearance and tracklet association are presented in Section IV. Section V shows some implementation details. The experimental results and discussions are shown in Section VI. Some conclusions are given in Section VII.

    II. Related Work

Object tracking has been a hot research field in computer vision for many years, and many methods have been proposed. The early works are multihypothesis tracking (MHT) [6] and joint probabilistic data association filters (JPDAF) [7]. MHT enumerated all possible hypotheses of the target and selected the most likely hypothesis as its optimal solution. With the increase of the number of targets and time steps, the original MHT method encounters difficulties in the computational cost of the hypotheses. The JPDAF method maintained a joint probability among tracking targets in each frame. When new targets enter the field of camera view or old targets leave the view, the joint probability needs to be recomputed.

In recent years, the particle filter [8] has been a widely used framework due to its robust performance, and many improved methods have been proposed. Some methods aim at the combination of the particle filter and detection results. Okuma et al. [9] combined the particle filter with Ada-boost detection results to track an unknown number of objects. Li et al. [10] used multiple detectors to form a cascade particle filter to enhance the computational speed. The order in which the detectors were applied was determined based on their computational costs: the faster, the earlier. Breitenstein et al. [11] proposed the continuous confidence of pedestrian detectors, and used it as a graded observation model to guide particle filter trackers. Yang et al. [12] used detection responses to update trackers and extracted multicue features to track targets, including a color model, an elliptical head model, and bags of local features. Other methods focus on the improvement of sampling efficiency. Shan et al. [13] and Cai et al. [14] embedded the mean shift [15] algorithm into the particle filter to improve the sampling efficiency of particles to track hands and multiple persons, respectively. Khan et al. [16] improved the mean shift embedded method by using multimode anisotropic mean shift. The particle filter-based tracking methods are suitable for online applications, since their results are only based on the past frames. These methods do not consider the information of future frames. When targets are fully occluded for a long time, these approaches may yield identity switches or trajectory fragments, since there are no observations to guide the trackers.

In contrast to these methods, which only consider the past information, many global data association approaches have been proposed. Global data association considers not only past frames but also future frames. These methods track targets and deal with occlusions by finding the best matches before and after occlusions. They build tracklets based on the detection responses in consecutive frames and perform association algorithms on these tracklets for the final tracking results. Some researchers call them tracking by tracklet association.

Huang et al. [1] presented a hierarchical association approach. They built reliable tracklets based on object position, size, and color histogram of appearance, and used the Hungarian algorithm to associate tracklets based on these features. Finally, they built an entry and exit map to specify the initialization/termination of each tracklet in the scene to enhance the performance of data association. Xing et al. [2] used the particle filter to refine tracklets and used the Hungarian algorithm to associate tracklets based on color histogram of appearance, size, and motion of targets. Henriques et al. [17] added merge and split measurements of targets to improve the Hungarian association. Wu et al. [18] used network flow to associate tracklets based only on motion features. Brendel et al. [19] formulated the network flow-based tracklet association as the maximum weight independent set problem, and applied linear programming to solve it. The methods in [20], [21], and [22] detected body parts in tracklets to extract local appearance features and applied the Viterbi algorithm, a greedy algorithm, and network flow to associate tracklets, respectively. Some researchers applied machine learning algorithms to solve the tracking problem. Li et al. [23] extracted multiple features to build a feature pool, including color histogram, tracklet length, motion, and so on, and presented a HybridBoost algorithm to learn the affinity models between two tracklets. The method in [3] added pairwise features to improve the feature pool of [23] and presented CRF-based tracklet affinity models. Kuo et al. [24], [25] learned an Ada-boost appearance model to distinguish targets. The tracklet association of [24] was based on the Ada-boost appearance model.


The method of [25] improved the association of [24] by adding motion and time gap features, and the association was solved by the Hungarian algorithm. Yang et al. [26] improved the Ada-boost appearance model of [24] and [25] using multiple instance learning and proposed a nonlinear motion pattern-based tracklet association, which was also solved by the Hungarian algorithm.

All of these tracklet association-based methods mainly focus on performance improvements using different association algorithms. They neglect the importance of feature representation, especially appearance features. They always use filter-based methods, such as the Kalman filter [27], to extract features of motion, position, etc. For appearance features, they only extract low-level features from the entire human, such as color histogram and texture. These appearance features may not work well in the case of partial occlusions, illumination variations, and so on.

In order to improve the appearance representation, some new methods have been proposed by combining spatial information. The methods in [4], [28], and [29] divided the entire tracking region of each frame into a fixed number of subregions. The tracking of the entire object is converted into estimating the similarity of each subregion. Kalal et al. [30] built an ensemble classifier to track targets based on the P-N learning method. They generated pixel comparisons offline at random, and these stayed fixed at runtime. These pixel comparisons recorded the pixel locations and feature distances. Besides fixed spatial information, some methods with dynamic spatial information have been proposed. Fan et al. [31] proposed a dynamic subregion method, called attentional regions (AR), to track targets. Local ARs were searched based on gradient and identified based on a branch-and-bound procedure to determine the target location. Low-similarity ARs would be removed and replaced by new ARs. Birchfield et al. [32] proposed a spatiogram appearance model to track targets based on mean shift. The spatiogram contained the spatial means and covariances for each color histogram bin. Wang et al. [5] proposed a spatial–color mixture of Gaussians (SMOG) appearance model for particle filters. The appearance model was represented as a fixed number of Gaussians at runtime. Spatial information and color distribution were computed for each model of SMOG.

Most of the proposed methods use a fixed spatial layout [4], [28]–[30]. This cannot satisfy the dynamic requirements, especially for nonrigid targets with pose changes. Though some methods with dynamic spatial information [5], [31], [32] improve on the fixed spatial layout, these methods still have their own weaknesses. They always use the latest appearance data of objects to update the appearance model. With the increase of frame steps, the model will gradually forget the earlier appearance. When a target is fully occluded for a long time, the appearance models of these methods stop updating during the occlusion, since there are no observations to update them. Due to the complexity of real scenes, the target appearance may have some variations when it reappears after the occlusion. This may be caused by illumination variations, or even by changes of camera perspective due to the target movements. In this case, it is difficult for these methods to estimate the appearance similarity of the target using the latest appearance model. If the target appearance is more similar to the earlier appearance model, the similarity measurement of the latest model would fail.

Our spatial–temporal appearance model improves the appearance representations mentioned above and promotes the tracking performance. Our appearance model not only provides the dynamic spatial layout of the appearance of each target, including the dynamic number and locations of subregions, but also provides the dynamic duration time of each subregion. The temporal distribution of the appearance records not only the latest appearance model, but also the earlier appearance model. We can dynamically select the most similar appearance model to associate tracklets for the final tracking results.

Our association method uses Bayesian motion prediction with a fuzzy search range to guide the appearance association of tracklets. Compared with MHT, our method only predicts the most likely motion direction instead of all possible paths, which require more computational cost. In order to compensate for the imprecise predicted position, we add the fuzzy search strategy.

    III. Overview of Our Approach

We adopt the framework of tracking by tracklet association in our approach. This framework can be mainly formalized as (1). $L$ is the set of association results, $L = \{l_1, \ldots, l_N\}$. Each element of $L$ represents whether two tracklets can be associated. $S$ is the set of tracklets, $S = \{TR_1, \ldots, TR_M\}$. $f(\cdot)$ is a cost function to associate tracklets. Equation (1) represents that the association results are the best matches of tracklets subject to the nonoverlap restriction. The nonoverlap restriction means that two tracklets cannot have overlapping duration time, and two association results cannot share the same tracklet:

$$L^{*} = \arg\max_{L} f(L \mid S)$$
$$\text{subject to} \quad TR_i \cap TR_j = \emptyset, \;\; \forall i, j \le M; \qquad l_p \cap l_q = \emptyset, \;\; \forall p, q \le N. \tag{1}$$

In previous methods, such as [1]–[3] and [23]–[26], each tracklet $TR_i$ is represented by a sequence of detection responses $TR_i = \{r_i^{t_s}, \ldots, r_i^{t_e}\}$, where $r_i^{t_s}$ and $r_i^{t_e}$ denote the responses at the start and end frames of tracklet $TR_i$. The features of each tracklet are extracted from each detection response independently. For example, each response in frame $t$ is represented as $r_i^t = \{a_i^t, p_i^t, s_i^t, v_i^t\}$, where $a_i^t$ is the appearance, $p_i^t$ is the position, $s_i^t$ is the size, and $v_i^t$ is the velocity.

In this paper, we improve the feature extraction of appearance by building a spatial–temporal appearance model for each tracklet instead of extracting the appearance from each response in each frame independently. Therefore, in our method, the complete representation of $TR_i$ is $TR_i = \{G_i, \{r_i^{t_s}, \ldots, r_i^{t_e}\}\}$, where $G_i$ is the spatial–temporal appearance model of the whole tracklet $TR_i$. Then, $G_i$ is used to associate tracklets.

The flow chart of our approach is shown in Fig. 1. First, reliable tracklets are built from the detection responses. Then, there are two main components. One is the spatial–temporal appearance model that is built from each tracklet. We extract the spatial–temporal appearance based on color features. Similar colors are clustered in space and time based on the online EM algorithm.


Fig. 1. Overview of our approach. In the framework of tracking by tracklet association, a new appearance representation called the spatial–temporal appearance model is proposed and used to associate tracklets.

The classes of the tracklet are formulated as a GMM. Each class corresponds to a Gaussian distribution. The appearance of each tracklet is modeled as a whole rather than independently for each frame. The other component is the tracklet association. We predict the locations of targets in the current frame to obtain the motion cost between two tracklets, and apply a Gaussian selection algorithm to the GMMs of the spatial–temporal appearance to find the GMM subsets with the minimal appearance distance between two tracklets. The Jensen–Shannon divergence (JSD) is used to compute the appearance cost between two tracklets. The spatial–temporal appearance cost and the motion cost are used to associate tracklets for the final tracking results. The spatial–temporal appearance model and the tracklet association are described in Sections IV-A and IV-B, respectively.

IV. Spatial–Temporal Appearance and Tracklet Association

A. Spatial–Temporal Appearance

Given the set of reliable tracklets, we start to extract the spatial–temporal appearance for each tracklet. In order to achieve this goal, we apply a GMM to this task. Each Gaussian distribution represents the spatial–temporal color distribution of pixels that belong to the same class. Each Gaussian distribution corresponds to a class. The weight of each Gaussian model represents whether the Gaussian distribution is important or not: the greater the weight, the higher the importance. Here, we apply the online EM algorithm to estimate the parameters of the GMM rather than offline EM, since the association of tracklets runs from previous frames until the current frame. With the increase of frame steps, tracklets that are not terminated in the current frame may grow in future frames. The online EM can handle this case, as data are added in each future frame to update the parameters of the GMM. If we used offline EM to estimate the parameters of the GMM, with the addition of new data from future frames, the parameters of the GMM would have to be recomputed completely using the new sample space, including both old and new samples. This would waste a lot of time due to recomputing the distribution over all samples. That means tracklets could not compute the appearance distribution using offline EM until they terminated completely before the current frame. For this reason, the online EM is more appropriate. It can achieve reasonable statistical results as new data are added in the current frame. In order to show how we use online EM to estimate the parameters of the GMM in our paper, we build Algorithm 1.

Algorithm 1: Learning GMM appearance model
Input: Tracklet TR = {r^t}; initialize t at the beginning of TR; K = 0
1:  repeat
2:    for each pixel x_{r_i^t} of r^t do
3:      if K = 0 then
4:        Initialize a new Gaussian distribution; K ← K + 1
5:      end
6:      E-step: for k = 1 to K do
7:        ρ_{ik} = N(x_{r_i^t} | μ_k, Σ_k)
8:      end
9:      if max_k(ρ_{ik}) < θ_1 then
10:       Initialize a new Gaussian distribution; K ← K + 1; break
11:     end
12:     M-step: for k = 1 to K do
13:       ρ̂_{ik} = ρ_{ik} / Σ_{m=1}^{K} ρ_{im};  C_k ← C_k + ρ̂_{ik}
14:       μ_k^{new} = (1 − ρ̂_{ik}/C_k) μ_k^{old} + (ρ̂_{ik}/C_k) x_{r_i^t}
15:       Σ_k^{new} = (1 − ρ̂_{ik}/C_k) Σ_k^{old} + (ρ̂_{ik}/C_k)(x_{r_i^t} − μ_k^{new})(x_{r_i^t} − μ_k^{new})^T
16:       ω_k = C_k / Σ_{m=1}^{K} C_m
17:     end
18:   end
19:   for a = 1 to K do
20:     for b = 1 to K do
21:       if JSD(N_a || N_b) < ε then
22:         μ_{ab} = [C_a/(C_a + C_b)] μ_a + [C_b/(C_a + C_b)] μ_b
23:         Σ_{ab} = [C_a/(C_a + C_b)] Σ_a + [C_b/(C_a + C_b)] Σ_b
24:         ω_{ab} = (C_a + C_b) / Σ_{k=1}^{K} C_k
25:         T_{ab} = max(T_a^{end}, T_b^{end}) − min(T_a^{start}, T_b^{start})
26:       end
27:     end
28:   end
29:   Compute the new end frame T_k^{end} for each Gaussian N_k
30:   T_k = T_k^{end} − T_k^{start}
31:   t ← t + 1
32: until r^t is the tail of TR
Output: GMM appearance distribution with K Gaussian distributions

First, we need to define some important variables before showing the algorithm for building the spatial–temporal appearance. Suppose $G$ is the GMM appearance distribution for a tracklet $TR$. $N_k$ is one of the Gaussian distributions in the GMM, $G = \{N_k\}$, where $k$ is the index of each Gaussian distribution. For each Gaussian distribution $N_k$, we define $N_k = \{\omega_k, \mu_k, \Sigma_k, T_k\}$. $\omega_k$ is the weight of the Gaussian distribution $N_k$, $\mu_k$ is the mean, and $\Sigma_k$ is the covariance matrix.


Finally, $T_k$ is the duration time of the Gaussian distribution $N_k$. The mean $\mu_k$ and covariance $\Sigma_k$ each contain five parameters: the positions $x$ and $y$, which are normalized by the width and height of the human detection bounding box, and the color channels $R$, $G$, and $B$, which are normalized by the value range of the RGB color space. They are independent distributions. This is shown in (2), where $\mathrm{diag}\{\cdot\}$ denotes a diagonal matrix:

$$\mu_k = \{\mu_{kx}, \mu_{ky}, \mu_{kR}, \mu_{kG}, \mu_{kB}\}^{T}$$
$$\Sigma_k = \mathrm{diag}\{\sigma_{kx}^{2}, \sigma_{ky}^{2}, \sigma_{kR}^{2}, \sigma_{kG}^{2}, \sigma_{kB}^{2}\}. \tag{2}$$

Here, we must further explain the definition of the variables. For the position distribution, we use $\{\mu_{kx}, \mu_{ky}, \sigma_{kx}, \sigma_{ky}\}$ to model it. Since there are some displacements of the location of the same class in each frame as a class crosses several frames, it is difficult to use a fixed position range to describe it. For this reason, we use a Gaussian distribution to model the position of the class. In addition, the position of each class is the relative position in the detection bounding box, rather than the absolute position in the image.

The duration time of each Gaussian is used to constrain whether Gaussians can form a subset to compute the similarity with another subset. Only Gaussians with overlapping duration time can form a subset. We do not use a Gaussian distribution to model the duration, since the variance cannot reflect the real duration time of each Gaussian. For example, suppose the duration time of Gaussian $a$ is from frame 1 to 10 and the duration time of another Gaussian $b$ is from frame 6 to 30. These two Gaussians have an overlapping time of five frames. If we used a Gaussian to model the time dimension, the time distribution of Gaussian $a$ would be $\mu_{at} = 5.5$, $\sigma_{at} = 2.87$, and the time distribution of Gaussian $b$ would be $\mu_{bt} = 18$, $\sigma_{bt} = 7.2$. It would be difficult to estimate whether two Gaussians overlap or not using the time variance.
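To make the overlap test concrete, a duration can simply be kept as a frame interval [T^start, T^end]; the tiny sketch below (our illustration, not from the paper) checks whether two such intervals overlap.

```python
def durations_overlap(start_a, end_a, start_b, end_b):
    """Check whether two duration intervals [start, end] (in frames) overlap.

    Illustrative helper only: Gaussian a spanning frames 1-10 and Gaussian b
    spanning frames 6-30 overlap, regardless of their time variances.
    """
    return max(start_a, start_b) <= min(end_a, end_b)

# Example from the text: frames 1-10 vs. frames 6-30 overlap by five frames.
assert durations_overlap(1, 10, 6, 30)
assert not durations_overlap(1, 10, 20, 30)
```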

The algorithm for building the spatial–temporal appearance is shown in Algorithm 1. The input of this algorithm is a tracklet $TR$, where $r^t$ is the detection response of $TR$ in frame $t$. We initialize the frame index $t$ at the beginning of tracklet $TR$. The number of Gaussian distributions $K$ is initialized to 0 before the algorithm runs. For each tracklet $TR$, we compute the parameters of the GMM distribution using the online EM algorithm based on the detection response of each frame until the tail of tracklet $TR$. For each detection response $r^t$ of tracklet $TR$, we compute the similarity of each pixel of response $r^t$ with each Gaussian distribution and select the maximal similarity. If the maximal similarity is less than the threshold $\theta_1$, we initialize a new Gaussian distribution in the GMM. Otherwise, all Gaussian distributions are updated based on the online EM algorithm. Lines 6 to 11 are the E-step of online EM, and lines 12 to 17 are the M-step. In line 13, the similarity $\rho_{ik}$ of each Gaussian distribution is normalized to $\hat{\rho}_{ik}$ and summed into $C_k$ to form the final update factor $\hat{\rho}_{ik}/C_k$. The component $\hat{\rho}_{ik}$ of this update factor updates the Gaussians in proportion to their estimated posterior probability in each frame. The component $C_k$ guarantees the stability of the GMM parameters when new samples are added, due to the accumulation of a large number of samples. After the online EM in each frame, we need to check whether two Gaussian distributions should be merged. Here, we use the JSD [33] to estimate the distance between two Gaussian distributions. The definition of the JSD distance will be described in (6) and (7) of Section IV-B. When the distance between two Gaussian distributions $a$ and $b$ is less than $\varepsilon$, which is a very small number, we merge them based on the proportion of $C_a$ and $C_b$. This is shown in lines 22 to 24. The duration time of Gaussian distributions $a$ and $b$ is also merged, based on line 25. Finally, in lines 29 and 30, we update the duration time of each Gaussian distribution, and repeat the above algorithm for the next frame $t + 1$ until the tail of tracklet $TR$. Based on this algorithm, the schematic diagram of the spatial–temporal appearance is shown in Fig. 2.

Fig. 2. Illustration of the spatial–temporal appearance model. Each ellipsoid represents a Gaussian distribution of the appearance model and the duration time of the Gaussian distribution.
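As an illustration of the update in Algorithm 1, the following Python sketch (our own, not the authors' code) runs the per-pixel online EM step on the 5-D feature [x, y, R, G, B] of (2) with diagonal covariances. The initial variance, the default value of θ₁, and the rule that only the best-matching component extends its duration are assumptions made for the example, and the merging step of lines 19 to 28 is omitted.

```python
import numpy as np

class OnlineGMMAppearance:
    """Minimal sketch of the per-pixel online EM update of Algorithm 1.

    Each component keeps a weight proxy C_k, a 5-D mean/variance over the
    normalized feature [x, y, R, G, B], and a duration interval in frames.
    """

    def __init__(self, theta1=0.1):
        self.theta1 = theta1              # threshold for spawning a new Gaussian
        self.mu, self.var, self.C, self.t0, self.t1 = [], [], [], [], []

    def _pdf(self, k, x):
        # Diagonal-covariance Gaussian density N(x | mu_k, Sigma_k).
        d = x - self.mu[k]
        return np.exp(-0.5 * np.sum(d * d / self.var[k])) / \
               np.sqrt((2 * np.pi) ** len(x) * np.prod(self.var[k]))

    def update(self, x, t):
        """E/M step for one pixel feature x observed in frame t."""
        if not self.mu or max(self._pdf(k, x) for k in range(len(self.mu))) < self.theta1:
            # Spawn a new component (lines 3-4 / 9-10); 0.01 is an assumed init variance.
            self.mu.append(np.array(x, dtype=float))
            self.var.append(np.full(len(x), 0.01))
            self.C.append(1.0)
            self.t0.append(t); self.t1.append(t)
            return
        rho = np.array([self._pdf(k, x) for k in range(len(self.mu))])
        rho_hat = rho / rho.sum()                       # normalization of line 13
        for k in range(len(self.mu)):                   # M-step (lines 13-16)
            self.C[k] += rho_hat[k]
            lr = rho_hat[k] / self.C[k]                 # update factor rho_hat / C_k
            self.mu[k] = (1 - lr) * self.mu[k] + lr * x
            d = x - self.mu[k]
            self.var[k] = (1 - lr) * self.var[k] + lr * d * d
        self.t1[int(np.argmax(rho_hat))] = t            # extend duration of best match

    def weights(self):
        C = np.array(self.C)
        return C / C.sum()                              # omega_k = C_k / sum_m C_m

# Usage sketch: feed every pixel feature of every response of a tracklet.
# model = OnlineGMMAppearance(theta1=0.1)
# for t, response in enumerate(tracklet_pixels):   # response: list of [x, y, R, G, B]
#     for feat in response:
#         model.update(np.asarray(feat, dtype=float), t)
```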

    B. Tracklet Association

After building the spatial–temporal appearance model for each tracklet, we start to associate the tracklets for the final tracking results. The main idea of our tracklet association is based on the motion prediction of Bayesian methods, since the motion of targets in the current frame can be predicted based on the motion of several previous frames. In order to implement this strategy, a popular prediction tool in computer vision research, the Kalman filter [27], is typically used to predict the location of a target. The Kalman filter is a linear system with Gaussian noise based on a Markov chain; it predicts the target location based only on the information of the last frame. By repeating the Kalman filter frame by frame, we can predict the location of the target. When a target is occluded, the reliable tracklet of the target is terminated. Then, based on the latest Kalman state of this tracklet, we can predict the approximate location of the target in future frames while the target is occluded. However, the linear motion prediction may be imprecise over long frame gaps due to nonlinear motion [34]. We will propose a strategy to alleviate this in the following part.
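As a sketch of the motion prediction described above, the snippet below rolls a Kalman prediction forward over an occlusion gap; the constant-velocity state [px, py, vx, vy] and the noise level q are our assumptions for illustration, not values given in the paper.

```python
import numpy as np

def predict_location(x, P, n_frames, q=1e-2):
    """Roll a constant-velocity Kalman prediction forward n_frames.

    Illustrative sketch only: the paper uses a Kalman filter for motion
    prediction, but the state model and noise level here are our choices.
    """
    F = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # one-frame transition
    Q = q * np.eye(4)                            # process noise
    for _ in range(n_frames):                    # prediction only: no observations
        x = F @ x                                # are available during the gap
        P = F @ P @ F.T + Q
    return x[:2], P                              # predicted (px, py) and covariance

# Example: last Kalman state of a terminated tracklet, predicted 10 frames ahead.
x0 = np.array([120.0, 80.0, 2.0, -0.5])          # position and per-frame velocity
loc, P = predict_location(x0, np.eye(4), n_frames=10)
```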

When we obtain the predicted location of the occluded target, in the normal case, we always search for the target around the predicted location to track it continually. However, the range "around the predicted location" is difficult to define precisely by a fixed boundary. We propose a fuzzy search method based on an exponential distribution instead of a fixed boundary.


The exponential distribution can be modeled by $\beta^{\,d(L_m, L_n)}$, where $\beta$ is the base of the exponential distribution and $d(\cdot)$ is the normalized Euclidean distance between the predicted location $L_m$ and the target location $L_n$ of another tracklet in the current frame. With the increase of the search radius, the location distance becomes larger and larger. That means the probability of finding the target becomes lower and lower as the search radius increases:

$$S_{m,n} = JSD(G_m \| G_n) \cdot \beta^{\,d(L_m, L_n)}. \tag{3}$$

In order to associate tracklets to form the final tracking results, we still need to use the spatial–temporal appearance model to measure the appearance similarity between tracklets. The association of tracklets can be represented as (3). $S_{m,n}$ represents the similarity distance between two tracklets. If $S_{m,n}$ is lower than that of any other tracklet pair, the two tracklets can be associated. This can be solved by the Hungarian algorithm [35] in our approach. In fact, for efficiency of implementation, the approach uses a time sliding window on the video sequence to compute the smallest distance $S_{m,n}$; it does not process the whole video together. $JSD(\cdot)$ is the Jensen–Shannon divergence used to compute the similarity between two spatial–temporal appearance models. $G_m$ and $G_n$ are two GMM distributions; we will describe the details of the JSD in the following part. The factor $\beta^{d(L_m, L_n)}$ is the exponential distribution of the fuzzy search method. In order to decide whether two tracklets can form a pair to compute the similarity, the nonoverlap restriction shown in (1) must be satisfied. This restriction is reasonable, since a person cannot belong to two tracks or appear in two places at the same time.
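The association step can be pictured as building a pairwise cost matrix from (3)/(5) under the nonoverlap restriction and solving it with the Hungarian algorithm. The sketch below is our illustration: jsd_gmm, predicted_position, the tracklet attributes, and max_cost are assumed placeholders rather than the paper's implementation, and the distance is left unnormalized.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(terminated, initialized, jsd_gmm, predicted_position, beta=100.0, max_cost=1e3):
    """Sketch of tracklet association via a pairwise cost matrix and the
    Hungarian algorithm. `terminated`/`initialized` are lists of tracklets;
    `jsd_gmm` returns the appearance distance of (4) between two GMMs and
    `predicted_position` returns the Kalman-predicted location of a tracklet."""
    cost = np.full((len(terminated), len(initialized)), max_cost)
    for i, tr_m in enumerate(terminated):
        for j, tr_n in enumerate(initialized):
            if tr_m.end_frame >= tr_n.start_frame:       # nonoverlap restriction of (1)
                continue
            d = np.linalg.norm(predicted_position(tr_m, tr_n.start_frame)
                               - tr_n.start_location)     # fuzzy search distance
            cost[i, j] = jsd_gmm(tr_m.gmm, tr_n.gmm) * beta ** d   # Eq. (3)/(5)
    rows, cols = linear_sum_assignment(cost)             # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```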

The distance between two GMM distributions can be computed by using their weights, means, and variances instead of comparing each sample pair. This can be solved by using the JSD or the Kullback–Leibler divergence (KLD) [36]. We employ the JSD to compute the similarity between two GMM distributions instead of the KLD, since the KLD is a nonsymmetric and unnormalized distance. The KLD is in the range of $[0, +\infty)$, and the distance $KLD(G_m \| G_n)$ is different from the distance $KLD(G_n \| G_m)$. However, the JSD is in the range of $[0, 1]$, and it is a symmetric distance. Therefore, the JSD is more suitable to evaluate the distance between two GMM distributions.

The similarity between two GMMs still cannot be computed directly, since different orderings of the Gaussians in the GMMs lead to different distances. We need a strategy to find the smallest distance between two GMMs. The smallest distance leads to the strictest comparison in the global tracklet association. Furthermore, the appearance of a person must appear as a whole: the subregions of the appearance must appear with overlapping duration time, otherwise the combination of subregions would be meaningless. In order to satisfy this requirement, which is the smallest distance between two GMMs under the restriction of overlapping duration time, we present Algorithm 2 to select Gaussian distributions from the two GMMs to compute the similarity. The selection in this algorithm is mainly based on the parameter $T_k$ of each Gaussian distribution.

The selection algorithm of Gaussian distributions is shown in Algorithm 2. We first input two GMM distributions ($G_m$ and $G_n$), which belong to two tracklets, respectively, and initialize two sets $A$ and $B$ to store the Gaussian distributions selected from $G_m$ and $G_n$, respectively. $G_m$ is a GMM distribution that belongs to a terminated tracklet. $G_n$ is a GMM distribution that belongs to a newly initialized tracklet.

Algorithm 2: Gaussian selection from GMM
Input: Two GMM distributions G_m and G_n; initialize two sets A = B = ∅
1:  for each N_{m,i} of G_m do
2:    for each N_{n,j} of G_n do
3:      d_{mi,nj} = JSD(N_{m,i} || N_{n,j})
4:    end
5:  end
6:  (i, j) = argmin_{i,j} d_{mi,nj}
7:  Add N_{m,i} to A; add N_{n,j} to B
8:  repeat
9:    Initialize two sets AT = ∅ and BT = ∅
10:   for each N_{m,i} of G_m with N_{m,i} ∉ A and T_{m,i} ∩ T_A ≠ ∅ do
11:     Add N_{m,i} to AT
12:     for each N_{n,j} of G_n with N_{n,j} ∉ B and T_{n,j} ∩ T_B ≠ ∅ do
13:       Add N_{n,j} to BT
14:       d_{mi,nj} = JSD(N_{m,i} || N_{n,j})
15:     end
16:   end
17:   if AT ≠ ∅ and BT ≠ ∅ then
18:     (i, j) = argmin_{i,j} d_{mi,nj}
19:     Add N_{m,i} to A; add N_{n,j} to B
20:   end
21:   if (AT ≠ ∅ and BT = ∅) or (AT = ∅ and BT ≠ ∅) then
22:     for each Gaussian N_{AT,p} (or N_{BT,q}) of AT (or BT) do
23:       Find the Gaussian N_B (or N_A) in set B (or A) with the minimal JSD distance
24:       Add N_{AT,p} (or N_{BT,q}) to A (or B)
25:       Repeat N_B (or N_A) in B (or A)
26:     end
27:   end
28: until A and B do not increase
29: Normalize the weights of the Gaussians in sets A and B, respectively
Output: The sets A and B, called GMM distributions G'_m and G'_n, respectively; their Gaussian numbers K'_m and K'_n are equal, so we call them K'

First, from lines 1 to 7, we compute the similarity of each pair of Gaussian distributions in $G_m$ and $G_n$, and store the minimal-distance pair $(N_{m,i}, N_{n,j})$ in the sets $A$ and $B$, respectively: $N_{m,i}$ is added to $A$, and $N_{n,j}$ is added to $B$. Then, from lines 8 to 28, we repeat the selection process until the sets $A$ and $B$ do not increase.

In lines 10 and 12, which have the same meaning, the loop conditions are changed compared with lines 1 and 2. For line 10, $N_{m,i}$ must not already belong to the set $A$ before its similarity is computed. $T_A$ represents the duration times of the Gaussian distributions in set $A$. $T_{m,i} \cap T_A \neq \emptyset$ means that the duration time $T_{m,i}$ of $N_{m,i}$ overlaps the duration time of each Gaussian distribution in set $A$; this is illustrated in Fig. 3(a). This condition ensures that the Gaussian distributions in set $A$ (or $B$) appear at the same time, since the integrated appearance of a person must appear as a whole. However, the Gaussian distributions of the original input $G_m$ (or $G_n$) may not appear simultaneously. As in Fig. 3(b), some Gaussian distributions do not appear simultaneously. From lines 17 to 27, we select the minimal-distance pair and store it.


Fig. 3. Selection of Gaussian distributions.

When $AT \neq \emptyset$ and $BT \neq \emptyset$, we can directly compute the minimal distance between Gaussian distributions and store them in $A$ and $B$; this is shown in lines 17 to 19. In line 21, if one of these two sets is empty, we follow lines 22 to 26 to find a match. This is very important: if a Gaussian distribution in $AT$ cannot find a match in $BT$, it will be discarded, which leads to missing features. We therefore repeat some Gaussian distributions in set $B$ to match them. Finally, in line 29, we normalize the weights of the Gaussian distributions in sets $A$ and $B$, respectively.

After the selection process, the two sets $A$ and $B$ are output. The Gaussian distributions in set $A$ are called the GMM distribution $G'_m$, and the Gaussian distributions in set $B$ are called the GMM distribution $G'_n$. The numbers of Gaussian distributions of $G'_m$ and $G'_n$ are $K'_m$ and $K'_n$, respectively. Since $K'_m = K'_n$, we call them $K'$.

Algorithm 2 is thus a dynamic selection algorithm of Gaussian distributions for the minimal GMM distance. The dynamic selection may select the GMM appearance that appeared the earliest, or the latest, as long as the appearance distance between the two tracklets is minimal.

After the selection of Algorithm 2, the distance between the two GMM distributions can be computed by (4), where $\omega_{m,k}$ and $\omega_{n,k}$ are the weights of the Gaussian distributions $N_{m,k}$ and $N_{n,k}$. Equation (3) is then modified into (5):

$$JSD(G'_m \| G'_n) = \sum_{k=1}^{K'} \frac{\omega_{m,k} + \omega_{n,k}}{\sum_{k=1}^{K'} (\omega_{m,k} + \omega_{n,k})} \, JSD(N_{m,k} \| N_{n,k}) \tag{4}$$

$$S_{m,n} = JSD(G'_m \| G'_n) \cdot \beta^{\,d(L_m, L_n)}. \tag{5}$$

Finally, we need to compute the JSD distance between two single Gaussian distributions. For ease of understanding, we consider two single Gaussian distributions $N_1$ and $N_2$, where $N_1$ corresponds to $N_{m,k}$ and $N_2$ corresponds to $N_{n,k}$. The definition of the JSD in line 21 of Algorithm 1 is the same as the description in this part, with $N_1$ and $N_2$ corresponding to $N_a$ and $N_b$ in Algorithm 1, respectively:

$$JSD(N_1 \| N_2) = \frac{1}{2}\left[ KLD(N_1 \| \bar{N}) + KLD(N_2 \| \bar{N}) \right] \tag{6}$$

$$\bar{N} = \frac{1}{2}(N_1 + N_2). \tag{7}$$

The general form of the JSD is shown in (6) and (7). $N_1 = \{\omega_1, \mu_1, \Sigma_1\}$ and $N_2 = \{\omega_2, \mu_2, \Sigma_2\}$ are the two Gaussian distributions for which we compute the JSD. $KLD(\cdot)$ is the Kullback–Leibler divergence. $\bar{N} = \{\bar{\omega}, \bar{\mu}, \bar{\Sigma}\}$ is the mixture distribution of $N_1$ and $N_2$. In [37], $\bar{N}$ is computed by (8) and (9) without Gaussian weights:

$$\bar{\mu} = \frac{1}{2}(\mu_1 + \mu_2) \tag{8}$$

$$\bar{\Sigma} = \frac{1}{2}\left( \Sigma_1 + \Sigma_2 + \mu_1\mu_1^{T} + \mu_2\mu_2^{T} \right) - \bar{\mu}\bar{\mu}^{T}. \tag{9}$$

Since our spatial–temporal appearance model is based on a GMM distribution, the weight of each Gaussian distribution is still important. We modify (8) and (9) with the weights of the Gaussian distributions, as shown in (10), (11), and (12). The KLD is shown in (13) and (14). Here, we compute the divergence $KLD(N_1 \| \bar{N})$ as an example; $KLD(N_2 \| \bar{N})$ is computed in the same way. $P$ is the normalized Gaussian distribution ($\sum_x P(x) = 1$) and $\dim$ is the dimension of the Gaussian distribution:

$$\bar{\omega} = \frac{1}{2}(\omega_1 + \omega_2) \tag{10}$$

$$\bar{\mu} = \frac{\omega_1\mu_1 + \omega_2\mu_2}{\omega_1 + \omega_2} \tag{11}$$

$$\bar{\Sigma} = \frac{\omega_1\left(\Sigma_1 + \mu_1\mu_1^{T}\right) + \omega_2\left(\Sigma_2 + \mu_2\mu_2^{T}\right)}{\omega_1 + \omega_2} - \bar{\mu}\bar{\mu}^{T} \tag{12}$$

$$\begin{aligned} KLD(N_1 \| \bar{N}) &= \sum_x N_1(x) \ln\frac{N_1(x)}{\bar{N}(x)} = \sum_x \omega_1 P_1(x) \ln\frac{\omega_1 P_1(x)}{\bar{\omega}\bar{P}(x)} \\ &= \omega_1 \ln\frac{\omega_1}{\bar{\omega}} \sum_x P_1(x) + \omega_1 \sum_x P_1(x) \ln\frac{P_1(x)}{\bar{P}(x)} \\ &= \omega_1 \ln\frac{\omega_1}{\bar{\omega}} + \omega_1 \sum_x P_1(x) \ln\frac{P_1(x)}{\bar{P}(x)}. \end{aligned} \tag{13}$$

Replacing $P$ with the Gaussian distribution function, we obtain

$$KLD(N_1 \| \bar{N}) = \omega_1 \ln\frac{\omega_1}{\bar{\omega}} + \omega_1 \cdot \frac{1}{2}\left[ \ln\frac{|\bar{\Sigma}|}{|\Sigma_1|} + \mathrm{tr}\left(\bar{\Sigma}^{-1}\Sigma_1\right) - \dim + \left(\bar{\mu} - \mu_1\right)^{T} \bar{\Sigma}^{-1} \left(\bar{\mu} - \mu_1\right) \right]. \tag{14}$$
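The weighted-Gaussian JSD of (6) and (10)-(14), and the GMM-level combination of (4) over the K' matched pairs, can be written compactly; the sketch below is our reading of these formulas in NumPy (function names and array conventions are ours), not the authors' implementation.

```python
import numpy as np

def kld_weighted(w1, mu1, S1, wbar, mubar, Sbar):
    """KLD(N1 || N_bar) for weighted Gaussians, following Eq. (13)-(14)."""
    dim = mu1.shape[0]
    d = mubar - mu1
    Sbar_inv = np.linalg.inv(Sbar)
    gauss_term = 0.5 * (np.log(np.linalg.det(Sbar) / np.linalg.det(S1))
                        + np.trace(Sbar_inv @ S1) - dim + d @ Sbar_inv @ d)
    return w1 * np.log(w1 / wbar) + w1 * gauss_term

def jsd_gaussians(w1, mu1, S1, w2, mu2, S2):
    """JSD between two weighted Gaussians via the merged distribution of Eq. (10)-(12)."""
    wbar = 0.5 * (w1 + w2)                                            # Eq. (10)
    mubar = (w1 * mu1 + w2 * mu2) / (w1 + w2)                         # Eq. (11)
    Sbar = ((w1 * (S1 + np.outer(mu1, mu1)) + w2 * (S2 + np.outer(mu2, mu2)))
            / (w1 + w2) - np.outer(mubar, mubar))                     # Eq. (12)
    return 0.5 * (kld_weighted(w1, mu1, S1, wbar, mubar, Sbar)
                  + kld_weighted(w2, mu2, S2, wbar, mubar, Sbar))     # Eq. (6)

def jsd_gmm(weights_m, mus_m, Ss_m, weights_n, mus_n, Ss_n):
    """GMM-level distance of Eq. (4) over K' matched Gaussian pairs."""
    pair_w = np.array(weights_m) + np.array(weights_n)
    pair_jsd = np.array([jsd_gaussians(weights_m[k], mus_m[k], Ss_m[k],
                                       weights_n[k], mus_n[k], Ss_n[k])
                         for k in range(len(weights_m))])
    return float(np.sum(pair_w / pair_w.sum() * pair_jsd))
```

Covariances are passed as full matrices here; for the diagonal covariances of (2), diagonal matrices can be supplied directly.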

    V. Implementation Details

    A. Building Reliable Tracklets

We borrow the idea of [1] to build reliable tracklets, for the same reasons as in [3] and [23]–[26], since it is a simple and conservative method. This approach considers that the changes of the target are very small in two consecutive frames, including the position displacement, the changes of appearance, and so on. The affinity is formalized as

$$r^{t*} = \arg\min_{r_i^t:\; d\left(L_{r_i^t},\, L_{r^{t-1}}\right) \le \Omega} B\left(r_i^t, r^{t-1}\right) \tag{15}$$


where $L_{r_i^t}$ represents the center location of the detection response $r_i^t$ in frame $t$, and $L_{r^{t-1}}$ represents the center location of the observation $r^{t-1}$. $d(\cdot)$ is the Euclidean distance. $B(\cdot)$ is the Hellinger distance of the color histograms between two observations, defined as $\sqrt{1 - BC(r_i^t, r^{t-1})}$, where $BC(\cdot)$ is the Bhattacharyya coefficient. $\Omega$ represents the range of the neighborhood position. In (15), we compute the appearance distance between detection responses in the neighborhood of $r^{t-1}$, and link the response with the minimal distance, $r^{t*}$, to the corresponding tracklet. This strategy is conservative and biased toward linking only reliable associations between any two consecutive frames. In order to prevent unsafe associations of responses, a boundary value of the affinity, $\theta_2$, is defined in [1]. In other words, two responses are linked if and only if their affinity distance is lower than the threshold $\theta_2$ and significantly lower than the distance of any other pair. In our method, the boundary value is $\theta_2 = 0.3$, since the Hellinger distance is in the range of $[0, 1]$.
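For reference, the appearance term of (15) reduces to a few lines; the helper below is a small sketch of the Hellinger distance computed from the Bhattacharyya coefficient of two normalized color histograms (the histogram binning is left to the caller).

```python
import numpy as np

def hellinger_distance(hist_a, hist_b):
    """Hellinger distance between two color histograms, computed from the
    Bhattacharyya coefficient as sqrt(1 - BC); result lies in [0, 1]."""
    hist_a = hist_a / hist_a.sum()
    hist_b = hist_b / hist_b.sum()
    bc = np.sum(np.sqrt(hist_a * hist_b))        # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

# Two responses are linked only if this distance is below theta_2 = 0.3
# and clearly below the distance of any competing pair.
```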

Since the human detector only gives the approximate size of the detection responses, the size may be larger or smaller than the size of the ground truth target. In order to extract accurate features for the tracklet association, tracklet refinement is needed. Let $s^t$ be the size of an observation $r^t$ in a tracklet $TR$, where $t$ is the frame index. The refinement can be formalized as

$$s^{t} = \frac{1}{\tau}\sum_{i=1}^{\tau} s^{t-i} \tag{16}$$

where $\tau$ is the length of the time sliding window in the refinement filter and is set to five. The size of the detection response $r^t$ is refined based on the average size over the previous frames. By repeating this equation until the tail of the tracklet, the size of each detection response in the tracklet can be smoothed.
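Read directly, (16) is a causal moving average over the previous τ = 5 sizes; the helper below is a minimal sketch of that smoothing (the handling of the first frames, which lack a full window, is our choice).

```python
import numpy as np

def smooth_sizes(sizes, tau=5):
    """Smooth detection-response sizes along a tracklet with a causal moving
    average over the previous tau frames, following Eq. (16). The first tau
    frames, which lack a full window, are left unchanged in this sketch."""
    sizes = np.asarray(sizes, dtype=float)
    smoothed = sizes.copy()
    for t in range(tau, len(sizes)):
        smoothed[t] = sizes[t - tau:t].mean()    # average of the tau previous sizes
    return smoothed

# Example: widths (or heights) of a tracklet's bounding boxes, tau = 5.
print(smooth_sizes([30, 31, 29, 35, 30, 42, 31], tau=5))
```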

    B. Entry and Exit Model

In order to use the Hungarian algorithm to compute the optimal association results of (3), the entry/exit model of the scene must be defined to specify the initialization/termination of each tracklet in the camera view. The entry/exit model includes two parts in our approach: one is the boundary of the camera view, and the other is the entry/exit zones within the camera view. The boundary of the camera view easily specifies the initialization/termination of a tracklet: when a target enters the image across the boundary of the camera view, its tracklet is initialized, and when a target leaves the image across the boundary of the camera view, its tracklet is terminated. The entry/exit zones within the camera view need to be learned.

We use our previous work [38] to finish this task. That work is a trajectory analysis method that can also learn entry/exit zones. In [38], each trajectory is represented as a sequence of key points and start/end points. The features of each key point are its coordinates, turning angle (TA), and turning angle direction (TAD). For the start/end points, the features are only the coordinates. Key points are used to learn the classification of trajectories; the start/end points are used to learn the entry/exit zones of scenes. In our approach, only the coordinate features of the start/end points of each trajectory are extracted, without key points, since we only use the method of [38] to learn the entry/exit zones and not for trajectory classification. The start/end points of trajectories are clustered using an unsupervised EM algorithm. Each class is modeled as a Gaussian distribution, represented as $\{\mu_x, \mu_y, \sigma_x^2, \sigma_y^2\}$, where $\mu_x$ and $\mu_y$ are the means of the coordinates $x$ and $y$, and $\sigma_x^2$ and $\sigma_y^2$ are their variances. Finally, the classes of the start/end points of the trajectories are the entry/exit zones.
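As a sketch of this clustering step, the snippet below fits a Gaussian mixture to trajectory start/end points with scikit-learn's GaussianMixture, used here as a stand-in for the unsupervised EM clustering of [38]; the number of zones and the sample points are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_entry_exit_zones(endpoints, n_zones=4):
    """Cluster trajectory start/end points into entry/exit zones with an
    EM-fitted Gaussian mixture. Each learned zone is a Gaussian described
    by {mu_x, mu_y, sigma_x^2, sigma_y^2} (diagonal covariance)."""
    gmm = GaussianMixture(n_components=n_zones, covariance_type='diag')
    gmm.fit(np.asarray(endpoints, dtype=float))
    return list(zip(gmm.means_, gmm.covariances_))

# endpoints: (x, y) image coordinates of trajectory start and end points.
zones = learn_entry_exit_zones([(10, 200), (12, 210), (630, 40), (620, 55)], n_zones=2)
```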

    VI. Experiments

    A. Dataset and Test Method

1) Dataset: We use four challenging datasets to test our approach: TRECVID 2008 [39], the CAVIAR dataset [40], the ETH Mobile Platform dataset [41], and the Terrace sequence of the EPFL dataset [42], [43]. All of these datasets include many mutual occlusions in crowded scenes.

Many state-of-the-art human detectors can be adopted to detect human responses, such as [44]–[46]. For the TRECVID, ETH, and CAVIAR datasets, to compare fairly with state-of-the-art tracking approaches, we adopt the human detection results used in [25], since these detection results are also used in [1], [3], [23], [24], and [26]. For the Terrace sequence, we use the method of [44].

The TRECVID 2008 dataset consists of hundreds of hours of video from five fixed cameras covering different fields of view. We follow the data setting of [23], since that paper first used this dataset; its setting is also used in [1], [3], and [24]–[26]. The CAVIAR dataset is captured in a shopping center corridor by two fixed cameras from two different viewpoints and contains 26 video sequences. We follow the setting of [25], which selected the 20 videos using only the corridor view; this setting is also used in previous works, such as [1], [23], [24], and [26]. The ETH dataset is captured by a stereo pair of forward-looking moving cameras in a busy street scene. Due to the low position of the cameras, full occlusions also often happen in these videos. We follow the dataset setting proposed in [25] and its ground truth. In [25], only the left view of the videos was used, without stereo depth maps. The EPFL Terrace sequence uses four cameras with overlapping camera views. The objects appear for a very long time, and their appearance varies more often. Since our method only uses a monocular camera, we only use the data of camera 0 to test our approach. The details of these datasets are shown in Table I.

2) Test Method and Evaluation Metrics: We conduct four experiments to evaluate the effectiveness of our approach. The first experiment tests how the parameter value affects the tracking performance. The second experiment compares our approach with several state-of-the-art methods on the TRECVID, ETH, and CAVIAR datasets. The comparison methods include [1]–[3] and [23]–[26], since these methods are the latest results and reflect the best performance of tracking by tracklet association. The third experiment tests the robustness of our approach using the Terrace sequence of the EPFL dataset, since objects appear for hundreds or even thousands of frames and the appearance of an object varies more often in these sequences. The fourth experiment measures the computational cost of our approach.


    TABLE I

    Dataset

    TABLE II

    Discussion of Parameter Values

In our experiments, to compare fairly with state-of-the-art approaches, we adopt the metrics used in [23] to evaluate the tracking performance, since all of the comparison methods follow these metrics. The metrics in [23] are an improved version of the original metrics in [47]; in [23], track fragments and ID switches are defined more strictly but more clearly than in the original definitions of [47]. The metrics are as follows.

1) Ground truth (GT): The number of trajectories in the ground truth.
2) Mostly tracked trajectories (MT): The percentage of trajectories that are successfully tracked for more than 80%, divided by GT.
3) Partially tracked trajectories (PT): The percentage of trajectories that are tracked between 20% and 80%, divided by GT.
4) Mostly lost trajectories (ML): The percentage of trajectories that are tracked for less than 20%, divided by GT.
5) Fragments (Frag): The total number of times that a trajectory in GT is interrupted.
6) ID switches (IDS): The total number of times that a tracked trajectory changes its matched GT identity.

Since multiobject tracking can be viewed as a method that recovers missed detections and removes false alarms from the raw detection responses, metrics for detection evaluation are also provided.

1) Recall: The number of correctly matched detections divided by the total number of detections in the ground truth.
2) Precision: The number of correctly matched detections divided by the number of output detections.
3) False alarms per frame (FAF): The number of false alarms per frame.

Higher values indicate better performance for MT, recall, and precision; lower values indicate better performance for PT, ML, Frag, IDS, and FAF.

    B. Performance

In the first experiment, we test how the parameter value affects the tracking performance on the TRECVID, ETH, and CAVIAR datasets. We only use the tracking metrics MT, PT, ML, Frag, and IDS to evaluate this experiment, since the human detection metrics, such as Recall, Precision, and FAF, cannot reflect the changes of tracking performance directly. The results are shown in Table II, with the best results in bold face. With the increase of the parameter $\beta$, the tracking performance goes up. When $\beta = 100$, we obtain the best performance on the three datasets. If the parameter increases further, such as $\beta = 150$, the performance on the three datasets does not increase, and even decreases. This is because the parameter $\beta$ affects the search range of the tracklet association. With the increase of $\beta$, the value of the exponential distribution increases faster and faster. If $\beta$ is too large, a small distance will lead to a very large value of $\beta^{d(L_m, L_n)}$. This large value will dominate the appearance features and decrease the tracking performance. From this experiment, $\beta = 100$ is appropriate, and we use $\beta = 100$ in the subsequent experiments. For the other parameters, $\theta_1 = 0.1$ in Algorithm 1, and $\theta_2 = 0.3$ and $\tau = 5$ in Section V-A.

In the second experiment, our method is compared with state-of-the-art methods based on the complete metrics on three challenging datasets: TRECVID, CAVIAR, and ETH. We show the test results in Tables III–V. From the three tables, the metrics recall, precision, and FAF are very close to those of previous methods. These metrics indicate the misses and false alarms of the detections.


    TABLE III

    Comparison of Tracking Results on TRECVID Dataset

    TABLE IV

    Comparison of Tracking Results on CAVIAR Dataset. *The Number of Frag and IDS Followed the Metrics in [47]

    TABLE V

    Comparison of Tracking Results on ETH Dataset

Even though some of them outperform previous methods, the enhancement is very limited. The first reason is that we use the same detection results as previous methods. The second reason is that our tracklet building method is similar to previous approaches, which are based on the method in [1]. Furthermore, we do not propose new methods to recover detections in this paper. Our method focuses on the performance improvement of tracklet association for the tracking metrics based on the spatial–temporal appearance model. Therefore, if the performance of human detectors or the methods of detection recovery cannot be improved, the enhancement of human detection performance is limited when relying only on the refinement of tracking algorithms. The other metrics are analyzed as follows.

Table III shows the tracking results on the TRECVID dataset. Our performance outperforms the previous methods: the MT is higher, and the PT and ML are lower than those of previous methods. For the metrics Frag and IDS, our method produces far fewer than previous methods. In particular, though the methods of [25] and [26] had already reduced Frag and IDS considerably, our method still outperforms them. For the CAVIAR and ETH datasets, though the enhancement of performance is not as large as on the TRECVID dataset, we also obtain better performance than previous methods.

For the CAVIAR dataset in Table IV, the performance enhancements of our approach are in MT, PT, and IDS compared with previous methods. The Frag metric is very close to that of [26] and outperforms the other state-of-the-art methods. Some fragments (Frag) and ID switches (IDS) are corrected by our method, so we reduce the numbers of Frag and IDS. The reduction of Frag and IDS causes some partially tracked (PT) trajectories to become mostly tracked (MT) trajectories. For this reason, the MT metric is higher than that of previous methods, and the PT is lower.

Finally, the performance on the ETH dataset is shown in Table V. From the results, the performance enhancement of our method is limited. Due to the view of the forward-looking moving cameras, many people are difficult to detect, such as persons who appear too small to yield stable detection responses. For this reason, almost 40% of the trajectories are partially tracked (PT) or mostly lost (ML) in both our method and [25]. Without detection responses, it is difficult to track these PT trajectories to form mostly tracked (MT) trajectories. Nevertheless, on the ETH dataset, the performance of our approach is still superior to the method of [25], with higher MT and lower PT and Frag.

From the three tables, we have analyzed the enhancement of tracking performance in terms of the metrics. The essential reason for the enhancement, however, is the spatial–temporal appearance model. Previous methods usually extract low-level image features, such as color histograms and texture, to represent appearance. These features may not be reliable under partial occlusions, illumination changes, and human pose variation.


    Fig. 4. Snapshots of TRECVID dataset. (a) Frame 879. (b) Frame 906. (c) Frame 918. (d) Frame 951. (e) Frame 973.

    Fig. 5. Snapshots of CAVIAR dataset. (a) Frame 767. (b) Frame 854. (c) Frame 903. (d) Frame 958. (e) Frame 1049.

    Fig. 6. Snapshots of ETH dataset. (a) Frame 88. (b) Frame 97. (c) Frame 124. (d) Frame 130. (e) Frame 142.

    TABLE VI

    Performance on Terrace Dataset

These features are unreliable because they describe only the appearance within a single frame and do not exploit the appearance information of sequential frames to refine the feature. Our spatial–temporal appearance model solves this problem: it is built on a statistical distribution and includes dynamic spatial and temporal information of appearance. The dynamic spatial information is the dynamic number and layout of the appearance subregions; the temporal information is the duration of each subregion, which means that each subregion spans several frames. We use GMMs to implement this spatial–temporal appearance model, with each Gaussian distribution representing one subregion. Therefore, our appearance model not only provides stable appearance features based on a statistical distribution, but also provides the spatial and temporal information of appearance. Based on this model, we obtain better performance than state-of-the-art methods. Figs. 4–6 show some snapshots of our tracking results; colored arrows mark the occlusions of targets and the association results.
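As a rough illustration of this model (not the exact implementation: the paper uses online EM and its own feature design, whereas this sketch uses batch EM from scikit-learn, a simple [x, y, t, R, G, B] pixel feature, and a Monte Carlo JSD estimate, with all names chosen by us), a tracklet's appearance can be summarized by a GMM fitted to samples pooled over its frames, and two tracklets can then be compared by their JSD, which has no closed form for GMMs:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_appearance_gmm(samples, n_components=5, seed=0):
        # samples: (N, 6) array of [x, y, t, R, G, B] pixel features pooled over
        # all frames of one tracklet; each Gaussian plays the role of one
        # spatial-temporal subregion with its own duration in t.
        return GaussianMixture(n_components=n_components,
                               covariance_type='diag',
                               random_state=seed).fit(samples)

    def jsd_monte_carlo(gmm_p, gmm_q, n_samples=2000):
        # Monte Carlo estimate of the Jensen-Shannon divergence between the
        # two fitted mixtures (the JSD of two GMMs has no closed form).
        xp, _ = gmm_p.sample(n_samples)
        xq, _ = gmm_q.sample(n_samples)

        def kl_to_mixture(x, gmm_a, gmm_b):
            log_a = gmm_a.score_samples(x)
            log_b = gmm_b.score_samples(x)
            log_m = np.logaddexp(log_a, log_b) - np.log(2.0)  # log density of M = (A + B) / 2
            return np.mean(log_a - log_m)                      # KL(A || M) estimated under A

        return 0.5 * (kl_to_mixture(xp, gmm_p, gmm_q) + kl_to_mixture(xq, gmm_q, gmm_p))

    # Usage with random stand-in data for two tracklets:
    rng = np.random.RandomState(0)
    model_a = fit_appearance_gmm(rng.rand(500, 6))
    model_b = fit_appearance_gmm(rng.rand(500, 6) + 0.5)
    print(jsd_monte_carlo(model_a, model_b))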

In the third experiment, we test the robustness of our approach on the Terrace sequence. The objects appear for hundreds or even thousands of frames and encounter several or even dozens of occlusions. The appearance of each object also varies more frequently, so the tracklet association becomes more complicated. Since our approach is a monocular tracking method that uses the videos of camera zero, the results cannot be compared with [42] and [43], which used multicamera video sequences. Our monocular results and snapshots are shown in Table VI and Fig. 7.

From the results, our method can track most of the trajectories correctly using a monocular camera. Some fragments and ID switches still exist, especially when the scene is crowded, with up to nine persons present at the same time. In such cases the human detector often misses detection responses and even gives false positive responses, so some reliable tracklets cannot be built successfully by the conservative method.
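For context, the conservative tracklet building mentioned above, which follows [1], links two detections in adjacent frames only when their affinity is both high and unambiguously better than all competing pairs; the sketch below is our own simplified rendering of that dual-threshold idea, with illustrative thresholds rather than the values used in the paper:

    # Simplified dual-threshold (conservative) linking: detection i in frame t is
    # linked to detection j in frame t+1 only if their affinity is high enough and
    # clearly exceeds every competing affinity in the same row and column.
    def conservative_links(affinity, high=0.8, margin=0.2):
        links = []
        n_rows = len(affinity)
        n_cols = len(affinity[0]) if n_rows else 0
        for i in range(n_rows):
            for j in range(n_cols):
                a = affinity[i][j]
                competitors = [affinity[i][k] for k in range(n_cols) if k != j] + \
                              [affinity[k][j] for k in range(n_rows) if k != i]
                best_other = max(competitors, default=0.0)
                if a >= high and a - best_other >= margin:
                    links.append((i, j))
        return links

    # Two detections per frame; only the unambiguous pair (0, 0) is linked,
    # while the ambiguous second pair is left for later tracklet association.
    print(conservative_links([[0.9, 0.3], [0.4, 0.55]]))  # -> [(0, 0)]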

In the fourth experiment, the computational cost of our approach is evaluated on the four datasets using an Intel Core i7 quad-core 2.0 GHz CPU with 4 GB of memory. The result is shown in Table VII. It covers only the cost of tracking and does not include the cost of human detection. The results list the total number of tracklets we build, the number of final tracks, and the average number of frames processed per second.


Fig. 7. Snapshots of EPFL Terrace dataset. (a) Frame 1590. (b) Frame 1629. (c) Frame 1663. (d) Frame 1693. (e) Frame 1716. (f) Frame 1760. (g) Frame 1854.

    TABLE VII

    Computational Cost of Our Approach on Each Dataset

For the CAVIAR and Terrace datasets, the frame rate is acceptable. For the TRECVID and ETH datasets, the frame rate is low. The main reason is that the frame resolution of TRECVID and ETH is four times that of CAVIAR and Terrace; in addition, these sequences contain more targets, so more data must be processed.
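The reported frame rates can be reproduced with a straightforward timing loop like the following generic sketch (tracker_step is a stand-in for the tracking routine and is not part of the paper's code):

    import time

    def average_fps(frames, tracker_step):
        # Time only the tracking computation (human detection is excluded) and
        # report the number of frames processed per second of wall-clock time.
        start = time.perf_counter()
        for frame in frames:
            tracker_step(frame)
        elapsed = time.perf_counter() - start
        return len(frames) / elapsed if elapsed > 0 else float('inf')

    # Example with a dummy per-frame tracking step:
    print(average_fps(list(range(100)), lambda frame: sum(i * i for i in range(1000))))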

    VII. Conclusion

In this paper, we propose a novel appearance representation called the spatial–temporal appearance model, which is based on the distribution of a GMM. We use this model to represent the appearance of a tracklet as a whole, with dynamic subregions and a dynamic duration for each subregion. Furthermore, since the spatial–temporal appearance model is built on the statistical distribution of the GMM, it mitigates the effects of illumination, pose variation, and image noise and yields stable appearance features. Finally, we associate tracklets using Bayesian prediction and JSD to obtain the final tracking results. Our approach is tested on four challenging datasets, and the experimental results show that it achieves good performance.

    References

[1] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses," in Proc. Eur. Conf. Comput. Vision, Oct. 2008, pp. 788–801.
[2] J. Xing, H. Ai, and S. Lao, "Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 1200–1207.
[3] B. Yang, C. Huang, and R. Nevatia, "Learning affinities and dependencies for multi-target tracking using a CRF model," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1233–1240.
[4] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 798–805.
[5] H. Wang, D. Suter, K. Schindler, and C. Shen, "Adaptive object tracking based on an effective appearance filter," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1661–1667, Sep. 2007.
[6] D. Reid, "An algorithm for tracking multiple targets," IEEE Trans. Autom. Control, vol. 24, no. 6, pp. 843–854, Dec. 1979.
[7] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe, "Sonar tracking of multiple targets using joint probabilistic data association," IEEE J. Ocean. Eng., vol. 8, no. 3, pp. 173–184, Jul. 1983.
[8] M. Isard and A. Blake, "Condensation-conditional density propagation for visual tracking," Int. J. Comput. Vision, vol. 29, no. 1, pp. 5–28, 1998.
[9] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe, "A boosted particle filter: Multitarget detection and tracking," in Proc. Eur. Conf. Comput. Vision, 2004, pp. 28–39.
[10] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, "Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2007, pp. 1–8.
[11] M. D. Breitenstein, F. Reichlin, B. Leibe, E. K. Meier, and L. V. Gool, "Robust tracking-by-detection using a detector confidence particle filter," in Proc. IEEE Int. Conf. Comput. Vision, Sep. 2009, pp. 1515–1522.
[12] M. Yang, F. Lv, W. Xu, and Y. Gong, "Detection driven adaptive multi-cue integration for multiple human tracking," in Proc. IEEE Int. Conf. Comput. Vision, Sep. 2009, pp. 1554–1561.
[13] C. Shan, T. Tan, and Y. Wei, "Real-time hand tracking using a mean shift embedded particle filter," Pattern Recognit., vol. 40, no. 7, pp. 1958–1970, 2007.
[14] Y. Cai, N. de Freitas, and J. J. Little, "Robust visual tracking for multiple targets," in Proc. Eur. Conf. Comput. Vision, 2006, pp. 107–118.
[15] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[16] Z. H. Khan, I. Y. Gu, and A. G. Backhouse, "Robust visual object tracking using multi-mode anisotropic mean shift and particle filters," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 1, pp. 74–87, Jan. 2011.


[17] J. F. Henriques, R. Caseiro, and J. Batista, "Globally optimal solution to multi-object tracking with merged measurements," in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 2470–2477.
[18] Z. Wu, T. H. Kunz, and M. Betke, "Efficient track linking methods for track graphs using network-flow and set-cover techniques," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1185–1192.
[19] W. Brendel, M. Amer, and S. Todorovic, "Multiobject tracking as maximum weight independent set," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1273–1280.
[20] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[21] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, "Part-based multiple-person tracking with partial occlusion handling," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1815–1821.
[22] H. Izadinia, I. Saleemi, W. Li, and M. Shah, "(MP)2T: Multiple people multiple parts tracker," in Proc. Eur. Conf. Comput. Vision, Oct. 2012, pp. 100–114.
[23] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: Hybrid-boosted multi-target tracker for crowded scene," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 2953–2960.
[24] C. Kuo, C. Huang, and R. Nevatia, "Multi-target tracking by on-line learned discriminative appearance models," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2010, pp. 685–692.
[25] C. Kuo and R. Nevatia, "How does person identity recognition help multi-person tracking?" in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 1217–1224.
[26] B. Yang and R. Nevatia, "Multi-target tracking by online learning of non-linear motion patterns and robust appearance models," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2012, pp. 1918–1925.
[27] R. E. Kalman, "A new approach to linear filtering and prediction problems," Trans. ASME J. Basic Eng., vol. 82, no. 1, pp. 35–45, 1960.
[28] S. Kwak, W. Nam, B. Han, and J. H. Han, "Learning occlusion with likelihoods for visual tracking," in Proc. IEEE Int. Conf. Comput. Vision, Nov. 2011, pp. 1551–1558.
[29] J. Fan, X. Shen, and Y. Wu, "Scribble tracker: A matting-based approach for robust tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1633–1644, Aug. 2012.
[30] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[31] J. Fan, Y. Wu, and S. Dai, "Discriminative spatial attention for robust tracking," in Proc. Eur. Conf. Comput. Vision, Sep. 2010, pp. 480–493.
[32] S. T. Birchfield and S. Rangarajan, "Spatiograms versus histograms for region-based tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, Jun. 2005, pp. 1158–1163.
[33] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inf. Theory, vol. 37, no. 1, pp. 145–151, Jan. 1991.
[34] B. Yang and R. Nevatia, "Online learned discriminative part-based appearance models for multi-human tracking," in Proc. Eur. Conf. Comput. Vision, Oct. 2012, pp. 484–498.
[35] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logistics Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
[36] S. Kullback and R. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[37] A. Ulges, C. Lampert, D. Keysers, and T. Breuel, "Spatiogram based shot distances for video retrieval," in Proc. Online Text Retrieval Conf. Video Retrieval Eval., 2006, pp. 1–10.
[38] Y. Shen, Z. Miao, and J. Zhang, "Unsupervised online learning trajectory analysis based on weighted directed graph," in Proc. IEEE Int. Conf. Pattern Recognit., Nov. 2012, pp. 1306–1309.
[39] National Institute of Standards and Technology, TRECVID 2008 Evaluation for Surveillance Event Detection [Online]. Available: http://www.nist.gov/speech/tests/trecvid/2008/
[40] CAVIAR Dataset, EC Funded CAVIAR Project/IST 2001 37540, Jan. 2004 [Online]. Available: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
[41] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, "A mobile vision system for robust multi-person tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[42] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multi-camera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 267–282, Feb. 2008.
[43] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1806–1819, Sep. 2011.
[44] Q. Zhu, S. Avidan, M. C. Yeh, and K. T. Cheng, "Fast human detection using a cascade of histograms of oriented gradients," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 1491–1498.
[45] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2009, pp. 32–39.
[46] C. Huang and R. Nevatia, "High performance object detection by collaborative learning of joint ranking of granules features," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2010, pp. 41–48.
[47] B. Wu and R. Nevatia, "Tracking of multiple, partially occluded humans based on static body part detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 951–959.

Yuan Shen received the B.E. and M.E. degrees in 2008 and 2010, respectively, from Beijing Jiaotong University, Beijing, China, where he is currently working toward the Ph.D. degree.

His research interests include pattern recognition, machine learning, video surveillance, multiobject tracking, and trajectory analysis.

Zhenjiang Miao (M'11) received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, in 1990 and 1994, respectively.

From 1995 to 1998, he was a Post-Doctoral Fellow with the École Nationale Supérieure d'Électrotechnique, d'Électronique, d'Informatique, d'Hydraulique et des Télécommunications, Institut National Polytechnique de Toulouse, Toulouse, France, and a Researcher with the Institut National de la Recherche Agronomique, Sophia Antipolis, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, and Nortel Networks, Ottawa, ON, Canada. He joined Beijing Jiaotong University, Beijing, in 2004, where he is currently a Professor. His research interests include image and video processing, multimedia processing, and intelligent human–machine interaction.