Visual Object Tracking in Challenging Situations Using a Bayesian Perspective
TRANSCRIPT
-
8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective
1/162
Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros de Telecomunicación
Visual Object Tracking in Challenging
Situations using a Bayesian Perspective
Seguimiento visual de objetos en situaciones
complejas mediante un enfoque bayesiano
Ph.D. Thesis
Tesis Doctoral
Carlos Roberto del Blanco Adán
Ingeniero de Telecomunicación
2010
Departamento de Señales, Sistemas y Radiocomunicaciones
Escuela Técnica Superior de Ingenieros de Telecomunicación
Visual Object Tracking in Challenging
Situations using a Bayesian Perspective
Seguimiento visual de objetos en situaciones complejas mediante un enfoque bayesiano
Ph.D. Thesis
Tesis Doctoral
Author:
Carlos Roberto del Blanco Adán
Ingeniero de Telecomunicación
Universidad Politécnica de Madrid
Supervisor:
Fernando Jaureguizar Núñez
Doctor Ingeniero de Telecomunicación
Profesor Titular, Dpto. de Señales, Sistemas y Radiocomunicaciones
Universidad Politécnica de Madrid
2010
DOCTORAL THESIS
Visual Object Tracking in Challenging Situations using a Bayesian Perspective
Seguimiento visual de objetos en situaciones complejas mediante un enfoque bayesiano
Author: Carlos Roberto del Blanco Adán
Supervisor: Fernando Jaureguizar Núñez
Examination committee appointed by the Rector of the Universidad Politécnica de Madrid, on the . . . . of . . . . . . . . . . . . . . . . 2010.
President: D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Member: D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Member: D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Member: D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Secretary: D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The defense and reading of the thesis took place on the . . . . of . . . . . . . . . . . . . . . . 2010 in . . . . . . . . . . . . . . . .
Grade: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
THE PRESIDENT          THE MEMBERS          THE SECRETARY
To Vanessa, to my parents, to my siblings.
Acknowledgements
I would like to thank the many people who have shared my journey along the path of this thesis. I will begin with my wife Vanessa, who has given me so much support and who, on more than one occasion, has had to put up with the most irascible version of myself. I will continue with my parents and siblings, who have asked me so many times whether I still had much left to finish the thesis. Next come all the members and visitors of the GTI, with a special mention for Fernando, Narciso and Luis, who have suffered the ravages of my papers written in that so very Spanish English. Not to forget the effort and migraines that reading this thesis has cost Fernando, my advisor, thanks to which he must now hate Mr. Bayes. I would certainly like to write a few lines about each of my GTI colleagues, with whom I have shared so much caffeine and who have made my working life so pleasant. However, that would mean writing, rather than a thesis volume, the entire Salvat encyclopedia. I am therefore obliged to do something more alternative, and of dubious usefulness: a table with the ages of the GTI, in which all my colleagues can be seen (or at least that has been my intention) along the temporal journey of my thesis.

This work has been partially supported by the Ministerio de Ciencia e Innovación of the Spanish government by means of a Formación del Personal Investigador fellowship and the projects TIN2004-07860 (Medusa) and TEC2007-67764 (SmartVision).
Era           Period          Specimens                                      Origin
Proterozoic   -               Narciso, Fernando, Luis,                       Spain
                              Francisco, Julian, Nacho
Archaeozoic   -               Marcos N., Juan Carlos, Carlos R.,             Spain, USA
                              Marcos A., Usoa, Shagniq
Paleozoic     Cambrian        Carlos C., Daniel A., Sharko                   Spain, Macedonia
              Ordovician      Raul, Jon, Angel                               Spain
              Silurian        Irena, Kristina, Binu                          Macedonia, India
              Devonian        Nerea, Pieter                                  Spain, Belgium
              Carboniferous   Pablo, Victor, Gian Luca                       Spain, Italy
              Permian         Hui, Xioadan, Yi, Yang                         China
Mesozoic      Triassic        Shankar, Ravi, Gogo, Antonio                   India, Macedonia, Brazil
              Jurassic        Filippo, Maykel, Esther                        Italy, Cuba, Spain
              Cretaceous      Sasho, Cesar                                   Macedonia, Spain
Cenozoic      Paleocene       Daniel B., Claire                              Spain, France
              Eocene          Lihui, Yu, Ivana                               China, Macedonia
              Oligocene       Toni, Richard, Carlos G.                       Spain, Peru
              Miocene         Manuel, Massimo, Jesus                         Spain, Italy
              Pliocene        Rafa, Sergio                                   Spain
              Pleistocene     Su, Wenjia, Xiang, Iviza                       China, Macedonia
              Holocene        Samira, Abel, Pratik, Srimanta                 Iran, Spain, India
Abstract
The increasing availability of powerful computers and high-quality video cameras has allowed the proliferation of video-based systems, which perform tasks such as vehicle navigation, traffic monitoring, surveillance, etc. A
fundamental component in these systems is the visual tracking of objects
of interest, whose main goal is to estimate the object trajectories in a video
sequence. For this purpose, two different kinds of information are used:
detections obtained by the analysis of video streams and prior knowledge
about the object dynamics. However, this information is usually corrupted
by the sensor noise, the varying object appearance, illumination changes,
cluttered backgrounds, object interactions, and the camera ego-motion.
While there exist reliable algorithms for tracking a single object in constrained scenarios, object tracking is still a challenge in uncontrolled situations involving multiple interacting objects, heavily-cluttered scenarios, moving cameras, and complex object dynamics. In this dissertation, the aim has been to develop efficient tracking solutions for two complex tracking situations. The first one consists of tracking a single object in
heavily-cluttered scenarios with a moving camera. To address this situa-
tion, an advanced Bayesian framework has been designed that jointly models
the object and camera dynamics. As a result, it can predict satisfactorily
the evolution of a tracked object in situations with high uncertainty about
the object location. In addition, the algorithm is robust to the background
clutter, avoiding tracking failures due to the presence of similar objects.
The other tracking situation focuses on the interactions of multiple objects
with a static camera. To tackle this problem, a novel Bayesian model has
been developed, which manages complex object interactions by means of
an advanced object dynamic model that is sensitive to object interactions.
This is achieved by inferring the occlusion events, which in turn trigger
different choices of object motion. The tracking algorithm can also handle
false and missing detections through a probabilistic data association stage.
Excellent results have been obtained using publicly available databases,
proving the efficiency of the developed Bayesian tracking models.
Resumen
The growing availability of powerful computers and high-quality cameras has enabled the proliferation of video-based systems for vehicle navigation, traffic monitoring, video surveillance, etc. An essential component of these systems is object tracking, whose main goal is the estimation of object trajectories in video sequences. To that end, two kinds of information are used: the detections obtained from the video analysis and the prior knowledge of the object dynamics. However, this information is usually corrupted by sensor noise, variations in the appearance of the objects, illumination changes, heavily cluttered scenes, and camera motion.

While reliable algorithms exist for tracking a single object in controlled scenarios, tracking is still a challenge in unconstrained situations characterized by multiple interacting objects, heavily cluttered scenarios, and moving cameras. In this thesis, the goal has been the development of efficient tracking algorithms for two especially complicated situations. The first consists of tracking a single object in heavily cluttered scenes with a moving camera. To handle this situation, a sophisticated Bayesian framework has been designed that jointly models the dynamics of the camera and the object. This makes it possible to satisfactorily predict the evolution of the object position in situations of great uncertainty. In addition, the algorithm is robust to cluttered backgrounds, avoiding failures caused by the presence of similar objects.

The other situation considered focuses on the interactions of objects observed with a static camera. To that end, a novel Bayesian model has been developed that manages the interactions by means of an advanced dynamic model. This model is based on the inference of occlusions between objects, which in turn give rise to different types of object motion. The algorithm is also able to handle missing and false detections through a probabilistic data association stage.

Excellent results have been obtained on several databases, which proves the efficiency of the developed Bayesian tracking models.
Contents
List of Figures xvii
List of Tables xix
1 Introduction 1
2 Bayesian models for object tracking 5
2.1 Tracking with moving cameras . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Tracking of multiple interacting objects . . . . . . . . . . . . . . . . . . 12
3 Bayesian Tracking with Moving Cameras 17
3.1 Optimal Bayesian estimation for object tracking . . . . . . . . . . . . . 17
3.1.1 Particle filter approximation . . . . . . . . . . . . . . . . . . . . . 20
3.2 Bayesian tracking framework for moving cameras . . . . . . . . . . . . . 21
3.3 Object tracking in aerial infrared imagery . . . . . . . . . . . . . . . . . 24
3.3.1 Particle filter approximation . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2.1 Strong ego-motion situation . . . . . . . . . . . . . . . . 36
3.3.2.2 High uncertainty ego-motion situation . . . . . . . . . . 39
3.3.2.3 Global tracking results . . . . . . . . . . . . . . . . . . 43
3.4 Object tracking in aerial and terrestrial visible imagery . . . . . . . . . 46
3.4.1 Particle Filter approximation . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Bayesian tracking of multiple interacting objects 65
4.1 Description of the multiple object tracking problem . . . . . . . . . . . . 65
4.2 Bayesian tracking model for multiple interacting objects . . . . . . . . . 69
4.2.1 Transition pdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.2 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Approximate inference based on Rao-Blackwellized particle filtering . . 85
4.3.1 Kalman filtering of the object state . . . . . . . . . . . . . . . . . 91
4.3.2 Particle filtering of the data association and object occlusion . . 94
4.4 Object detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.5.1 Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5.2 Quantitative results . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5 Conclusions and future work 125
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6 Appendix 131
6.1 Conditional independence and d-separation . . . . . . . . . . . . . . . . 131
References 133
List of Figures
3.1 Graphical model for the Bayesian object tracking. . . . . . . . . . . . . 18
3.2 Consecutive frames of an aerial infrared sequence . . . . . . . . . . . . . 25
3.3 Multimodal LoG filter response . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Likelihood distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Initial translational transformations . . . . . . . . . . . . . . . . . . . . 30
3.6 Probability values for the ego-motion hypothesis . . . . . . . . . . . . . 30
3.7 Metropolis-Hastings sampling of the likelihood distribution . . . . . . . 32
3.8 Particle approximation of the posterior pdf . . . . . . . . . . . . . . . . 33
3.9 SIR resampling of the posterior pdf . . . . . . . . . . . . . . . . . . . . . 33
3.10 Kernel density estimation and state estimation . . . . . . . . . . . . . . 35
3.11 Object tracking result . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.12 Intermediate results for a situation of strong ego-motion . . . . . . . . . 37
3.13 Tracking results for the BEH algorithm under strong ego-motion . . . . 38
3.14 Tracking results for the DEH algorithm under strong ego-motion . . . . 39
3.15 Tracking results for the NEH algorithm under strong ego-motion . . . . 40
3.16 Intermediate results for a situation greatly affected by the aperture problem 41
3.17 Tracking results for the BEH algorithm in a situation greatly affected by
the aperture problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.18 Tracking results for the DEH algorithm in a situation greatly affected by
the aperture problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.19 Tracking results for the NEH algorithm in a situation greatly affected by
the aperture problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.20 Example of the similarity measurement between image regions . . . . . 50
3.21 Example of feature correspondence . . . . . . . . . . . . . . . . . . . . . 53
3.22 Representation of the affine transformation hypothesis . . . . . . . . . . 55
3.23 Samples of the object position . . . . . . . . . . . . . . . . . . . . . . . . 56
3.24 Samples of ellipsis enclosing the object . . . . . . . . . . . . . . . . . . . 56
3.25 Weighted sampled representation of the posterior pdf . . . . . . . . . . . 57
3.26 Tracking results with a camera mounted on a car . . . . . . . . . . . . . 59
3.27 Tracking results with a camera mounted on a helicopter . . . . . . . . . 61
4.1 Set of detections yielded by multiple detectors . . . . . . . . . . . . . . . 67
4.2 Data association between detections and objects . . . . . . . . . . . . . 68
4.3 Object dynamic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Graphical model for multiple object tracking . . . . . . . . . . . . . . . 71
4.5 Graphical model for the initial time step . . . . . . . . . . . . . . . . . . 74
4.6 Restrictions imposed on the associations between detections and objects. 79
4.7 Restrictions imposed on the occlusions among objects. . . . . . . . . . . 80
4.8 Color histograms of two object categories . . . . . . . . . . . . . . . . . 100
4.9 Similarity maps of the color histograms . . . . . . . . . . . . . . . . . . 102
4.10 Computed detections from the red dressed team. . . . . . . . . . . . . . 104
4.11 Computed detections from the black and white dressed team. . . . . . . 105
4.12 Tracking results for a simple object cross . . . . . . . . . . . . . . . . . . 110
4.13 Marginalization of the posterior pdf over one specific object . . . . . . . 111
4.14 Marginalization of the posterior pdf over one specific object . . . . . . . 112
4.15 Tracking results for a complex object cross . . . . . . . . . . . . . . . . 113
4.16 Marginalization of the posterior pdf over one specific object . . . . . . . 114
4.17 Marginalization of the posterior pdf over one specific object . . . . . . . 115
4.18 Marginalization of the posterior pdf over one specific object . . . . . . . 116
4.19 Tracking results for an overtaking action . . . . . . . . . . . . . . . . . . 117
4.20 Marginalization of the posterior pdf over one specific object . . . . . . . 118
4.21 Marginalization of the posterior pdf over one specific object . . . . . . . 119
4.22 Marginalization of the posterior pdf over one specific object . . . . . . . 120
6.1 Concepts of d-separation and descendants . . . . . . . . . . . . . . . . . 132
List of Tables
2.1 Tracking problems related to the data association . . . . . . . . . . . . . 13
3.1 Quantitative results for object tracking with a moving camera in infrared
imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Quantitative results for object tracking with a moving camera in visible
imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 Quantitative results for interacting objects 1/2 . . . . . . . . . . . . . . 122
4.2 Quantitative results for interacting objects 2/2 . . . . . . . . . . . . . . 123
Chapter 1
Introduction
The evolution and spread of technology has allowed the proliferation of video-based systems, which make use of powerful computers and high-quality video cameras to automatically perform increasingly demanding tasks such as vehicle navigation, traffic monitoring, human-computer interaction, motion-based recognition, security and surveillance, etc. Visual object tracking is a fundamental part of all of
the previous tasks, and also in the field of computer vision in general. This fact has
motivated a great deal of interest in object tracking algorithms. The ultimate goal of
tracking algorithms is to estimate the object trajectories in a video sequence. For this
purpose, two different kinds of information are used: the video streams acquired by
the camera sensor and the prior knowledge about the tracked objects and the envi-
ronment. The video-stream based information is used to compute object detections in
each frame, also known as observations or measurements. The detection process uses
the most distinctive appearance features, such as color, gradient, texture, and shape,
to minimize the probability of false detections and at the same time to maximize the
detection probability. However, the object appearance can undergo significant varia-
tions that cause noisy detections and even missing detections, i.e. tracked objects that
have not been detected. The appearance variations can be produced by articulated or
deformable objects, illumination changes due to weather conditions (typical in outdoor
applications), and variations in the camera point of view. Object interactions, such as
partial and total occlusions, are another source of noisy and missing detections. On the other hand, scene structures similar to the objects of interest can cause false detections, thus obfuscating the tracking process. To alleviate these detection shortcomings,
the tracking also relies on the available prior information in order to constrain the tra-
jectory estimation problem. This kind of information is mainly the object dynamics,
which is used to predict the evolution of the object trajectories. The modeling of the
object dynamics can be a very difficult task, especially in situations in which objects
undergo complex interactions. On the other hand, object dynamic information is only
meaningful for static or quasi-static cameras, since, in the case of moving cameras, a
global motion is induced in the image, called ego-motion, that corrupts the trajectory
predictions. As a result, the camera dynamics must also be modeled, which makes the tracking more complex and increases the uncertainty in the trajectory estimation.

While there exist reliable algorithms for the tracking of a single object in constrained scenarios, object tracking is still a challenge in uncontrolled situations involving
multiple interacting objects, heavily-cluttered scenarios, moving cameras, objects with
varying appearance, and complex object dynamics. In this dissertation, the main aim
has been the development of efficient tracking solutions for two of these complex track-
ing situations. The first one consists in tracking a single object in heavily-cluttered
scenarios with a moving camera. For this purpose, an advanced Bayesian framework
has been designed that jointly models the object and camera dynamics. This allowsto predict satisfactorily the evolution of the tracked object in situations of high uncer-
tainty, in which several object locations are possible because of the combined dynamics
of the object and the camera. In addition, the algorithm is robust to the background
clutter, avoiding tracking failures due to the presence of similar objects to the tracked
one in the background. The inference of the tracking information in the proposed
Bayesian model cannot be performed analytically, i.e. there is not a closed form ex-
pression to directly compute the required tracking information. This situation arises
from the fact that the dynamic and observation processes involved in the Bayesian
tracking framework are non-linear and non-Gaussian. In order to deal with this prob-
lem, a suboptimal inference method has been derived that makes use of the particle
filtering technique to compute an accurate approximation of the object trajectory.
The other unrestricted tracking situation focuses on the interactions of multiple
objects with a static camera. To successfully tackle this problem, a novel recursive
Bayesian model has been developed to explicitly manage complex object interactions.
This is accomplished by an advanced object dynamic model that is sensitive to the
object interactions involving long-term occlusions of two or more objects. For this
purpose, the proposed Bayesian tracking model uses a random variable to predict the
occlusion events, which in turn triggers different choices of object motion. The track-
ing algorithm is also able to handle false and missing detections through a probabilistic
data association stage, which efficiently computes the correspondence between the un-
labeled detections and the tracked objects. Regarding the inference of the tracking
information in the proposed Bayesian model for interacting objects, two major issues
have been carefully addressed. The first one is the mathematical derivation of the
posterior distribution of the object tracking information, which has been a challenging
task due to the complexity of the tracking model. The second issue, closely related to the first one, arises from the fact that the derived mathematical expression for the posterior distribution does not have an analytical form due to the complex integrals involved.
This situation is caused by the non-linear and non-Gaussian character of the stochastic
processes involved in the Bayesian tracking model, i.e. the dynamic, observation and
occlusion processes. Subsequently, the inference has to be accomplished by means of
suboptimal methods, such as particle filtering. However, the high dimensionality of the tracking problem, proportional to the number of tracked objects and object detections, causes the accuracy of the approximate posterior distribution to be very poor. To overcome this drawback, a novel suboptimal inference method has been developed which combines the particle filtering technique with a variance reduction technique called Rao-Blackwellization. This allows an accurate approximation of the object trajectories to be obtained in high-dimensional state spaces involving multiple tracked objects.
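The idea behind this combination can be sketched for a toy single-object case: each particle samples only the low-dimensional discrete variable (here, an occlusion indicator), while the continuous object state is marginalized exactly with a per-particle Kalman filter. All the parameters below (constant-velocity dynamics, noise levels, occlusion prior, clutter likelihood) are illustrative assumptions, not the model developed in this thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

# Conditional on the sampled occlusion indicator the model is linear-Gaussian,
# so the continuous state is filtered exactly with a Kalman filter; only the
# discrete indicator is handled with particles (the Rao-Blackwellized split).
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity dynamics (assumed)
Q = 0.05 * np.eye(2)                     # process noise (assumed)
H = np.array([[1.0, 0.0]])               # position-only detections
R = np.array([[0.3]])                    # detection noise (assumed)
P_OCC = 0.2                              # assumed prior occlusion probability
CLUTTER = 0.05                           # assumed clutter/missed-detection likelihood

def kf_predict(m, P):
    return F @ m, F @ P @ F.T + Q

def kf_update(m, P, z):
    S = (H @ P @ H.T + R).item()         # innovation variance
    K = (P @ H.T) / S                    # Kalman gain, shape (2, 1)
    nu = z - (H @ m).item()              # innovation
    lik = np.exp(-0.5 * nu**2 / S) / np.sqrt(2 * np.pi * S)
    return m + (K * nu).ravel(), (np.eye(2) - K @ H) @ P, lik

def rbpf_step(means, covs, weights, z):
    """One Rao-Blackwellized step: sample occlusion, filter the state exactly."""
    for i in range(len(weights)):
        means[i], covs[i] = kf_predict(means[i], covs[i])
        occluded = rng.random() < P_OCC
        if z is None or occluded:
            weights[i] *= CLUTTER        # detection missing or explained as clutter
        else:
            means[i], covs[i], lik = kf_update(means[i], covs[i], z)
            weights[i] *= lik            # weight by the detection likelihood
    weights /= weights.sum()
    return means, covs, weights

N = 200
means = [np.zeros(2) for _ in range(N)]
covs = [np.eye(2) for _ in range(N)]
weights = np.full(N, 1.0 / N)
for z in [1.0, 2.1, None, 4.2]:          # None marks a frame with no detection
    means, covs, weights = rbpf_step(means, covs, weights, z)
estimate = sum(w * m[0] for w, m in zip(weights, means))
```

Because the Kalman recursion handles the continuous state analytically, the particles only have to cover the discrete occlusion hypotheses, which is the property that keeps the approximation accurate in higher-dimensional problems.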
The organization of the dissertation is as follows. In Chap. 2, a review of the state of the art in object tracking techniques for multiple objects is presented, placing special emphasis on Bayesian models, strategies for handling moving cameras,
and the management of multiple objects. The developed recursive Bayesian model
for tracking a single object in heavily-cluttered scenarios with a moving camera is
described in Chap. 3. At the end of the chapter, tracking results of the proposed Bayesian framework are presented for two kinds of applications, one involving aerial infrared imagery and another dealing with both terrestrial and aerial visible imagery.
In Chap. 4, the developed Bayesian tracking solution for multiple interacting objects is
presented, along with a test bench to evaluate the efficiency of the tracking in object
interactions. Lastly, conclusions and future lines of research are set out in Chap. 5.
Chapter 2
Bayesian models for object
tracking
Visual object tracking is a fundamental task in a wide range of military and civilian
applications, such as surveillance, security and defense, autonomous vehicle navigation,
robotics, behavior analysis, traffic monitoring and management, human-computer interface, video retrieval, and many more. Visual tracking can be defined as the problem of estimating the trajectories of a set of objects of interest in a video sequence
as they move around the scene. In a typical tracking application there are one or more
object detectors that generate a set of noisy measurements or detections in discrete
time instants. The uncertainty of the detection process arises from the noise of the camera sensor, changes in the scene illumination, variations in the appearance of the objects,
non-rigid and/or articulated objects, and loss of information caused by the projection
of the 3D world onto the 2D image plane. The tracking algorithm must be able to han-
dle the uncertainty in the detection process to assign consistent labels to the tracked
objects in each frame of a video sequence. This process can be simplified by imposing
certain constraints on the motion of the objects. For this purpose, a dynamic model can
be used to predict the motion of the objects, restricting in this way the spatio-temporal
evolution of the trajectories. Nonetheless, the dynamic model is only an approximation
of the underlying object dynamics, which indeed can be very complex. As a result, the
tracking algorithm has to manage different sources of information (detections and object dynamics), taking into account their respective uncertainties, to efficiently estimate the object trajectories.
Bayesian estimation is the most commonly used framework in visual tracking and
also in other contexts such as radar and sonar. This framework models in a probabilistic
way the tracking problem, and all its sources of uncertainty: sensor noise, inaccurate
dynamic models, environmental clutter, etc. From a Bayesian perspective, the aim is to
compute the posterior distribution over the object state, which is a vector containing
all the desired tracking information, such as position, velocity, etc. This posterior
distribution encodes all the necessary information to efficiently compute an estimation
of the object state. The computation of the posterior distribution is usually performed recursively via two cyclic stages: prediction and update. Thus, the computation is efficient, since only the previous estimation of the posterior distribution and the set of detections at the current time step are required. The prediction stage
evolves the posterior distribution at the previous time step according to the object
dynamics, obtaining as a result the predicted posterior distribution at the current time
step. The update stage makes use of the available detections at the current time step
to correct the predicted posterior distribution by means of the likelihood model of the
object detector.
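As an illustration of the two cyclic stages, the following sketch implements them for the best-known closed-form case, the linear-Gaussian Kalman filter. The constant-velocity model, the noise levels, and the detection sequence are arbitrary choices for the example, not values taken from this thesis:

```python
import numpy as np

def kalman_predict(mean, cov, F, Q):
    # Prediction stage: evolve the posterior with the dynamic model
    # x_t = F x_{t-1} + w,  w ~ N(0, Q)
    return F @ mean, F @ cov @ F.T + Q

def kalman_update(mean, cov, z, H, R):
    # Update stage: correct the prediction with the detection
    # z_t = H x_t + v,  v ~ N(0, R)
    S = H @ cov @ H.T + R                  # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)       # Kalman gain
    mean = mean + K @ (z - H @ mean)
    cov = (np.eye(len(mean)) - K @ H) @ cov
    return mean, cov

# Constant-velocity model in 1D: state = [position, velocity] (assumed values)
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])                 # only the position is detected
R = np.array([[0.25]])

mean, cov = np.zeros(2), np.eye(2)
for z in [0.9, 2.1, 2.9, 4.2]:             # noisy position detections
    mean, cov = kalman_predict(mean, cov, F, Q)
    mean, cov = kalman_update(mean, cov, np.array([z]), H, R)
```

Each iteration first propagates the Gaussian posterior with the dynamic model and then corrects it with the current detection, exactly the prediction-update cycle described above.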
In single object tracking with static cameras, the main difficulty arises from the fact that realistic models for the object dynamics and detection processes are often non-linear and non-Gaussian, which leads to a posterior distribution without a closed-form analytic expression. In fact, closed-form expressions exist only in a limited number of cases. The most well-known closed-form expression is the Kalman filter (1), which
is obtained when both the dynamic and likelihood models are linear and Gaussian.
Grid-based approaches (2) overcome the limitations imposed by the Kalman filter by restricting the state space to be discrete and finite. If any of the previous assumptions does not hold, the exact computation of the posterior distribution is not possible, and it becomes necessary to resort to approximate inference methods that compute an approximation of the posterior distribution. The extended Kalman filter (1) linearizes
models with weak non-linearities using the first term in a Taylor expansion, so that
the Kalman filter expression can still be applied. Nonetheless, the performance of the extended Kalman filter rapidly decreases as the non-linearities become more severe.
The unscented Kalman filter (3; 4) has proved to be more efficient in models that
are moderately non-linear. It recursively propagates a set of selected sigma points to
maintain the second order statistics of the posterior distribution. Both approximate
solutions, the extended and unscented Kalman filters, assume that the underlying posterior distribution is Gaussian. But if this assumption does not hold (e.g. the distribution is heavily skewed or multimodal), the accuracy of the estimation can be arbitrarily poor.
The Gaussian sum filter (5) was one of the first attempts to deal with non-Gaussian
models, approximating the posterior distribution by a mixture of Gaussians. The main
limitation of the Gaussian sum filter is that linear approximations are required, as in
the extended Kalman filter. Another limitation is the combinatorial growth of the
number of Gaussian components in the mixture over time. An alternative solution
for non-linear non-Gaussian models that does not need linearization is provided by approximate grid-based methods (2; 6). These methods approximate the continuous
state space by a finite and fixed grid, and then they apply numerical integration for
computing the posterior distribution. The grid must be sufficiently dense to compute
an accurate approximation of the posterior distribution. However, the computational
cost increases dramatically with the dimensionality of the state space and becomes
impractical for dimensions larger than four. An additional disadvantage of grid-based
methods is that the state space cannot be partitioned unevenly in order to improve
the resolution in regions of high probability density. All these shortcomings are overcome by the particle filtering technique (2; 7; 8), also known as the Sequential Monte Carlo method (9; 10; 11), the condensation algorithm (12; 13), or bootstrap filtering (14). It is a numerical integration technique that simulates the posterior distribution by a set of weighted samples, known as particles, that are recursively propagated over time. The samples are drawn from a proposal distribution, which is the key component of the algorithm, and evaluated by means of the dynamic and likelihood models. The particle filter has become very successful in a wide range of tracking applications due to its efficiency, flexibility, and ease of implementation. Moreover, its computational cost is theoretically independent of the dimension of the state space.
To sum up, the previous tracking approaches have proved to be efficient and reliable
solutions for single object tracking provided that:
they fulfill the assumptions of linearity/non-linearity and Gaussianity/non-Gaussianity for which they were conceived,
the cameras are static, i.e. with no motion, and
2. BAYESIAN MODELS FOR OBJECT TRACKING
there is always a unique detection for the tracked object, which only occurs in constrained scenarios where there is total control over the number and types of objects that compose the scene.
In all other situations, the tracking task is still a challenge, which is receiving a great
deal of research attention because of the wide range of potential applications that can
be developed. The main contribution of this dissertation is the development of efficient
and reliable algorithms for the tracking of objects in challenging situations. Specifically,
the research has been focused on two situations: the single object tracking with moving
cameras, and the multiple interacting object tracking with static cameras.
In the first situation, the moving camera induces a global motion in the scene,
called ego-motion, that corrupts the spatio-temporal continuity of the video sequence.
As a consequence, the object dynamic information is not useful anymore, since the
camera motion is not considered, and the tracking performance is seriously reduced. In
Sec. 2.1, a thorough review of the main techniques that address the ego-motion problem
for single object tracking with moving cameras is presented.
In the other considered situation, the tracking algorithm has to manage several
interacting objects in an environment with static cameras. The difficulty arises from
the fact that in each time step there is a set of unlabeled detections generated from
the detectors. This means that the correspondence between objects and detections
is not known, and therefore a data association stage is required. This fact violates
the assumption that there is always a unique detection per object, since potentially any detection can be associated with an object. In fact, the data association can be very complex, since the number of possible associations grows combinatorially with the number of objects and detections. Furthermore, there can be false detections and missing detections that further increase the complexity of the data association. False detections arise from the noise of the camera sensor and from scene clutter (structures in the background similar to the tracked object). On the other hand, object occlusions and strong variations in the object appearance can cause one or more objects to go undetected, the so-called missing detections. These phenomena
can also occur for single object tracking in unconstrained scenarios, in which there is
a unique object, but there can be none, one or multiple detections. For example, the
nearest neighbor Kalman filter (15) handles the data association problem by selecting
the closest detection to the predicted object trajectory, which is used to update the
posterior distribution of the object state. Unlike the previous method that only uses a
unique detection, the probabilistic data association filter (16; 17) updates the posterior
distribution utilizing all the detections that are close to the predicted object trajectory.
This is accomplished by averaging the innovation terms of the Kalman filter resulting
from the set of detections. This approach maintains the Gaussian character of the
posterior distribution. On the other hand, the data association in single object tracking can be considered a specific case of the data association in multiple object tracking, where the number of tracked objects in the scene is just one. There is a large body of scientific literature in the field of multiple object tracking, and recently there has been a revival of interest due to developments in particle filtering and recursive Bayesian models in general. In Sec. 2.2, the main multi-object tracking techniques are presented, focusing on the problem of data association.
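The innovation-averaging idea behind this family of filters can be sketched for a scalar state; the gate threshold and noise variances are assumed values, and the sketch omits the full JPDAF machinery (e.g. the probability of no detection):

```python
import math

# PDAF-style soft association sketch for a single object (scalar case).
# The gate size and noise variances are illustrative assumptions.

def soft_update(pred_mean, pred_var, detections, r=1.0, gate=9.0):
    """Update with all gated detections, weighting each by its likelihood."""
    s = pred_var + r                               # innovation variance
    gated = [z for z in detections if (z - pred_mean) ** 2 / s < gate]
    if not gated:
        return pred_mean                           # no detection: keep prediction
    # Likelihood of each gated detection under the predicted Gaussian.
    w = [math.exp(-0.5 * (z - pred_mean) ** 2 / s) for z in gated]
    total = sum(w)
    # Combined innovation: weighted average of the individual innovations.
    innov = sum(wi * (z - pred_mean) for wi, z in zip(w, gated)) / total
    k = pred_var / s                               # Kalman gain
    return pred_mean + k * innov

print(soft_update(0.0, 1.0, [0.2, -0.1, 5.0]))     # 5.0 falls outside the gate
```

The soft weighting lets nearby detections contribute in proportion to their plausibility, instead of committing to the single closest one.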
2.1 Tracking with moving cameras
In video based applications in which the video acquisition system is mounted on a
moving aerial platform (such as a plane, a helicopter, or an Unmanned Aerial Vehicle),
a mobile robot, a vehicle, etc., the acquired video sequences undergo a random global
motion, called ego-motion, that prevents the use of the object dynamic information to
restrict the object position in the scene. As a consequence, the tracking performance
can be dramatically reduced. The ego-motion problem has been addressed in different
manners in the scientific literature. They can be split into two categories: approaches
based on the assumption of low ego-motion, and those based on the ego-motion esti-
mation.
Approaches assuming low ego-motion consider that the motion component due to
the camera is not very significant in comparison with the object motion. In this context,
some works assume that the spatio-temporal connectivity of the object is preserved
along the sequence (18; 19; 20), i.e. the image regions associated with the tracked object
are spatially overlapped in consecutive frames. Then, the tracking is performed using
morphological connected operators. In cases where the previous assumption does not
hold, the most common approach is to search for the object in a bounded area centered at the location where the object is expected to be found, according to its dynamics.
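The bounded search-area strategy can be sketched with a toy one-dimensional frame and an SSD matching score (all signal values are illustrative); note how a larger radius would let the second, clutter-like structure compete with the true match:

```python
# Toy bounded-area search: find the object template in a 1-D "frame"
# by exhaustive SSD matching inside a window around the predicted position.

def bounded_search(frame, template, predicted, radius):
    """Return the offset in [predicted-radius, predicted+radius] with minimal SSD."""
    best, best_ssd = None, float("inf")
    lo = max(0, predicted - radius)
    hi = min(len(frame) - len(template), predicted + radius)
    for off in range(lo, hi + 1):
        ssd = sum((frame[off + k] - t) ** 2 for k, t in enumerate(template))
        if ssd < best_ssd:
            best, best_ssd = off, ssd
    return best

frame = [0, 0, 1, 3, 1, 0, 0, 1, 3, 1]     # two similar structures (clutter)
template = [1, 3, 1]
print(bounded_search(frame, template, predicted=2, radius=2))
```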
In (21; 22), an exhaustive search is performed in a fixed-size image region centered at the previous object location. In (23), the initial search location is estimated using a
Kalman filter, and then the search is performed deterministically using the Mean Shift
algorithm (24). Other authors (25; 26) propose a stochastic search based on particle
filtering, which is able to manage multiple initial locations for the search. However,
all these methods lose effectiveness as the displacement induced by the ego-motion increases. The reason is that the size of the search area must be enlarged to accommodate the expected camera ego-motion, which dramatically increases the probability that the tracking is distracted by false candidates.
The other category of approaches, based on ego-motion estimation, is able
to deal with strong ego-motion situations, in which the camera motion is at least as significant as the object motion, or even more so. They aim to compute the camera ego-motion between consecutive frames in order to compensate for it, and thus recover the spatio-temporal correlation of the video sequence. The camera ego-motion is modeled
by a geometric transformation, typically an affine or projective one, whose parameters
are estimated by means of an image registration technique. The existing works differ in the specific image registration technique used to compute the parameters of the geometric transformation. Extensive reviews of image registration techniques can be
found in (27; 28), where the first one covers all kinds of vision-based applications, while the second one is focused on aerial imagery. According to them, image registration techniques can be classified into those based on features and those based on area (i.e. image regions). Feature-based image registration techniques detect and
match distinctive image features between consecutive frames to estimate a geometric
transformation, which represents the camera ego-motion model. In (29), an object
detection and tracking system with a moving airborne platform is described, which
uses a feature based approach to estimate an affine camera model. In (30), the KLT
method (31) is used to infer a bilinear camera model in an application that detects
moving objects from a mobile robot. In the field of FLIR (Forward Looking InfraRed)
imagery, the works (32; 33; 34) describe a detection and tracking system for aerial targets, mounted on an airborne platform, that uses a robust statistics framework to match edge features in order to estimate an affine camera model. This system is able
to successfully handle situations in which the camera motion estimation is disturbed by
the presence of independent moving objects, provided that there is a minimum number
of detected features belonging to the background. In situations in which the detection of distinctive features is particularly complicated, because the acquired images are poorly textured and structured, an area-based image registration technique is used to estimate
the parameters of the camera model. In (35), a perspective camera model is computed
by means of an optical flow algorithm to detect moving objects in an application of
aerial visual surveillance. An optical flow algorithm is also used in (36) to estimate the
parameters of a pseudo perspective camera model, which is utilized to create panoramic
image mosaics. The same approach is followed in (37; 38) for a tracking application of terrestrial targets in airborne FLIR imagery. In (39; 40), a target detection framework is
presented for FLIR imagery that minimizes an SSD (Sum of Squared Differences) based error function to estimate an affine camera model. A similar framework of camera
motion compensation is used in (41) for tracking vehicles in aerial infrared imagery,
but utilizing a different minimization algorithm. In (42), the Inverse Compositional
Algorithm is used to obtain the parameters of an affine camera model for a tracking
application of vehicles in aerial imagery. Unlike the feature based image registration
techniques, the area-based techniques are not robust to the presence of independent moving objects, which can bias the ego-motion estimation. In addition, they require that the involved images be closely aligned to achieve satisfactory results.
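As a sketch of the parametric ego-motion model itself, an affine transformation can be recovered from three point correspondences, the minimal case, by solving two small linear systems; this hypothetical example omits the robust estimation and outlier rejection discussed above:

```python
# Sketch: recover an affine ego-motion model [x'; y'] = A [x; y] + t
# from three point correspondences (the minimal case, no robust step).

def solve3(M, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination."""
    A = [row[:] + [bi] for row, bi in zip(M, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        for r in range(3):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * c for a, c in zip(A[r], A[i])]
    return [A[i][3] / A[i][i] for i in range(3)]

def affine_from_points(src, dst):
    """src, dst: lists of three (x, y) pairs; returns (a11, a12, tx, a21, a22, ty)."""
    M = [[x, y, 1.0] for x, y in src]
    row1 = solve3(M, [x for x, _ in dst])   # parameters mapping to x'
    row2 = solve3(M, [y for _, y in dst])   # parameters mapping to y'
    return row1 + row2

# Synthetic check: a pure translation by (2, -1).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(2.0, -1.0), (3.0, -1.0), (2.0, 0.0)]
print(affine_from_points(src, dst))
```

In practice many correspondences are used and the model is fitted by (robust) least squares, but the parametric form of the compensation is the same.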
All the previous approaches, independently of the used camera ego-motion compen-
sation technique, have in common that they compute at most one parametric model to
represent the ego-motion between consecutive frames. However, in real applications, the ego-motion computation can be quite challenging, because there can be several feasible solutions, i.e. several camera geometric transformations, and the solution with the least error is not necessarily the correct one. This situation arises as a consequence of several phenomena, such as the aperture problem (43) (related to poorly structured or textured scenes), the presence of independent moving objects, changes in the scene, and limitations of the ego-motion estimation technique itself. In Chap. 3, an efficient and
reliable Bayesian framework is proposed to deal with the uncertainty in the estimation
of the camera ego-motion for tracking applications.
2.2 Tracking of multiple interacting objects
Multiple object tracking can be seen as the generalization of single object tracking, in the sense that the main goal is to recover the trajectories of multiple objects from a video sequence, rather than only one trajectory from a single object. However,
techniques of multiple object tracking are fundamentally different from those of sin-
gle object tracking, due to the particular problems that arise in the presence of two
or more objects. In multiple object tracking, the object detections are unlabeled and
unordered, i.e. the true correspondence between objects and detections is unknown.
The estimation of the true correspondence, called data association, suffers from the
combinatorial explosion of the possible associations, in which the computational cost
inevitably grows exponentially with the number of objects. On the other hand, data
association is a stochastic process in which the estimation of the true detection as-
sociation can be extremely difficult due to the involved uncertainty. Furthermore, in
real situations there can be none, one, or several detections per object. As a result,
there can be false detections and missing detections, in spite of the fact that the goal
of the detector is both to minimize the probability of false alarms and to maximize the
detection probability. This fact increases the complexity of the data association prob-
lem. The false detections arise from scene structures similar to the objects of interest,
which can obfuscate the tracking process. The missing detections can be originated
from changes in the object appearance, which in turn are caused by articulated or de-
formable objects, illumination changes due to weather conditions (typical in outdoor
applications), and variations in the camera point of view. Another source of missing
detections is the partial and total occlusions involved in object interactions. All of these phenomena are also responsible for the noisy character of the detection process.
Tab. 2.1 summarizes the mentioned sources of disturbances along with their effects,
and the derived data association problems.
A great number of strategies have been proposed in the scientific literature to solve
the data association problem. These can be divided into single-scan and multiple-scan
approaches. Single-scan approaches perform the data association considering only the
set of available detections in a specific time step, while the multiple-scan approaches
make use of the detections acquired in a temporal interval, comprising several time
steps. Multiple-scan approaches consider that tracks are basically a sequence of noisy
Disturbance | Effect | Data association problem
Changes in the camera point of view | Variations in the object appearance | Missing detections, noisy detections
Articulated or deformable objects | Variations in the object appearance | Missing detections, noisy detections
Illumination changes | Variations in the object appearance | Missing detections, noisy detections
Object interactions | Partial or total occlusions | Missing detections, noisy detections
Scene structures similar to the objects of interest | Presence of clutter | False detections

Table 2.1: Disturbances in the detection process, their effects, and the resulting problems in data association.
detections. Thus, multiple object tracking consists in seeking the optimal paths in a trellis formed by the temporal sequence of detections. In this way, the data association problem is cast as one of associating sequences of detections. Techniques that accomplish this task are the Viterbi algorithm (44; 45; 46), multiple-scan assignment (47; 48), network-theoretic algorithms (49), and the expectation-maximization (EM) algorithm (50). The preceding approaches compute a single solution that is considered the best one, discarding many feasible hypotheses that could be the true solution. To alleviate this situation, some approaches (51; 52) compute the N best solutions in order to minimize the risk of an incorrect trajectory estimation. An additional
problem is the computational cost. It is known that the multiple-scan approaches are
NP-hard problems in combinatorial optimization, i.e. their complexity is exponential
with the number of objects and detections. The most popular solution to tackle this
problem is the Lagrangian relaxation (53; 54), wherein the N-dimensional assignment problem is divided into a set of assignment problems of lower dimensionality. Another
approach (55) transforms the integer programming problem, posed by the multiple-scan assignment, into a linear programming problem by relaxing the constraints for an integer solution. This allows the problem to be solved efficiently in polynomial time through well-known algorithms, such as the interior point method (56).
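The trellis view used by multiple-scan approaches can be sketched with a small Viterbi-style dynamic program over columns of detections; the one-dimensional detection values and the quadratic motion cost are illustrative assumptions:

```python
# Viterbi-style search for the best track through a trellis of detections:
# one column of candidate detections per time step (illustrative 1-D values).

def best_track(columns, motion_cost=lambda a, b: (a - b) ** 2):
    """Dynamic program over detection columns; returns the minimal-cost path."""
    # cost[j] = best cost of a track ending at detection j of the current column.
    cost = [0.0] * len(columns[0])
    back = []
    for prev, cur in zip(columns, columns[1:]):
        step, new_cost = [], []
        for z in cur:
            cands = [cost[j] + motion_cost(p, z) for j, p in enumerate(prev)]
            j = min(range(len(cands)), key=cands.__getitem__)
            step.append(j)
            new_cost.append(cands[j])
        back.append(step)
        cost = new_cost
    # Backtrack from the best final detection.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for step in reversed(back):
        j = step[j]
        path.append(j)
    return path[::-1]

# Three time steps; the smooth path 1.0 -> 1.2 -> 1.3 should win.
print(best_track([[1.0, 4.0], [3.9, 1.2], [1.3, 6.0]]))
```

The real multiple-scan problem also has to handle multiple tracks, false and missing detections, which is what makes it NP-hard; this sketch shows only the single-track trellis search.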
Inside the group of single-scan approaches, the simplest one is the global nearest
neighbor algorithm (57), also known as the 2D assignment algorithm, which computes a single association between detections and objects by minimizing a distance-based cost function. The main problem with this approach is that many feasible associations
are discarded. On the other hand, the multiple hypotheses tracker (MHT) (58; 59) attempts to keep track of all the possible associations over time. As with the multiple-scan approaches, the complexity of the problem is NP-hard, because the number of associations grows exponentially over time, and also with the number of objects and detections. Therefore, additional methods are required to establish a trade-off between the computational complexity and the handling of multiple association hypotheses. In this respect, one of the most popular methods is the joint probabilistic data association filter (JPDAF) (60; 61), which performs a soft association
between detections and objects. This is carried out by combining all the detections
with all the objects, in such a way that the contribution of each detection to each
object depends on the statistical distance between them. This method prunes away
many unfeasible hypotheses, but also restricts the data association distribution to be
Gaussian, which limits the applicability of the technique. Subsequent works (62; 63) try to overcome this limitation by modeling the data association distribution as a mixture of Gaussians. However, heuristic techniques are necessary to reduce the number of components to make the algorithm computationally manageable. The probabilistic
multiple hypotheses tracker (PMHT) (64; 65) is another alternative for estimating the best data association hypotheses at a moderate computational cost. It assumes that the data association is an independent process in order to work around the problems with pruning. Nevertheless, its performance is similar to that of the JPDAF, although the computational cost is higher. The data association problem has also been addressed with particle filtering, which allows arbitrary data association distributions to be handled in a natural way. Theoretically, algorithms based on particle filtering have the ability to manage the best data association hypotheses with a computational cost independent of the number of objects and detections. The computed association hypotheses constitute an approximation of the true data association distribution, and the approximation becomes more accurate as the number of hypotheses increases. In practice, the performance
of the particle filtering techniques depends on the ability to correctly sample associa-
tion hypotheses from a proposal distribution called importance density. In (66; 67), a
Gibbs sampler is used to sample the data association hypotheses. In a similar way, a
Markov Chain Monte Carlo (MCMC) (68; 69; 70) scheme has been used for drawing
samples that simulate the underlying data association distribution. The main problem
with these samplers is that they are iterative methods that need an unknown number
of iterations to converge. This fact makes them inappropriate for online applications.
Some works (71; 72) overcome this limitation by means of the design of an efficient
and non-iterative proposal distribution that depends on the specific characteristics of the underlying dynamic and likelihood processes of the tracking system. The accuracy of the estimation achieved by techniques based on particle filtering depends on the dimension of the state space. For high-dimensional spaces, the accuracy can be quite low. In order to deal with this drawback, a variance reduction technique,
called Rao-Blackwellization, has been used in (73), which improves the accuracy of
the estimated object trajectories for a given number of samples or hypotheses. An
alternative to particle filtering is the probability hypothesis density (PHD) filter, which can also address missing and false detections. However, its computational cost is exponential with the number of objects. In order to reduce the complexity from exponential to linear, the full posterior distribution is simplified by its first-order moment in (74). Nonetheless, this approach is only satisfactory for multivariate distributions that can be reasonably approximated by their first moment,
The previous works have been designed to track multiple objects with restricted
kinds of interactions among them. For instance, these works are able to handle object
interactions involving trajectory changes but without occlusions, such as a situation
with two people who stop, one in front of the other. In this case, the object detections are
used to efficiently correct the object trajectories. Another kind of interaction that is
successfully addressed involves object occlusions but without trajectory changes, such
as a situation with two people who cross each other maintaining their paths. In this
case, the data association stage can manage the missing detections during the occlusion,
relying on the trajectories being unchanged in order to predict the tracks. However, in
complex object interactions involving trajectory changes and occlusions, the previous
approaches are prone to fail, because the occluded objects have no available detections to correct their trajectories. This limitation arises from the fact that the main tracking techniques for multiple objects have been developed for radar and sonar applications, in which the dynamics of the tracked objects have physical restrictions that make
the complex interactions that arise in visual tracking impossible. Moreover, in the field
of radar and sonar, the objects are handled as point targets that cannot be occluded.
Some works have proposed strategies to deal with the specific problems that arise in
the field of visual tracking. In (75; 76), the data association hypotheses are drawn
using a sampling technique that is able to handle split object detections, i.e. groups of detections that have been generated by the same object. Split detections are typical of background subtraction techniques (77), which are used to detect moving objects in video sequences. In (78), a specific approach for handling object interactions that involve occlusions and changes in trajectories is presented. It creates virtual detections of possibly occluded objects to cope with the changes in trajectories
during the occlusions. However, since the occlusion events are not explicitly modeled,
tracking errors can appear when a virtual detection is associated with an object that is not actually occluded. In order to improve the performance of multiple object tracking in the field of computer vision, a novel Bayesian approach that explicitly models the occlusion phenomenon has been developed. This approach is able to track complex interacting objects whose trajectories change during occlusions. Chap. 4 describes in detail the proposed visual tracking for multiple interacting objects.
Chapter 3
Bayesian Tracking with Moving
Cameras
This chapter starts with a brief overview of the optimal Bayesian framework for general object tracking (Sec. 3.1), also explaining the basics of particle filtering, an approximate inference technique. Next, the developed Bayesian tracking framework for moving cameras, which models the camera motion in a probabilistic way, is presented in Sec. 3.2. Lastly, Secs. 3.3 and 3.4 show respectively how to apply the proposed
Bayesian model to two visual tracking applications for moving cameras: the first one
focused on aerial infrared imagery, and the second one for aerial and terrestrial visible
imagery.
3.1 Optimal Bayesian estimation for object tracking
The Bayesian approach for object tracking aims to estimate a state vector xt that evolves over time, using a sequence of noisy observations z1:t = {zi | i = 1, ..., t} up to time t. The state vector contains all the relevant information for the tracking at time step t, such as the object position, velocity, size, appearance, etc. The noisy observations z1:t (also called measurements or detections) are obtained by one or more detectors, which analyze the video sequence acquired by the camera to either directly compute the object position, or indirectly obtain relevant features that can be related to the object position, such as motion, color, texture, edges, corners, etc.
From a Bayesian perspective, some degree of belief in the state xt at time t is
calculated, using the available prior information (about the object, the camera and
the scene), and the set of observations z1:t. Therefore, the tracking problem can be
formulated as the estimation of the posterior probability density function (pdf) of the
state of the object, p(xt|z1:t), conditioned on the set of observations, where the initial pdf p(x0|z0) ≡ p(x0) is assumed to be known. This probabilistic model for the object tracking can be graphically represented (see Fig. 3.1) by a so-called graphical model, in which the random variables are represented by nodes, and the probabilistic relationships among the variables by arrows.
Figure 3.1: Graphical model for the Bayesian object tracking.
For efficiency purposes, the estimation of the posterior pdf p(xt|z1:t) is recursively
performed through two stages: the prediction of the most probable state vectors us-
ing the prior information, and the update (or correction) of the prediction based on
the observations. The prediction stage involves computing the prior pdf of the state,
p(xt|z1:t-1), at time t via the Chapman-Kolmogorov equation

\[
p(x_t \mid z_{1:t-1}) = \int p(x_t, x_{t-1} \mid z_{1:t-1})\, dx_{t-1} = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}, \tag{3.1}
\]

where p(xt-1|z1:t-1) is the posterior pdf at the previous time step, and p(xt|xt-1) is
the state transition probability, which encodes the prior information, for example the object dynamics along with its uncertainty. The state transition probability is defined by a possibly non-linear function of the state xt-1 and an independent identically distributed noise process vt-1:

\[
x_t = f_t(x_{t-1}, v_{t-1}). \tag{3.2}
\]
The update stage aims to reduce the uncertainty of the prediction, p(xt|z1:t-1), using the newly available observation zt (observations are available at discrete times) through the Bayes rule

\[
p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}, \tag{3.3}
\]
where p(zt|xt) is the likelihood distribution that models the observation process, i.e. it assesses the degree of support given to the observation zt by the prediction xt. The likelihood is given by a possibly non-linear function of the state xt and an independent identically distributed noise process nt:

\[
z_t = h_t(x_t, n_t). \tag{3.4}
\]
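For instance, Eqs. (3.2) and (3.4) can be instantiated by a constant-velocity dynamic model and a position-only observation model; the noise levels below are assumed values:

```python
import random

# Illustrative instantiation of x_t = f_t(x_{t-1}, v_{t-1}) and z_t = h_t(x_t, n_t):
# constant-velocity dynamics with additive Gaussian noise (assumed levels).

def f(state, dt=1.0, q=0.1):
    """Dynamics: state = (position, velocity)."""
    pos, vel = state
    v = random.gauss(0.0, q)               # process noise v_{t-1}
    return (pos + vel * dt, vel + v)

def h(state, r=0.5):
    """Observation: a detector reports a noisy position only."""
    pos, _ = state
    return pos + random.gauss(0.0, r)      # observation noise n_t

random.seed(0)
x = (0.0, 1.0)                             # start at 0 with unit velocity
for _ in range(3):
    x = f(x)
    z = h(x)
print(round(x[0], 2))
```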
The denominator of Eq. (3.3) is simply a normalization constant given by

\[
p(z_t \mid z_{1:t-1}) = \int p(z_t, x_t \mid z_{1:t-1})\, dx_t = \int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})\, dx_t. \tag{3.5}
\]

The posterior p(xt|z1:t) embodies all the available statistical information, allowing the computation of an optimal estimate of the state vector xt, which contains the desired tracking information. Commonly used estimators are the Maximum A Posteriori
(MAP) and the Minimum Mean Square Error (MMSE), given respectively by

\[
\text{MAP:} \quad \hat{x}_t = \arg\max_{x_t}\, p(x_t \mid z_{1:t}), \tag{3.6}
\]

\[
\text{MMSE:} \quad \hat{x}_t = \mathrm{E}\left[ x_t \mid z_{1:t} \right]. \tag{3.7}
\]

Nevertheless, the optimal solution for the posterior probability, given by Eq. (3.3),
cannot be determined analytically in practice, due to the non-linearities and non-Gaussianities of the prior information and observation models. Therefore, it is necessary to use suboptimal methods to obtain an approximate solution. In Sec. 3.1.1, a powerful and popular suboptimal method, called Particle Filtering, is described.
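The prediction-update recursion of Eqs. (3.1), (3.3) and (3.5) can be illustrated numerically on a small discretized state space, in the spirit of the grid-based methods discussed earlier; the transition and likelihood values are assumptions chosen for the example:

```python
# Grid-based Bayes recursion sketch: the state lives on 5 discrete cells.
# Transition and likelihood values are illustrative assumptions.

N = 5
posterior = [1.0 / N] * N                  # p(x_0): uniform initial belief

def predict(post):
    """Chapman-Kolmogorov (Eq. 3.1): the object moves right with prob. 0.8."""
    pred = [0.0] * N
    for i, p in enumerate(post):
        pred[(i + 1) % N] += 0.8 * p       # move one cell
        pred[i] += 0.2 * p                 # stay
    return pred

def update(pred, z):
    """Bayes rule (Eq. 3.3): the likelihood peaks at the detected cell z."""
    lik = [0.9 if i == z else 0.025 for i in range(N)]
    post = [l * p for l, p in zip(lik, pred)]
    c = sum(post)                          # normalization constant, Eq. (3.5)
    return [p / c for p in post]

for z in [1, 2, 3]:                        # simulated detections
    posterior = update(predict(posterior), z)
print(max(range(N), key=lambda i: posterior[i]))   # MAP estimate, Eq. (3.6)
```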
3.1.1 Particle filter approximation
The Particle Filter is an approximate inference method based on Monte Carlo simula-
tion for solving Bayesian filters. In contrast to other approximate inference methods,
such as Extended Kalman Filters, Unscented Kalman Filters and Hidden Markov Mod-
els, Particle Filtering is able to deal with continuous state spaces and nonlinear/non-
Gaussian processes (9), which arise in a natural way in real tracking situations. The
Particle Filtering technique approximates the posterior probability p(xt|z1:t) by a set
of NS weighted random samples (or particles) {x_t^i, i = 1, ..., NS} (2):

\[
p(x_t \mid z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w_t^i\, \delta(x_t - x_t^i), \tag{3.8}
\]

where \delta(x) is the Dirac delta function, {w_t^i, i = 1, ..., NS} is the set of weights associated with the samples, and c = \sum_{i=1}^{N_S} w_t^i is a normalization factor. As the number
of samples becomes very large, this approximation becomes equivalent to the true
posterior pdf.
Samples x_t^i and weights w_t^i are obtained using the concept of importance sampling (2; 79), which aims to reduce the variance of the approximation given by Eq. (3.8)
through Monte Carlo simulation. The set of samples {x_t^i, i = 1, ..., N_S} is drawn from a proposal distribution q(x_t | x_{t-1}, z_t), called the importance density. The optimal q(x_t | x_{t-1}, z_t) should be proportional to p(x_t | z_{1:t}) and should have the same support (the support of a function is the set of points where the function is not zero), in which case the variance would be zero. But this is only a theoretical solution, since it would imply that p(x_t | z_{1:t}) is known. In practice, a proposal distribution as similar as possible to the posterior pdf is chosen, but there is no standard solution, since it depends on the specific characteristics of the tracking application. The choice of the proposal distribution is a key component in the design of Particle Filters, since the quality of the estimation of the posterior pdf depends on the ability to find an appropriate proposal distribution.
The weights w_t^i associated with each sample x_t^i are recursively computed by (2)

w_t^i = w_{t-1}^i · p(z_t | x_t^i) p(x_t^i | x_{t-1}^i) / q(x_t^i | x_{t-1}^i, z_t). (3.9)
The importance sampling principle has a serious drawback, called the degeneracy problem (2): after a few iterations, all the weights except one become insignificant. In order to overcome this problem, several resampling techniques have been proposed in the scientific literature, which introduce an additional sampling step that replicates the most probable samples. A popular resampling strategy is the Sampling Importance Resampling (SIR) algorithm, which makes a random selection of the samples at each time step according to their weights. Thus, the samples with higher weights are selected multiple times, while those with insignificant weights are discarded. After SIR resampling, all the samples have the same weight.
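As an illustrative sketch (not the thesis implementation), the SIR step described above can be written in a few lines of Python; multinomial selection by weight is one of several valid resampling schemes:

```python
import numpy as np

def sir_resample(particles, weights, rng=None):
    """Sampling Importance Resampling (SIR): randomly select N particles
    with replacement according to their weights, so high-weight particles
    are replicated and insignificant ones are discarded."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalization factor c of Eq. 3.8
    n = len(w)
    idx = rng.choice(n, size=n, p=w)     # multinomial selection by weight
    return np.asarray(particles)[idx], np.full(n, 1.0 / n)
```

After each weight update (Eq. 3.9), calling `sir_resample` returns an equally weighted particle set, which mitigates the degeneracy problem at the cost of some loss of diversity.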
3.2 Bayesian tracking framework for moving cameras
In video sequences acquired by a moving camera, the perceived motion of the objects is the composition of the objects' own motion and the camera motion. Consequently, it is necessary to estimate the camera motion in order to obtain the object position. Accordingly, the state vector x_t = {d_t, g_t} must contain not only the object dynamics, d_t (position and velocity over the image plane), but also the camera dynamics, g_t, i.e. the camera ego-motion. The posterior pdf of the state vector is recursively expressed by the equations
p(x_t | z_{1:t}) = p(z_t | x_t) p(x_t | z_{1:t-1}) / p(z_t | z_{1:t-1}) (3.10)

p(x_t | z_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | z_{1:t-1}) dx_{t-1}. (3.11)
The transition probability p(x_t | x_{t-1}) = p(d_t, g_t | d_{t-1}, g_{t-1}) encodes the information about the object and camera dynamics, along with their uncertainty. If the camera motion is not considered, the object dynamics can be modeled by the linear function

d_t = M d_{t-1}, (3.12)

where M is a matrix that represents a first-order linear system of constant velocity. This object dynamic model is a reasonable approximation for a wide range of object tracking applications, provided that the camera frame rate is high enough. The camera
dynamics is modeled by a geometric transformation g_t that ideally is a projective camera model, although, depending on the camera and scene disposition, it can be simplified to an affine or Euclidean transformation. For example, in aerial tracking systems, an affine geometric transformation is a satisfactory approximation of the projective camera model, since the depth relief of the objects in the scene is small enough compared to the average depth, and the field of view is also small (80). The joint dynamic model for the camera and the object is expressed as the composition of both individual models
d_t = g_t M d_{t-1}. (3.13)
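A minimal sketch of the composed model of Eq. 3.13, assuming a hypothetical state layout d = [x, y, vx, vy] and a 3×3 homogeneous affine matrix for g_t (the thesis does not fix these conventions; only the position is warped here):

```python
import numpy as np

DT = 1.0  # assumed inter-frame interval
# First-order constant-velocity model M of Eq. 3.12 for d = [x, y, vx, vy]
M = np.array([[1.0, 0.0, DT, 0.0],
              [0.0, 1.0, 0.0, DT],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def joint_dynamics(d_prev, G):
    """Eq. 3.13: advance the object state with M, then warp the predicted
    position with the 3x3 homogeneous affine camera transformation G."""
    d = M @ np.asarray(d_prev, dtype=float)
    p = G @ np.array([d[0], d[1], 1.0])   # warp the position only
    d[0], d[1] = p[0] / p[2], p[1] / p[2]
    return d
```

With G equal to the identity, the model reduces to the camera-free Eq. 3.12.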
Based on this joint dynamic model, the transition probability p(x_t | x_{t-1}) can be expressed as

p(x_t | x_{t-1}) = p(d_t, g_t | d_{t-1}, g_{t-1}) = p(d_t | d_{t-1}, g_{t-1:t}) p(g_t | d_{t-1}, g_{t-1})
                = p(d_t | d_{t-1}, g_t) p(g_t), (3.14)
where it has been assumed that, on the one hand, the current object position is conditionally independent of the camera motion in the previous time step (as the proposed joint dynamic model states), and, on the other hand, the current camera motion is conditionally independent of both the camera motion and the object position in previous time steps. This last assumption results from the fact that the camera ego-motion is completely random, not following any specific pattern. The probability term p(d_t | d_{t-1}, g_t) models the uncertainty of the proposed joint dynamic model as

p(d_t | d_{t-1}, g_t) = N(d_t; g_t M d_{t-1}, σ²_tr), (3.15)

where N(x; μ, σ²) is a Gaussian or Normal distribution with mean μ and variance σ². Thus, the term σ²_tr represents the unknown disturbances of the joint dynamic model.
The other probability term in Eq. 3.14, p(g_t), expresses the probability that a specific geometric transformation represents the true camera motion between consecutive time steps. This is typically computed in a deterministic way using an image registration algorithm (27), which amounts to expressing p(g_t) as

p(g_t) = δ(g_t − g_t^j), (3.16)
where g_t^j is the geometric transformation obtained by the image registration technique. However, this approximation can fail in situations in which the aperture problem (43; 81) is quite significant and/or the assumption of only one global motion does not hold, for instance, in the presence of independently moving objects. Under these circumstances there are several putative geometric transformations that can explain the camera ego-motion. Moreover, the best geometric transformation according to some error or cost function is not necessarily the actual camera ego-motion, due to the noise and nonlinearities involved in the estimation process. In order to satisfactorily deal with this situation, g_t is treated as a random variable, rather than as a parameter computed in a deterministic way. The specific computation of p(g_t) depends on the tracking application and type of imagery. Two different methods are proposed in Secs. 3.3 and 3.4 for infrared and visible imagery, respectively. In any case, they have in common that they compute an approximation of p(g_t) as
p(g_t) ≈ Σ_{j=1}^{N_g} w_t^j δ(g_t − g_t^j), (3.17)

where N_g is the number of geometric transformations used to represent p(g_t), {g_t^j | j = 1, ..., N_g} are the best transformation candidates to model the camera ego-motion, and w_t^j is the weight of g_t^j, which evaluates how well the transformation represents the camera ego-motion.
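Sampling from the delta-mixture approximation of Eq. 3.17 reduces to selecting candidate transformations according to their weights; a brief sketch, with function and argument names chosen for illustration:

```python
import numpy as np

def sample_camera_motions(candidates, weights, n_samples, rng=None):
    """Draw camera-motion samples from the mixture of deltas in Eq. 3.17:
    each of the Ng candidate transformations g_t^j is selected with
    probability proportional to its weight w_t^j."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize the mixture weights
    idx = rng.choice(len(candidates), size=n_samples, p=w)
    return [candidates[j] for j in idx]
```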
The likelihood function p(z_t | x_t) in Eq. 3.10 depends on the kind of imagery and on the type of object being tracked. Two different models have been developed: one based on the detection of blob regions for infrared imagery, and another based on color histograms for visible video sequences, which are described respectively in Secs. 3.3 and 3.4. In general terms, the resulting likelihood will be non-Gaussian, nonlinear and multi-modal, due to the presence of clutter and of objects similar to the tracked one. The initial pdf p(x_0 | z_0) ≡ p(x_0), called the prior, can be initialized as a Gaussian distribution using the information given by an object detection algorithm, as in (18; 19; 32; 33; 34; 39; 40). Another alternative is to use the ground-truth information (if available) to initialize a Kronecker delta function δ(x_0).
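A possible sketch of the Gaussian prior initialization from a detector output; the state layout and the standard deviations are illustrative assumptions, not values taken from the thesis:

```python
import numpy as np

def init_prior_particles(detection_xy, n_particles, pos_std=3.0,
                         vel_std=1.0, rng=None):
    """Gaussian prior p(x0) centered on an object detector output.
    pos_std (pixels) and vel_std (pixels/frame) are illustrative values."""
    rng = np.random.default_rng() if rng is None else rng
    x0, y0 = detection_xy
    d = np.empty((n_particles, 4))        # particle states [x, y, vx, vy]
    d[:, 0] = rng.normal(x0, pos_std, n_particles)
    d[:, 1] = rng.normal(y0, pos_std, n_particles)
    d[:, 2:] = rng.normal(0.0, vel_std, (n_particles, 2))
    w = np.full(n_particles, 1.0 / n_particles)  # uniform initial weights
    return d, w
```

The delta-function alternative simply places every particle at the ground-truth state with uniform weights.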
3.3 Object tracking in aerial infrared imagery
This section presents the developed object tracking approach for aerial infrared imagery. In contrast to visual-range images, infrared images have low signal-to-noise ratios, objects poorly contrasted with the background, and non-repeatable object signatures. These drawbacks, along with the competing background clutter and the illumination changes due to weather conditions, make the tracking task extremely difficult. In addition, the unpredictable camera ego-motion, resulting from the fact that the camera is on board an aerial platform, distorts the spatio-temporal correlation of the video sequence, negatively affecting the tracking performance.
All the aforementioned problems are addressed by a tracking strategy based on the Bayesian tracking framework for moving cameras proposed in Sec. 3.2. Accordingly, the posterior pdf of the state vector, p(x_t | z_{1:t}), is recursively computed by Eqs. 3.10 and 3.11.
The transition probability p(x_t | x_{t-1}), which encodes the joint camera and object dynamic model, is given by Eq. 3.14, where the prior probability p(g_t) of the geometric transformation depends on the specific type of imagery. For the ongoing tracking application dealing with infrared imagery, the probability p(g_t^j) of a specific geometric transformation g_t^j is based on the quality of the image alignment between consecutive frames achieved by g_t^j. The quality of the image alignment (or of the ego-motion compensation) is computed by means of the Mean Square Error function, mse(x, y), between the current frame I_t and the previous frame I_{t-1} warped by the transformation g_t^j. Thus, the probability p(g_t^j) is mathematically expressed as

p(g_t^j) = N(mse(I_t, g_t^j ∘ I_{t-1}); 0, σ²_g), (3.18)

where N(x; μ, σ²) is a Gaussian distribution with mean μ and variance σ², and σ²_g is the expected variance of the image alignment process. Notice that I_t is an infrared intensity image.
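Eq. 3.18 can be sketched as follows, where `warped_prev` stands for I_{t-1} already warped by the candidate g_t^j (the warping step itself is omitted for brevity):

```python
import numpy as np

def alignment_probability(frame_t, warped_prev, sigma_g):
    """Eq. 3.18: evaluate a zero-mean Gaussian of variance sigma_g**2 at the
    MSE between the current frame and the warped previous frame; better
    alignment (lower MSE) yields a higher probability."""
    diff = frame_t.astype(float) - warped_prev.astype(float)
    mse = np.mean(diff ** 2)
    return np.exp(-mse ** 2 / (2.0 * sigma_g ** 2)) / (np.sqrt(2.0 * np.pi) * sigma_g)
```

Evaluating this for every candidate g_t^j, and normalizing, yields the weights w_t^j of Eq. 3.17.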
Finding an observation model for the likelihood p(z_t | x_t) in airborne infrared imagery that appropriately describes the object appearance and its variations over time is quite challenging, due to the aforementioned characteristics of infrared imagery. The most robust and reliable object property is the presence of bright regions, or at least regions that are brighter than their surrounding neighborhood, which typically
Figure 3.2: Two consecutive frames of an infrared sequence acquired by an airborne camera.
correspond to the engine and exhaust-pipe area of the object. Based on this fact, the likelihood function uses an observation model that aims to detect the main bright regions of the target. This is accomplished by a rotationally symmetric Laplacian of Gaussian (LoG) filter, characterized by a sigma parameter that is tuned to the smallest dimension of the object, so that the filter response is maximum in bright regions with a size similar to that of the tracked object. The main handicap of this observation model is its lack of distinctiveness, since any bright region of adequate size can be mistaken for the target object. As a consequence, the resulting LoG filter response is strongly multi-modal. This fact, coupled with the camera ego-motion, dramatically complicates a reliable estimation of the state vector. This situation is illustrated in Figs. 3.2 and 3.3. The first one, Fig. 3.2, shows two consecutive frames, (a) and (b), of an infrared sequence acquired by an airborne camera, in which the target object has been enclosed by a rectangle. Fig. 3.3 shows the LoG filter response related to Fig. 3.2(b), where the image itself has been projected over the filter response for a better interpretation. The multi-modality is clearly observed, and in theory any of the modes could be the right object position. Moreover, if only the object dynamics is considered, the closest mode to the predicted object location (marked by a vertical black line) is not the true object location, because of the effects of the camera ego-motion.
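A self-contained sketch of the rotationally symmetric LoG observation stage; the kernel-size rule and the sign convention (negating the LoG so that bright blobs give positive peaks) are implementation choices, not details taken from the thesis:

```python
import numpy as np

def log_kernel(sigma, size=None):
    """Rotationally symmetric Laplacian-of-Gaussian kernel; sigma is tuned
    to the smallest dimension of the target, as described in the text."""
    if size is None:
        size = int(2 * np.ceil(3 * sigma) + 1)
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    r2 = x ** 2 + y ** 2
    k = (r2 - 2.0 * sigma ** 2) / sigma ** 4 * np.exp(-r2 / (2.0 * sigma ** 2))
    return k - k.mean()          # zero mean: flat regions give zero response

def log_response(image, sigma):
    """Correlate the image (zero-padded) with the negated LoG so that bright
    blobs of roughly matching scale produce positive peaks."""
    k = -log_kernel(sigma)
    r = k.shape[0] // 2
    padded = np.pad(np.asarray(image, dtype=float), r)
    out = np.zeros(image.shape, dtype=float)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out
```

The resulting response map plays the role of the observation z_t; as the text notes, it is typically strongly multi-modal, with one peak per sufficiently bright blob.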
Figure 3.3: Multimodal LoG filter response related to Fig. 3.2(b).
Figure 3.4: Likelihood distribution related to Fig. 3.3.
The likelihood probability can be simplified by

p(z_t | x_t) = p(z_t | d_t, g_t) = p(z_t | d_t), (3.19)

assuming that z_t is conditionally independent of g_t given d_t. Then, p(z_t | d_t) is expressed by the Gaussian distribution

p(z_t | d_t) = N(z_t; H d_t, σ²_L), (3.20)

where z_t is the LoG filter response of the frame I_t, H is a matrix that selects the object positional information, and the variance σ²_L is set to highlight the main modes of z_t, while discarding the less significant ones. This is illustrated in Fig. 3.4, where only the most significant modes of Fig. 3.3 are highlighted.
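One plausible reading of Eq. 3.20 (the thesis does not spell out the implementation) evaluates the LoG response map at each particle's position H d and suppresses the weaker modes through σ_L:

```python
import numpy as np

# H of Eq. 3.20: selects the positional components of d = [x, y, vx, vy]
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

def likelihood(log_map, d, sigma_L):
    """Hypothetical sketch of Eq. 3.20: read the LoG response at the
    particle position H d and compare it with the strongest response;
    sigma_L controls how many modes survive."""
    x, y = np.rint(H @ np.asarray(d, dtype=float)).astype(int)
    h, w = log_map.shape
    if not (0 <= x < w and 0 <= y < h):
        return 0.0                       # particles outside the image
    gap = log_map.max() - log_map[y, x]  # distance to the dominant mode
    return float(np.exp(-gap ** 2 / (2.0 * sigma_L ** 2)))
```

A small σ_L keeps only particles sitting on the most significant modes, mimicking the effect shown in Fig. 3.4.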
As both the dynamic and observation models are nonlinear and non-Gaussian, the posterior pdf cannot be analytically determined, and therefore the use of approximate inference methods is necessary. In the next section, a Particle Filtering strategy is presented to obtain an approximate solution of the posterior pdf.
3.3.1 Particle filter approximation
The posterior pdf p(x_t | z_{1:t}) is approximated by means of a Particle Filter as

p(x_t | z_{1:t}) ≈ (1/c) Σ_{i=1}^{N_S} w_t^i δ(x_t − x_t^i), (3.21)
where the samples x_t^i are drawn from a proposal distribution based on the likelihood and the prior probability of the camera motion

q(x_t | x_{t-1}, z_t) = p(z_t | d_t) p(g_t), (3.22)

which is an efficient simplification of the optimal, but intractable, importance density function (9)

q(x_t | x_{t-1}, z_t) = p(x_t | x_{t-1}, z_t). (3.23)
The samples x_t^i = {d_t^i, g_t^i} are drawn from the proposal distribution by a hierarchical sampling strategy, which firstly draws samples g_t^i from p(g_t), and then draws samples d_t^i from p(z_t | d_t). The sampling procedure for obtaining samples g_t^i from p(g_t) is based on the image registration algorithm presented in (82). This method assumes an initial geometric transformation t_t^i, and then uses the whole image intensity information to compute a global affine transformation g_t^i, which is a candidate for representing the true camera motion. The method explicitly accounts for global variations in image intensities in order to be robust to illumination changes. However, the computed candidate g_t^i will only be a reasonable approximation of the camera motion if the initial geometric transformation t_t^i is close to the geometric transformation that represents the actual camera motion. This means that the image in the previous time step, warped by the initial transformation, must be closely aligned with the current image to achieve a satisfactory result. This limitation derives from the optimization strategy used in the image registration algorithm, which converges to the closest mode given an initial transformation.
As a consequence, if the two images are not closely aligned, the computed solution will probably correspond to a local mode that does not represent the true camera motion. By default, t_t^i is a 3×3 identity matrix that represents the previous image without warping. This approach is inefficient in airborne visual tracking, since the camera can undergo strong displacements that cannot be satisfactorily compensated. To overcome this problem, the previous image registration technique has been improved using several initial geometric transformations {t_t^i | i = 1, ..., N_S}, obtaining in turn a set of camera ego-motion candidates {g_t^i | i = 1, ..., N_S}. The set of initial transformations is computed so that at least one of them is relatively close to the actual camera motion, allowing the image registration algorithm to effectively compute the correct geometric transformation. In this context, the concept of closeness between geometric transformations depends, on the one hand, on the magnitude of the camera motion, and, on the other hand, on the capability of the image registration algorithm itself to rectify misaligned images. For example, the ideal situation would be that the magnitude of the camera motion were lower than the maximum displacement that the image registration algorithm is able to rectify. For the purpose of measuring the magnitude of the camera motion, a subset of video sequences belonging to the AMCOM dataset (see Sec. 3.3.2) has been used as a training set to compute the actual camera motion. These sequences have been acquired by different infrared cameras on board a plane. The computation of the camera motion has been supervised by a user, who not only guides the image alignment, but also evaluates whether the obtained result is accurate enough to be considered the real camera motion. As a result, a set of affine transformations is obtained, which describe the typical camera movements. Regarding the image registration algorithm,