
  • Lecture Notes in Computer Science 5748

    Commenced Publication in 1973

    Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

    Editorial Board

    David Hutchison, Lancaster University, UK

    Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA

    Josef Kittler, University of Surrey, Guildford, UK

    Jon M. Kleinberg, Cornell University, Ithaca, NY, USA

    Alfred Kobsa, University of California, Irvine, CA, USA

    Friedemann Mattern, ETH Zurich, Switzerland

    John C. Mitchell, Stanford University, CA, USA

    Moni Naor, Weizmann Institute of Science, Rehovot, Israel

    Oscar Nierstrasz, University of Bern, Switzerland

    C. Pandu Rangan, Indian Institute of Technology, Madras, India

    Bernhard Steffen, University of Dortmund, Germany

    Madhu Sudan, Microsoft Research, Cambridge, MA, USA

    Demetri Terzopoulos, University of California, Los Angeles, CA, USA

    Doug Tygar, University of California, Berkeley, CA, USA

    Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

  • Joachim Denzler, Gunther Notni, Herbert Süße (Eds.)

    Pattern Recognition

    31st DAGM Symposium, Jena, Germany, September 9-11, 2009. Proceedings


  • Volume Editors

    Joachim Denzler, Herbert Süße
    Friedrich-Schiller-Universität Jena, Lehrstuhl Digitale Bildverarbeitung
    Ernst-Abbe-Platz 2, 07743 Jena, Germany
    E-mail: {joachim.denzler, herbert.suesse}@uni-jena.de

    Gunther Notni
    Fraunhofer-Institut für Angewandte Optik und Feinmechanik
    Albert-Einstein-Str. 7, 07745 Jena, Germany
    E-mail: [email protected]

    Library of Congress Control Number: 2009933619

    CR Subject Classification (1998): I.5, I.4, I.3, I.2.10, F.2.2, I.4.8, I.4.1

    LNCS Sublibrary: SL 6: Image Processing, Computer Vision, Pattern Recognition, and Graphics

    ISSN 0302-9743
    ISBN-10 3-642-03797-6 Springer Berlin Heidelberg New York
    ISBN-13 978-3-642-03797-9 Springer Berlin Heidelberg New York

    This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

    springer.com

    © Springer-Verlag Berlin Heidelberg 2009
    Printed in Germany

    Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
    Printed on acid-free paper    SPIN: 12743339    06/3180    5 4 3 2 1 0

  • Preface

    In 2009, for the second time in a row, Jena hosted an extraordinary event. In 2008, Jena celebrated the 450th birthday of the Friedrich Schiller University of Jena with the motto "Lichtgedanken" (flashes of brilliance). This year, for almost one week, Jena became the center of the pattern recognition research community of the German-speaking countries in Europe by hosting the 31st Annual Symposium of the Deutsche Arbeitsgemeinschaft für Mustererkennung (DAGM).

    Jena is a special place for this event for several reasons. Firstly, it is the first time that the University of Jena has been selected to host this conference, and it is an opportunity to present the city of Jena as offering a fascinating combination of historic sites, an intellectual past, a delightful countryside, and innovative, international research and industry within Thuringia. Second, the conference takes place in an environment that has been heavily influenced by optics research and industry for more than 150 years. Third, in several schools and departments at the University of Jena, research institutions and companies in the fields of pattern recognition, 3D computer vision, and machine learning play an important role. The university's involvement includes such diverse activities as industrial inspection, medical image processing and analysis, remote sensing, biomedical analysis, and cutting-edge developments in the field of physics, such as the recent development of the new terahertz imaging technique. Thus, DAGM 2009 was an important event to transfer basic research results to different applications in such areas. Finally, the fact that the conference was jointly organized by the Chair for Computer Vision of the Friedrich Schiller University of Jena and the Fraunhofer Institute IOF reflects the strong cooperation between these two institutions during the past and, more generally, between research, applied research, and industry in this field. The establishment of a Graduate School of Computer Vision and Image Interpretation, which is a joint facility of the Technical University of Ilmenau and the Friedrich Schiller University of Jena, is a recent achievement that will focus and strengthen the computer vision and pattern recognition activities in Thuringia.

    The technical program covered all aspects of pattern recognition and consisted of oral presentations and poster contributions, which were treated equally and given the same number of pages in the proceedings. Each section is devoted to one specific topic and contains all oral and poster papers for this topic, sorted alphabetically by first author. A very strict paper selection process was used, resulting in an acceptance rate of less than 45%. Therefore, the proceedings meet the strict requirements for publication in the Springer Lecture Notes in Computer Science series. Although not reflected in these proceedings, one additional point that also made this year's DAGM special is the Young Researchers Forum, a special session for promoting scientific interactions between excellent


    young researchers. The impressive scientific program of the conference is due to the enormous efforts of the reviewers of the Program Committee. We thank all of those whose dedication and timely reporting helped to ensure that the highly selective reviewing process was completed on schedule.

    We are also proud to have had three renowned invited speakers at the conference:

    Josef Kittler (University of Surrey, UK)
    Reinhard Klette (University of Auckland, New Zealand)
    Kyros Kutulakos (University of Toronto, Canada)

    We extend our sincere thanks to everyone involved in the organization of this event, especially the members of the Chair for Computer Vision and the Fraunhofer Institute IOF. In particular, we are indebted to Erik Rodner for organizing everything related to the conference proceedings, to Wolfgang Ortmann for installation and support in the context of the Web presentation and the reviewing and submission system, to Kathrin Mäusezahl for managing the conference office and arranging the conference dinner, and to Marcel Brückner, Michael Kemmler, and Marco Körner for the local organization.

    Finally, we would like to thank our sponsors, OLYMPUS Europe Foundation Science for Life, STIFT Thuringia, MVTec Software GmbH, Telekom Laboratories, Allied Vision Technologies, Desko GmbH, Jenoptik AG, and Optonet e.V., for their donations and helpful support, which contributed to several awards at the conference and made reasonable registration fees possible. We especially appreciate support from industry because it indicates faithfulness to our community and recognizes the importance of pattern recognition and related areas to business and industry.

    We were happy to host the 31st Annual Symposium of DAGM in Jena and look forward to DAGM 2010 in Darmstadt.

    September 2009

    Joachim Denzler
    Gunther Notni
    Herbert Süße

  • Organization

    Program Committee

    T. Aach, RWTH Aachen
    H. Bischof, TU Graz
    J. Buhmann, ETH Zurich
    H. Burkhardt, University of Freiburg
    D. Cremers, University of Bonn
    J. Denzler, University of Jena
    G. Fink, TU Dortmund
    B. Flach, TU Dresden
    W. Förstner, University of Bonn
    U. Franke, Daimler AG
    M. Franz, HTWG Konstanz
    D. Gavrila, Daimler AG
    M. Goesele, TU Darmstadt
    F.A. Hamprecht, University of Heidelberg
    J. Hornegger, University of Erlangen
    B. Jähne, University of Heidelberg
    X. Jiang, University of Münster
    R. Koch, University of Kiel
    U. Köthe, University of Heidelberg
    W.G. Kropatsch, TU Wien
    G. Linß, TU Ilmenau
    H. Mayer, BW-Universität München
    R. Mester, University of Frankfurt
    B. Michaelis, University of Magdeburg
    K.-R. Müller, TU Berlin
    H. Ney, RWTH Aachen
    G. Notni, Fraunhofer IOF Jena
    K. Obermayer, TU Berlin
    G. Rätsch, MPI Tübingen
    G. Rigoll, TU München
    K. Rohr, University of Heidelberg
    B. Rosenhahn, University of Hannover
    S. Roth, TU Darmstadt
    B. Schiele, University of Darmstadt
    C. Schnörr, University of Heidelberg
    B. Schölkopf, MPI Tübingen
    G. Sommer, University of Kiel
    T. Vetter, University of Basel
    F.M. Wahl, University of Braunschweig
    J. Weickert, Saarland University

  • Prizes 2007

    Olympus Prize

    The Olympus Prize 2007 was awarded to

    Bodo Rosenhahn and Gunnar Rätsch

    for their outstanding contributions to the area of computer vision and machine learning.

    DAGM Prizes

    The main prize for 2007 was awarded to:

    Jürgen Gall, Bodo Rosenhahn, Hans-Peter Seidel: Clustered Stochastic Optimization for Object Recognition and Pose Estimation

    Christopher Zach, Thomas Pock, Horst Bischof: A Duality-Based Approach for Realtime TV-L1 Optical Flow

    Further DAGM prizes for 2007 were awarded to:

    Kevin Köser, Bogumil Bartczak, Reinhard Koch: An Analysis-by-Synthesis Camera Tracking Approach Based on Free-Form Surfaces

    Volker Roth, Bernd Fischer: The kernelHMM: Learning Kernel Combinations in Structured Output Domains

  • Prizes 2008

    Olympus Prize

    The Olympus Prize 2008 was awarded to

    Bastian Leibe

    for his outstanding contributions to the area of closely coupled object categorization, segmentation, and tracking.

    DAGM Prizes

    The main prize for 2008 was awarded to:

    Christoph H. Lampert, Matthew B. Blaschko: A Multiple Kernel Learning Approach to Joint Multi-class Object Detection

    Further DAGM prizes for 2008 were awarded to:

    Björn Andres, Ullrich Köthe, Moritz Helmstaedter, Winfried Denk, Fred A. Hamprecht: Segmentation of SBFSEM Volume Data of Neural Tissue by Hierarchical Classification

    Kersten Petersen, Janis Fehr, Hans Burkhardt: Fast Generalized Belief Propagation for MAP Estimation on 2D and 3D Grid-Like Markov Random Fields

    Kai Krajsek, Rudolf Mester, Hanno Scharr: Statistically Optimal Averaging for Image Restoration and Optical Flow Estimation

  • Table of Contents

    Motion and Tracking

    A 3-Component Inverse Depth Parameterization for Particle Filter SLAM .... 1
    Evren Imre and Marie-Odile Berger

    An Efficient Linear Method for the Estimation of Ego-Motion from Optical Flow .... 11
    Florian Raudies and Heiko Neumann

    Localised Mixture Models in Region-Based Tracking .... 21
    Christian Schmaltz, Bodo Rosenhahn, Thomas Brox, and Joachim Weickert

    A Closed-Form Solution for Image Sequence Segmentation with Dynamical Shape Priors .... 31
    Frank R. Schmidt and Daniel Cremers

    Markerless 3D Face Tracking .... 41
    Christian Walder, Martin Breidt, Heinrich Bülthoff, Bernhard Schölkopf, and Cristobal Curio

    Pedestrian Recognition and Automotive Applications

    The Stixel World - A Compact Medium Level Representation of the 3D-World .... 51
    Hernan Badino, Uwe Franke, and David Pfeiffer

    Global Localization of Vehicles Using Local Pole Patterns .... 61
    Claus Brenner

    Single-Frame 3D Human Pose Recovery from Multiple Views .... 71
    Michael Hofmann and Dariu M. Gavrila

    Dense Stereo-Based ROI Generation for Pedestrian Detection .... 81
    Christoph Gustav Keller, David Fernandez Llorca, and Dariu M. Gavrila

    Pedestrian Detection by Probabilistic Component Assembly .... 91
    Martin Rapus, Stefan Munder, Gregory Baratoff, and Joachim Denzler

    High-Level Fusion of Depth and Intensity for Pedestrian Classification .... 101
    Marcus Rohrbach, Markus Enzweiler, and Dariu M. Gavrila


    Features

    Fast and Accurate 3D Edge Detection for Surface Reconstruction .... 111
    Christian Bähnisch, Peer Stelldinger, and Ullrich Köthe

    Boosting Shift-Invariant Features .... 121
    Thomas Hörnlein and Bernd Jähne

    Harmonic Filters for Generic Feature Detection in 3D .... 131
    Marco Reisert and Hans Burkhardt

    Increasing the Dimension of Creativity in Rotation Invariant Feature Design Using 3D Tensorial Harmonics .... 141
    Henrik Skibbe, Marco Reisert, Olaf Ronneberger, and Hans Burkhardt

    Training for Task Specific Keypoint Detection .... 151
    Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua

    Combined GKLT Feature Tracking and Reconstruction for Next Best View Planning .... 161
    Michael Trummer, Christoph Munkelt, and Joachim Denzler

    Single-View and 3D Reconstruction

    Non-parametric Single View Reconstruction of Curved Objects Using Convex Optimization .... 171
    Martin R. Oswald, Eno Töppe, Kalin Kolev, and Daniel Cremers

    Discontinuity-Adaptive Shape from Focus Using a Non-convex Prior .... 181
    Krishnamurthy Ramnath and Ambasamudram N. Rajagopalan

    Making Shape from Shading Work for Real-World Images .... 191
    Oliver Vogel, Levi Valgaerts, Michael Breuß, and Joachim Weickert

    Learning and Classification

    Deformation-Aware Log-Linear Models .... 201
    Tobias Gass, Thomas Deselaers, and Hermann Ney

    Multi-view Object Detection Based on Spatial Consistency in a Low Dimensional Space .... 211
    Gurman Gill and Martin Levine

    Active Structured Learning for High-Speed Object Detection .... 221
    Christoph H. Lampert and Jan Peters

    Face Reconstruction from Skull Shapes and Physical Attributes .... 232
    Pascal Paysan, Marcel Lüthi, Thomas Albrecht, Anita Lerch, Brian Amberg, Francesco Santini, and Thomas Vetter


    Sparse Bayesian Regression for Grouped Variables in Generalized Linear Models .... 242
    Sudhir Raman and Volker Roth

    Learning with Few Examples by Transferring Feature Relevance .... 252
    Erik Rodner and Joachim Denzler

    Pattern Recognition and Estimation

    Simultaneous Estimation of Pose and Motion at Highly Dynamic Turn Maneuvers .... 262
    Alexander Barth, Jan Siegemund, Uwe Franke, and Wolfgang Förstner

    Making Archetypal Analysis Practical .... 272
    Christian Bauckhage and Christian Thurau

    Fast Multiscale Operator Development for Hexagonal Images .... 282
    Bryan Gardiner, Sonya Coleman, and Bryan Scotney

    Optimal Parameter Estimation with Homogeneous Entities and Arbitrary Constraints .... 292
    Jochen Meidow, Wolfgang Förstner, and Christian Beder

    Detecting Hubs in Music Audio Based on Network Analysis .... 302
    Alexandros Nanopoulos

    A Gradient Descent Approximation for Graph Cuts .... 312
    Alparslan Yildiz and Yusuf Sinan Akgul

    Stereo and Multi-view Reconstruction

    A Stereo Depth Recovery Method Using Layered Representation of the Scene .... 322
    Tarkan Aydin and Yusuf Sinan Akgul

    Reconstruction of Sewer Shaft Profiles from Fisheye-Lens Camera Images .... 332
    Sandro Esquivel, Reinhard Koch, and Heino Rehse

    A Superresolution Framework for High-Accuracy Multiview Reconstruction .... 342
    Bastian Goldlücke and Daniel Cremers

    View Planning for 3D Reconstruction Using Time-of-Flight Camera Data .... 352
    Christoph Munkelt, Michael Trummer, Peter Kühmstedt, Gunther Notni, and Joachim Denzler


    Real Aperture Axial Stereo: Solving for Correspondences in Blur .... 362
    Rajiv Ranjan Sahay and Ambasamudram N. Rajagopalan

    Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling .... 372
    Alexander Schick and Rainer Stiefelhagen

    Image-Based Lunar Surface Reconstruction .... 382
    Stephan Wenger, Anita Sellent, Ole Schütt, and Marcus Magnor

    Image Analysis and Applications

    Use of Coloured Tracers in Gas Flow Experiments for a Lagrangian Flow Analysis with Increased Tracer Density .... 392
    Christian Bendicks, Dominique Tarlet, Bernd Michaelis, Dominique Thévenin, and Bernd Wunderlich

    Reading from Scratch - A Vision-System for Reading Data on Micro-structured Surfaces .... 402
    Ralf Dragon, Christian Becker, Bodo Rosenhahn, and Jörn Ostermann

    Diffusion MRI Tractography of Crossing Fibers by Cone-Beam ODF Regularization .... 412
    Hans-Heino Ehricke, Kay M. Otto, Vinoid Kumar, and Uwe Klose

    Feature Extraction Algorithm for Banknote Textures Based on Incomplete Shift Invariant Wavelet Packet Transform .... 422
    Stefan Glock, Eugen Gillich, Johannes Schaede, and Volker Lohweg

    Video Super Resolution Using Duality Based TV-L1 Optical Flow .... 432
    Dennis Mitzel, Thomas Pock, Thomas Schoenemann, and Daniel Cremers

    HMM-Based Defect Localization in Wire Ropes - A New Approach to Unusual Subsequence Recognition .... 442
    Esther-Sabrina Platzer, Josef Nägele, Karl-Heinz Wehking, and Joachim Denzler

    Beating the Quality of JPEG 2000 with Anisotropic Diffusion .... 452
    Christian Schmaltz, Joachim Weickert, and Andrés Bruhn

    Decoding Color Structured Light Patterns with a Region Adjacency Graph .... 462
    Christoph Schmalz

    Residual Images Remove Illumination Artifacts! .... 472
    Tobi Vaudrey and Reinhard Klette


    Superresolution and Denoising of 3D Fluid Flow Estimates .... 482
    Andrey Vlasenko and Christoph Schnörr

    Spatial Statistics for Tumor Cell Counting and Classification .... 492
    Oliver Wirjadi, Yoo-Jin Kim, and Thomas Breuel

    Segmentation

    Quantitative Assessment of Image Segmentation Quality by Random Walk Relaxation Times .... 502
    Björn Andres, Ullrich Köthe, Andreea Bonea, Boaz Nadler, and Fred A. Hamprecht

    Applying Recursive EM to Scene Segmentation .... 512
    Alexander Bachmann

    Adaptive Foreground/Background Segmentation Using Multiview Silhouette Fusion .... 522
    Tobias Feldmann, Lars Dießelberg, and Annika Wörner

    Evaluation of Structure Recognition Using Labelled Facade Images .... 532
    Nora Ripperda and Claus Brenner

    Using Lateral Coupled Snakes for Modeling the Contours of Worms .... 542
    Qing Wang, Olaf Ronneberger, Ekkehard Schulze, Ralf Baumeister, and Hans Burkhardt

    Globally Optimal Finsler Active Contours .... 552
    Christopher Zach, Liang Shan, and Marc Niethammer

    Author Index .... 563

  • A 3-Component Inverse Depth Parameterization for Particle Filter SLAM

    Evren Imre and Marie-Odile Berger

    INRIA Grand Est, Nancy, France

    Abstract. The non-Gaussianity of the depth estimate uncertainty degrades the performance of monocular extended Kalman filter SLAM (EKF-SLAM) systems employing a 3-component Cartesian landmark parameterization, especially in low-parallax configurations. Even particle filter SLAM (PF-SLAM) approaches are affected, as they utilize EKFs for estimating the map. The inverse depth parameterization (IDP) alleviates this problem through a redundant representation, but at the price of increased computational complexity. The authors show that such a redundancy does not exist in PF-SLAM, hence the performance advantage of the IDP comes almost without an increase in the computational cost.

    1 Introduction

    The monocular simultaneous localization and mapping (SLAM) problem involves the causal estimation of the location of a set of 3D landmarks in an unknown environment (mapping), in order to compute the pose of a sensor platform within this environment (localization), via the photometric measurements acquired by a camera, i.e., the 2D images [2]. Since the computational complexity of structure-from-motion techniques, such as [6], is deemed prohibitively high, the literature is dominated by extended Kalman filter (EKF) [2],[3] and particle filter (PF) [4] based approaches. The former utilizes an EKF to estimate the current state, defined as the pose and the map, using all past measurements [5]. The latter exploits the independence of the landmarks, given the trajectory, to decompose the SLAM problem into the estimation of the trajectory via a PF, and of the individual landmarks via EKFs [5].

    Since both approaches use the EKF, they share a common problem: the EKF assumes that the state distribution is Gaussian. The validity of this assumption, hence the success of the EKF in a particular application, critically depends on the linearity of the measurement function. However, the measurement function in monocular SLAM, the pinhole camera model [2], is known to be highly nonlinear for landmarks represented with the Cartesian parameterization (CP) [9], i.e., with their components along the 3 orthonormal axes corresponding to the 3 spatial dimensions. This is especially true for low-parallax configurations, which typically occur in the case of distant or newly initialized landmarks [9].

    A well-known solution to this problem is to employ an initialization stage, by using a particle filter [10] or a simplified linear measurement model [4], and then to


    switch to the CP. The IDP [7] further refines the initialization approach: it uses the actual measurement model, hence is more accurate than [4]; it is computationally less expensive than [10]; and it needs no special procedure to constrain the pose with the landmarks in the initialization stage, hence it is simpler than both [4] and [10]. Since it is more linear than the CP [7], it also offers a performance gain both in low- and high-parallax configurations. However, since the EKF is an O(N²) algorithm, the redundancy of the IDP limits its use to the low-parallax case.

    The main contribution of this work is to show that in PF-SLAM, the performance gain from the use of the IDP is almost free: PF-SLAM operates under the assumption that, for each particle, the trajectory is known. Therefore the pose-related components of the IDP should be removed from the state of the landmark EKF, leaving exactly the same number of components as the CP. Since this parameterization has no redundancy, and has better performance than the CP [7], its benefits can be enjoyed throughout the entire estimation procedure, not just during the landmark initialization.

    The organization of the paper is as follows: In the next section, the PF-SLAM system used in this study is presented. In Sect. 3, the application of the IDP to PF-SLAM is discussed and compared with [4]. The experimental results are presented in Sect. 4, and Sect. 5 concludes the paper.

    1.1 Notation

    Throughout the rest of the paper, a matrix and a vector are represented by an uppercase and a lowercase bold letter, respectively. A standalone lowercase italic letter denotes a scalar, and one with parentheses stands for a function. Finally, an uppercase italic letter corresponds to a set.

    2 A Monocular PF-SLAM System

    PF-SLAM maximizes the SLAM posterior over the entire trajectory of the camera and the map, i.e., the objective function is [5]

    p_PF = p(X, M | Z),    (1)

    where X and M denote the camera trajectory and the map estimate at the k-th time instant, respectively (in (1) the subscript k is suppressed for brevity). Z is the collection of measurements acquired until k.

    Equation 1 can be decomposed as [1]

    p_PF = p(X | Z) · p(M | X, Z).    (2)

    In PF-SLAM, the first term is evaluated by a PF, which generates a set of trajectory hypotheses. Then, for a given trajectory X^i, the second term can be expanded as [1]

    p(M^i | X^i, Z) = ∏_{j=1..ℓ} p(m^i_j | X^i, Z),    (3)


    where ℓ is the total number of landmarks, and M^i is the map estimate of the particle i, computed from X^i via EKF. Therefore, for a system with n particles, (2) is maximized by a particle filter and n·ℓ independent EKFs [1]. When a d-parameter landmark representation is employed, the computational complexity is O(n·ℓ·d²).

    In the system utilized in this work, X is the history of the pose and the rate-of-displacement estimates. Its k-th member is

    s_k = [c_k  q_k],   x_k = [s_k  t_k  w_k],    (4)

    where c_k and q_k denote the position of the camera center in 3D world coordinates and its orientation as a quaternion, respectively. Together, they form the pose s_k. t_k and w_k are the translational and rotational displacement terms, in terms of distance and angle covered in a single time unit.

    M is defined as a collection of 3D point landmarks, i.e.,

    M = {m_j}_{j=1..ℓ}.    (5)

    The state evolves with respect to the constant velocity model, defined as

    c_{k+1} = c_k + t_k
    q_{k+1} = q_k ⊗ q(w_k)
    t_{k+1} = t_k + v_t
    w_{k+1} = w_k + v_w,    (6)

    where q(·) is an operator that maps an Euler angle vector to a quaternion, and ⊗ is the quaternion product operation. v_t and v_w are two independent Gaussian noise processes with covariance matrices P_t and P_w, respectively.
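    For concreteness, the following is a minimal sketch of one propagation step of (6) for a single particle. It assumes unit quaternions stored in (w, x, y, z) order and a first-order (small-angle) Euler-to-quaternion map; all function and variable names are our own, not taken from the paper.

```python
import numpy as np

def quat_mult(a, b):
    # Hamilton product of quaternions given as (w, x, y, z).
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def euler_to_quat(w):
    # q(w): map an Euler angle vector to a quaternion; first-order
    # small-angle approximation, adequate for per-frame displacements.
    q = np.array([1.0, 0.5 * w[0], 0.5 * w[1], 0.5 * w[2]])
    return q / np.linalg.norm(q)

def propagate(c, q, t, w, Pt, Pw, rng):
    # One step of the constant velocity model (6) for one particle.
    c_new = c + t                           # position advances by t_k
    q_new = quat_mult(q, euler_to_quat(w))  # orientation advances by q(w_k)
    t_new = t + rng.multivariate_normal(np.zeros(3), Pt)  # diffuse velocities
    w_new = w + rng.multivariate_normal(np.zeros(3), Pw)
    return c_new, q_new, t_new, w_new
```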

    The measurement function projects a 3D point landmark to a 2D point feature on the image plane via the perspective projection equation [9], i.e.,

    [h_x  h_y  h_z]^T = r(q_k^{-1}) · [m_j − c_k]^T
    z_j = [ x̄ + α_x·h_x/h_z ,  ȳ + α_y·h_y/h_z ],    (7)

    where r(q) is an operator that yields the rotation matrix corresponding to a quaternion q, and z_j is the projection of the j-th landmark to the image plane. (x̄, ȳ) denotes the principal point of the camera, and (α_x, α_y) represents the focal-length-related scale factors.

    The implementation follows the FastSLAM 2.0 [1] adaptation described in [4]. In a cycle, first the particle poses are updated via (6). Then the measurement predictions and the associated search regions are constructed. After matching with normalized cross-correlation, the pose and the displacement estimates of all particles are updated with the measurements z_k. The quality of each particle is assessed by the observation likelihood function p(z_k | X^i, M^i), evaluated at z_k. The resampling stage utilizes this quality score to form the new particle set. Finally, for each particle X^i, the corresponding map M^i is updated


    with the measurements. The algorithm tries to maintain a certain number of active landmarks (i.e., landmarks that are in front of the camera and have their measurement predictions within the image), and uses FAST [8] to detect new landmarks to replace the lost ones. The addition and the removal operations are global, i.e., if a landmark is deleted, it is removed from the maps of all particles.
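    The cycle just described can be summarized in code. The following is a structural sketch only: match_features, resample, manage_landmarks, and the particle methods are hypothetical stand-ins for the steps named above, not the authors' implementation.

```python
def pf_slam_cycle(particles, image):
    # 1. Update particle poses via the constant velocity model (6).
    for p in particles:
        p.predict_pose()
    # 2. Construct measurement predictions and search regions, then match
    #    by normalized cross-correlation to obtain the measurements z_k.
    z_k = match_features(particles, image)
    # 3. Update pose and displacement estimates with z_k, and score each
    #    particle by the observation likelihood p(z_k | X^i, M^i).
    for p in particles:
        p.update_pose(z_k)
        p.weight = p.observation_likelihood(z_k)
    # 4. Resample the particle set according to the quality scores.
    particles = resample(particles)
    # 5. Update each particle's map: one small EKF per landmark.
    for p in particles:
        p.update_map(z_k)
    # 6. Maintain the set of active landmarks; deletions and additions
    #    (new FAST corners) are applied to the maps of all particles.
    manage_landmarks(particles, image)
    return particles
```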

    3 Inverse-Depth Parameterization and PF-SLAM

    The original IDP represents a 3D landmark m_3D as a point on the ray that joins the landmark and the camera center of the first camera in which the landmark is observed [9], i.e.,

    m_3D = c + (1/ρ)·n,    (8)

    where c is the camera center, n is the direction vector of the ray, and ρ is the inverse of the distance from c. n is parameterized by the azimuth and elevation angles of the ray, θ and φ, as

    n = [ cos φ · sin θ ,  −sin φ ,  cos φ · cos θ ],    (9)

    computed from the orientation of the first camera and the first 2D observation, q and u, respectively. The resulting 6-parameter representation, IDP6, is

    m_IDP6 = [ c  θ(u, q)  φ(u, q)  ρ ].    (10)

    This formulation, demonstrated to be superior to the CP [7], has two shortcomings. Firstly, it is a 6-parameter representation, hence its use in the EKF is computationally more expensive. Secondly, u and q are not directly represented, and their nonlinear relation to θ and φ [9] inevitably introduces an error.

    The latter issue can be remedied by a representation which deals with these hidden variables explicitly, i.e., a 10-component parameterization,

    m_IDP10 = [ c  q  u  ρ ].    (11)

    In this case, n is redefined as

    l = r(q) · [ (u_1 − x̄)/α_x ,  (u_2 − ȳ)/α_y ,  1 ]^T,
    n = l / ‖l‖.    (12)

    With these definitions, the likelihood of a landmark in a particle, i.e., the operand in (3), is

    p(m^i_j | X^i, Z) = p(s^i_j, u^i_j, ρ^i_j | X^i, Z).    (13)

    Consider a landmark m_j that is initiated at the time instant k − a, with a > 0. By definition, s^i_j is the pose hypothesis of the particle i at k − a, i.e., s^i_{k−a} (see (4)). Since, for a particle, the trajectory is given, this entity has no associated uncertainty, hence it is not updated by the landmark EKF. Therefore,

    s^i_j = s^i_{k−a},  s^i_{k−a} ∈ x^i_{k−a} ∈ X^i  ⟹  p(m^i_j | X^i, Z) = p(u^i_j, ρ^i_j | X^i, Z).    (14)


    In other words, the pose component of a landmark in a particle is a part of the trajectory hypothesis, and is fixed for a given particle. Therefore, it can be removed from the state vector of the landmark EKF. The resulting parameterization, IDP3, is

    m_IDP3 = [ u  ρ ].    (15)

    Since the linearity analysis of [7] involves only the derivatives of the inverse depth parameter, it applies to all parameterizations of the form (8). Therefore, IDP3 retains the performance advantage of IDP6 over the CP. As for the complexity, IDP3 and the CP differ only in their measurement functions and the corresponding Jacobians. Equation (7) can be evaluated in 58 floating point operations (FLOPs), whereas when (12) and (8) are substituted into (7), considering that some of the terms are fixed at instantiation, the increase is 13 FLOPs plus a square root. Similar figures apply to the Jacobians. To put the above into perspective, the rest of the state update equations of the EKF can be evaluated in roughly 160 FLOPs. Therefore, in PF-SLAM, the performance vs. computational cost trade-off that limits the application of IDP6 is effectively eliminated, there is no need for a dedicated initialization stage, and IDP3 can be utilized throughout the entire process. Besides, IDP3 involves no approximations over the CP; it only exploits a property of particle filters.
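    As an illustration of why IDP3 suffices, the sketch below predicts a landmark measurement by chaining (12), (8), and (7): the anchor pose (c, q) at initialization is a fixed per-particle constant, so the EKF state is only (u_1, u_2, ρ). The intrinsics tuple (α_x, α_y, x̄, ȳ) and all names are our own notation, assumed rather than taken from the paper.

```python
import numpy as np

def quat_to_rotmat(q):
    # Rotation matrix r(q) for a unit quaternion (w, x, y, z).
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def idp3_predict(state, anchor_c, anchor_q, cam_c, cam_q, K):
    # state: the 3-component EKF state [u1, u2, rho] of one landmark.
    # anchor_c, anchor_q: pose at landmark initialization (fixed per
    # particle, hence not part of the EKF state).
    # cam_c, cam_q: current camera pose; K = (ax, ay, xbar, ybar).
    u1, u2, rho = state
    ax, ay, xbar, ybar = K
    # Ray direction through the first observation, Eq. (12).
    l = quat_to_rotmat(anchor_q) @ np.array([(u1 - xbar) / ax,
                                             (u2 - ybar) / ay,
                                             1.0])
    n = l / np.linalg.norm(l)
    # Back-project to a 3D point, Eq. (8).
    m3d = anchor_c + n / rho
    # Project into the current camera, Eq. (7); r(q^-1) = r(q)^T.
    h = quat_to_rotmat(cam_q).T @ (m3d - cam_c)
    return np.array([xbar + ax * h[0] / h[2],
                     ybar + ay * h[1] / h[2]])
```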

    A similar 3-component parameterization is proposed in [4]. However, the authors employ it in an initialization stage, in which a simplified measurement function is utilized that assumes no rotation, and translation only along the directions orthogonal to the principal axis vector. This approximation yields a linear measurement function, and makes it possible to use a linear Kalman filter, a computationally less expensive scheme than the EKF. The approach proposed in this work, employing IDP3 exclusively, has the following advantages:

    1. IDP is utilized throughout the entire process, not only in the initialization.

    2. No separate landmark initialization stage is required, therefore the system architecture is simpler.

    3. The measurement function employed in [4] is valid only within a small neighborhood of the original pose [4]. The approximation not only adversely affects the performance, but also limits the duration in which a landmark may complete its initialization. However, the proposed approach uses the actual measurement equation, whose validity is not likewise limited.

    4 Experimental Results

    The performance of the proposed parameterization is assessed via a pose estimation task. For this purpose, a bed, which can translate and rotate a camera in two axes with a positional and angular precision of 0.48 mm and 0.001°, respectively, is used to acquire the images of two indoor scenes, with dimensions 4×2×3 meters, at a resolution of 640×480. In the sequence Line, the camera moves on a 63.5-cm long straight path, with a constant translational and angular displacement of 1.58 mm/frame and 0.0325°/frame, respectively.


    Fig. 1. Left: the bed used in the experiment to produce the ground-truth trajectory. Right top: the first and the last images of Line and Hardline. Right bottom: two images from the circle the camera traced in Circle.

    Hardline is derived from Line by randomly discarding 2/3 of the images, in order to obtain a nonconstant-velocity motion. The sequence Circle is acquired by a camera tracing a circle with a diameter of 73 cm (i.e., a circumference of 229 cm), moving at a displacement of 3.17 mm/frame. It is the most challenging sequence of the entire set, as, unlike Hardline, not only the horizontal and forward components of the displacement but also the direction changes. Figure 1 depicts the setup and two images from each of the sequences.

    The pose estimation task involves recovering the position and the orientation of the camera from the image sequences by using the PF-SLAM algorithm described in Sect. 2. Two map representations are compared: the exclusive use of the IDP3, and a hybrid CP-IDP3 scheme. The hybrid scheme involves converting an IDP3 landmark to the CP representation as soon as a measure of the linearity of the measurement function, the linearity index proposed in [7], goes below 0.1 [7]. At a given time, the system may have both CP and IDP3 landmarks in the maps of the particles, hence the name hybrid CP-IDP3. It is related to [4] in the sense that both use the same landmark representation for initialization; however, the hybrid CP-IDP3 employs the actual measurement model, hence is expected to perform better than [4]. Therefore, it is safe to state that the experiments compare IDP3 to an improved version of [4].

    In the experiments, the number of particles is set to 2500, and both algorithms try to maintain 30 landmarks. Although this may seem low given the capabilities of contemporary monocular SLAM systems, the main argument of this work is totally independent of the number of landmarks, and the authors believe that denser maps would not enhance the discussion.

    Two criteria are used for the evaluation of the results:

    1. Position error: square root of the mean square error between the ground truth and the estimated trajectory, in millimeters.

    2. Orientation error: the angle between the estimated and the actual normals to the image plane (i.e., the principal axis vectors), in degrees.

    The results are presented in Table 1 and Figs. 2-4.


    Table 1. Mean trajectory and principal axis errors

                                      Line             Hardline         Circle
    Criterion                      IDP3   CP-IDP3   IDP3   CP-IDP3   IDP3    CP-IDP3
    Mean trajectory error (mm)     7.58   11.25     8.15   12.46     22.66   39.87
    Principal axis error (degrees) 0.31   0.57      0.24   0.54      0.36    0.48

    The experimental results indicate that both schemes perform satisfactorily in Line and Hardline. The IDP3 performs slightly, but consistently, better in both position and orientation estimates, with an average position error below 1 cm. As for the orientation error, in both cases the IDP3 yields an error oscillating around 0.3°, whereas in the CP-IDP3 it grows towards 1° as the camera moves. However, in Circle, the performance difference is much more pronounced: the IDP3 can follow the circle, the true trajectory, much more closely than the CP-IDP3. The average and peak differences are approximately 1.7 and 4 cm, respectively. The final error in both algorithms is less than 2% of the total path length.

    The superiority of the IDP3 can be primarily attributed to two factors: the nonlinearity of (8), and the relatively high nonlinearity of (6) when m_j is represented via the CP instead of the IDP [9].

    The first issue affects the conversion from the CP to the IDP3. Since the transformation is nonlinear, the conversion of the uncertainty of an IDP landmark to the corresponding CP landmark is not error-free. The second problem, the relative nonlinearity, implies that the accumulation of the linearization errors occurs at a higher rate in a CP landmark than in an IDP landmark. Since the quality of the landmark estimates is reflected in the accuracy of the estimated pose [7], IDP3 performs better. The performance difference is not significant in Line (Fig. 4), a relatively easy sequence in which the constant translational and angular displacement assumptions are satisfied, as seen in Table 1. Although Hardline (Figs. 2, 3 and 4) is a more difficult sequence, the uncertainty in the translation component is still constrained to a line, and the PF can cope with the variations in the total displacement magnitude. Besides, it is probably somewhat short to illustrate the effects of the drift: the diverging orientation error observed in Figs. 2, 3 and 4 is likely to cause problems in a longer sequence. However, in Circle (Figs. 2, 3 and 4), there is a considerable performance gap. It is a sequence in which neither the direction nor the components of the displacement vector are constant. Therefore, the violation of the constant displacement assumption is the strongest among all sequences. Moreover, at certain parts of the sequence the camera motion has a substantial component along the principal axis vector of the camera, a case in which the nonlinear nature of (6) is accentuated. A closer study of Fig. 3 reveals that it is in these parts of the sequence, especially in the second half of the trajectory, that the IDP3 performs better than the CP-IDP3 scheme, due to its superior linearization.


    Fig. 2. Top view of the trajectory and the structure estimates. Left: Hardline. Right: Circle. G denotes the ground truth. Blue circles indicate the estimated landmarks.

    Fig. 3. Trajectory and orientation estimates for Hardline (top) and Circle (bottom). Left: trajectory. Right: orientation, i.e., the principal axis. In order to prevent cluttering, the orientation estimates are downsampled by 4.


    Fig. 4. Performance comparison of the IDP3 and the CP-IDP3 schemes. Top: Line. Middle: Hardline. Bottom: Circle. The left column is the Euclidean distance between the actual and the estimated trajectories. The right column is the angle between the actual and the estimated principal axis vectors.


    5 Conclusion

    The advantage the IDP offers over the CP, its relative amenability to linearization, comes at the price of reduced representation efficiency: the CP describes a landmark with the minimum number of components, whereas the IDP has redundant components. In this paper, the authors show that this is not the case in PF-SLAM, i.e., the IDP is effectively as efficient as the CP, by exploiting the fact that in a PF-SLAM system, for each particle, the trajectory is given, i.e., has no uncertainty; therefore, any pose-related parameters can be removed from the landmark EKFs. This allows the use of the IDP throughout the entire estimation procedure. In addition to reducing the linearization errors, this parameterization strategy removes the need for a separate feature initialization procedure, hence also reduces the system complexity, and eliminates the errors introduced in transferring the uncertainty from one parameterization to another. The experimental results demonstrate the superiority of the proposed approach to a hybrid CP-IDP scheme.

    References

    1. Montemerlo, M.: FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem. Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2003)

    2. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Analysis and Machine Intelligence 29(6), 1052-1067 (2007)

    3. Jin, H., Favaro, P., Soatto, S.: A Semi-Direct Approach to Structure from Motion. The Visual Computer 19(6), 377-394 (2003)

    4. Eade, E., Drummond, T.: Scalable Monocular SLAM. In: CVPR 2006, pp. 469-476 (2006)

    5. Durrant-Whyte, H., Bailey, T.: Simultaneous Localization and Mapping: Part I. IEEE Robotics and Automation Mag. 13(2), 99-110 (2006)

    6. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual Modeling with a Hand-Held Camera. Intl. J. Computer Vision 59(3), 207-232 (2004)

    7. Civera, J., Davison, A.J., Montiel, J.M.M.: Inverse Depth to Depth Conversion for Monocular SLAM. In: ICRA 2007, pp. 2778-2783 (2007)

    8. Rosten, E., Drummond, T.: Fusing Points and Lines for High Performance Tracking. In: ICCV 2005, pp. 1508-1515 (2005)

    9. Civera, J., Davison, A.J., Montiel, J.M.M.: Unified Inverse Depth Parameterization for Monocular SLAM. In: RSS 2006 (2006)

    10. Davison, A.J.: Real-Time Simultaneous Localization and Mapping with a Single Camera. In: ICCV 2003, vol. 2, pp. 1403-1410 (2003)

  • An Efficient Linear Method for the Estimation of Ego-Motion from Optical Flow

    Florian Raudies and Heiko Neumann

    Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany

    Abstract. Approaches to visual navigation, e.g. used in robotics, require computationally efficient, numerically stable, and robust methods for the estimation of ego-motion. One of the main problems for ego-motion estimation is the segregation of the translational and rotational components of ego-motion in order to utilize the translational component, e.g., for computing the spatial navigation direction. Most of the existing methods solve this segregation task by means of formulating a nonlinear optimization problem. One exception is the subspace method, a well-known linear method, which applies a computationally high-cost singular value decomposition (SVD). In order to be computationally efficient, a novel linear method for the segregation of translation and rotation is introduced. For robust estimation of ego-motion, the new method is integrated into the Random Sample Consensus (RANSAC) algorithm. Different scenarios show the perspectives of the new method compared to existing approaches.

    1 Motivation

    For many applications, visual navigation and ego-motion estimation are of prime importance. Here, processing starts with the estimation of optical flow using a monocular spatio-temporal image sequence as input, followed by the estimation of ego-motion.

    Optical flow fields generated by ego-motion of the observer become more complex if one or multiple objects move independently of the ego-motion. A challenging task is to segregate such independently moving objects (IMOs), for which MacLean et al. proposed a combination of ego-motion estimation and the Expectation Maximization (EM) algorithm [15]. With this algorithm, a single motion model is estimated for the ego-motion and for each IMO using the subspace method [9]. A key functionality of the subspace method is the possibility to cluster ego-motion and motion of IMOs. More robust approaches assume noisy flow estimates besides IMOs when estimating ego-motion with the EM algorithm [16,5]. Generally, the EM algorithm uses an iterative computational scheme, and in each iteration the evaluation of the method estimating ego-motion is required. This necessitates a computationally highly efficient algorithm for the estimation of ego-motion in real-time applications. So far, many of the ego-motion algorithms introduced in the past lack this property of computational efficiency.


    Bruss and Horn derived a bilinear constraint to estimate ego-motion by utilizing a quadratic Euclidean metric to calculate errors between input flow and model flow [3]. The method is linear w.r.t. either translation or rotation, and independent of depth. This bilinear constraint was used throughout the last two decades for ego-motion estimation: (i) Heeger and Jepson built their subspace method upon this bilinear constraint [9]. (ii) Chiuso et al. used a fixed-point iteration to optimize between rotation (based on the bilinear constraint), depth, and translation [4], and Pauwels and Van Hulle used the same iteration mechanism optimizing for rotation and translation (both based on the bilinear constraint) [16]. (iii) Zhang and Tomasi as well as Pauwels and Van Hulle used a Gauss-Newton iteration between rotation, depth, and translation [20,17]. In detail, the method (i) needs a singular value decomposition, and the methods of (ii) and (iii) need iterative optimization techniques.

    Here, a novel linear approach for the estimation of ego-motion is presented. Our approach utilizes the bilinear constraint, the basis of many nonlinear methods. Unlike these previous methods, a linear formulation is achieved here by introducing auxiliary variables. In turn, with this linear formulation a computationally efficient method is defined.

    Section 2 gives a formal description of the instantaneous optical flow model. This model serves as the basis to derive our method, outlined in Section 3. An evaluation of the new method in different scenarios and in comparison to existing approaches is given in Section 4. Finally, Section 5 discusses our method in the context of existing approaches [3,9,11,20,16,18], and Section 6 gives a conclusion.

    2 Model of Instantaneous Ego-Motion

    Von Helmholtz and Gibson introduced the definition of optical flow as moving patterns of light falling upon the retina [10,8]. Following this definition, Longuet-Higgins and Prazdny gave a formal description of optical flow which is based on a model of instantaneous ego-motion [13]. In their description they used a pinhole camera with focal length f which projects 3-d points (X, Y, Z) onto the 2-d image plane, formally (x, y) = f/Z · (X, Y). Ego-motion composed of the translation T = (t_x, t_y, t_z)^t and rotation R = (r_x, r_y, r_z)^t causes the 3-d instantaneous displacement

    (Ẋ, Ẏ, Ż)^t = −(t_x, t_y, t_z)^t − (r_x, r_y, r_z)^t × (X, Y, Z)^t,

    where dots denote the first temporal derivative and t the transpose operator. Using this model, movements of projected points on the 2-d image plane have the velocity

    V := (u, v)^t = (1/Z) · ( −f  0  x ;  0  −f  y ) · T
         + (1/f) · ( xy  −(f² + x²)  fy ;  (f² + y²)  −xy  −fx ) · R,    (1)

    where ";" separates matrix rows.
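    For reference, a direct transcription of Equation 1, useful for generating model flow from given depth and motion. Pixel coordinates are assumed to be relative to the principal point, and the function name is hypothetical.

```python
import numpy as np

def model_flow(x, y, Z, f, T, R):
    # Instantaneous image velocity (u, v) of Eq. (1) at image point (x, y)
    # with depth Z, focal length f, translation T and rotation R (3-vectors).
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])
    B = np.array([[x*y,       -(f*f + x*x),  f*y],
                  [f*f + y*y, -x*y,         -f*x]]) / f
    return A @ T / Z + B @ R
```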

    3 Linear Method for Ego-Motion Estimation

    Input flow, e.g., estimated from a spatio-temporal image sequence, is denoted by V, while the model flow is defined as in Equation 1. Now, the problem is to find


    parameters of the model flow which describe the given flow V best. Namely, these parameters are the scenic depth Z, the translation T, and the rotation R. Based on Equation 1, many researchers studied non-linear optimization problems to estimate ego-motion [3,20,4,18]. Moreover, most of these methods have a statistical bias, which means that the methods produce systematic errors considering isotropic noisy input [14,16]. Unlike these approaches, we suggest a new linearized form based on Equation 1 and show how to solve this form computationally efficiently with a new method. Further, this method can be unbiased. The new method is derived in three consecutive steps: (i) the algebraic transformation of Equation 1 which is independent of the depth Z, (ii) a formulation of an optimization problem for translation and auxiliary variables, and (iii) the removal of a statistical bias. The calculation of the rotation R with the translation T known is then a simple problem.

    Depth independent constraint equation. Bruss and Horn formulated an optimization problem with respect to the depth Z which optimizes the squared Euclidean distance of the residual vector between the input flow vector V = (u, v)^t and the model flow vector defined in Equation 1. Inserting the optimized depth into Equation 1, they derived the so-called bilinear optimization constraint. An algebraic transformation of this constraint is

    0 = (t_x, t_y, t_z) · ( M − H·R ),    (2)

    with the abbreviations

    M := ( f·v, −f·u, y·u − x·v )^t,

    H := ( (f² + y²)   −xy         −fx
           −xy         (f² + x²)   −fy
           −fx         −fy         (x² + y²) ),

    which Heeger and Jepson describe during the course of their subspace construction. In detail, they use a subspace which is orthogonal to the base polynomial defined by the entries of the matrix H(x_i, y_i), i = 1..m, where m denotes the finite number of constraints employed [9].

    Optimization of translation. Only a linearly independent part of the base polynomial H is used for optimization. We chose the upper triangular part together with the diagonal of the matrix H. These entries are summarized in the vector E := ( (f² + y²), −xy, −fx, (f² + x²), −fy, (x² + y²) )^t. To achieve a linear form of Equation 2, the auxiliary variables K := (t_x r_x, t_x r_y, t_x r_z, t_y r_y, t_y r_z, t_z r_z)^t are introduced. With respect to E and K, the linear optimization problem

    F(V; T, K(T)) := ∫_Ω [ T^t M + K^t E ]² dx  →  min over T and K(T),    (3)

    is defined, integrating constraints over all locations x = (x, y) ∈ Ω ⊂ ℝ² of the image plane. This image plane is assumed to be continuous and finite. Calculating the partial derivatives of F(V; T, K(T)) and equating them to zero leads to the linear system of equations


    0 = ∫_Ω [ T^t M + K^t E ] · E^t dx,    (4)

    0 = ∫_Ω [ T^t M + K^t E ] · [ M + ∂(K^t E)/∂T ]^t dx,    (5)

    consisting of nine equations and nine variables in K and T. Solving Equation 4 with respect to K, and inserting the result as well as the partial derivative for the argument T of the expression K into Equation 5, results in the homogeneous linear system of equations

    0 = T^t ∫_Ω L_i L_j dx =: T^t C,   i, j = 1..3,   with    (6)

    L_i := M_i − (D·E)^t ∫_Ω E·M_i dx,   i = 1..3,   and   D := [ ∫_Ω E·E^t dx ]^(−1) ∈ ℝ^(6×6).

    A robust (non-trivial) solution for such a system is given by the eigenvector which corresponds to the smallest eigenvalue of the 3×3 scatter matrix C [3].

    Removal of statistical bias. All methods which are based on the bilinear constraint given in Equation 2 are statistically biased [9,11,14,18]. To calculate this bias we define an isotropic noisy input by the vector Ṽ := (u, v) + (n_u, n_v), with components n_u and n_v ~ N(μ = 0, σ) normally distributed. A statistical bias is inferred by studying the expectation value ⟨·⟩ of the scatter matrix C̃. This scatter matrix is defined by inserting the noisy input flow Ṽ into Equation 6. This gives ⟨C̃⟩ = ⟨C⟩ + σ²N with

    N = ( f²         0          −f·⟨x⟩
          0          f²         −f·⟨y⟩
          −f·⟨x⟩    −f·⟨y⟩     ⟨x² + y²⟩ ),    (7)

    using the properties ⟨n_u⟩ = ⟨n_v⟩ = 0 and ⟨n_u²⟩ = ⟨n_v²⟩ = σ². Several procedures to remove the bias term σ²N have been proposed. For example, Kanatani suggested a method of renormalization, subtracting the bias term on the basis of an estimate of σ² [11]. Heeger and Jepson used dithered constraint vectors and defined a roughly isotropic covariance matrix with these vectors. MacLean used a transformation of constraints into a space where the influence of noise is isotropic [14]. Here, the last approach is used, due to its computational efficiency. In a nutshell, to solve Equation 6 considering noisy input, we calculate the eigenvector which corresponds to the smallest eigenvalue of the matrix C̃. Pre-whitening of the scatter matrix C̃ gives C̃' := N^(−1/2) C̃ N^(−1/2). Then the influence of noise is isotropic, namely σ²I, where I denotes a 3×3 unity matrix. The newly defined eigenvalue problem C̃'x = (λ + σ²)x preserves the ordering of λ and the eigenvectors N^(1/2)x compared to the former eigenvalue problem C̃x = λx. The solution is then constructed with the eigenvector of the matrix C̃' which corresponds to the smallest eigenvalue. Finally, this eigenvector has to be multiplied by N^(−1/2).
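    Collecting the pieces of this section, the following is a compact sketch of the resulting estimator under our own assumptions: the integrals of (3)-(6) are approximated by sums over flow samples, expectation values by sample means, and the overall scale of N is irrelevant because only eigenvectors are needed. All names are hypothetical.

```python
import numpy as np

def sqrtm_spd(A):
    # Square root of a symmetric positive-definite matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(vals)) @ vecs.T

def estimate_translation(x, y, u, v, f):
    # x, y: image coordinates relative to the principal point (1-d arrays);
    # u, v: measured optical flow components; f: focal length.
    # Per-sample constraint vectors M and base-polynomial entries E (Eq. 2).
    M = np.stack([f * v, -f * u, y * u - x * v], axis=1)        # m x 3
    E = np.stack([f*f + y*y, -x*y, -f*x,
                  f*f + x*x, -f*y, x*x + y*y], axis=1)          # m x 6
    # Eliminate the auxiliary variables K (Eqs. 4-6): L = M - E D (E^t M).
    D = np.linalg.inv(E.T @ E)                                  # 6 x 6
    L = M - E @ (D @ (E.T @ M))                                 # m x 3
    C = L.T @ L                                                 # 3 x 3 scatter matrix
    # Bias matrix N (Eq. 7); pre-whiten so that isotropic flow noise
    # shifts all eigenvalues of C by the same amount.
    N = np.array([[f*f, 0.0, -f*np.mean(x)],
                  [0.0, f*f, -f*np.mean(y)],
                  [-f*np.mean(x), -f*np.mean(y), np.mean(x*x + y*y)]])
    Nw = np.linalg.inv(sqrtm_spd(N))                            # N^(-1/2)
    Cw = Nw @ C @ Nw
    vals, vecs = np.linalg.eigh(Cw)                             # ascending eigenvalues
    T = Nw @ vecs[:, 0]   # smallest-eigenvalue eigenvector, mapped back by N^(-1/2)
    return T / np.linalg.norm(T)
```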


    4 Results

    To test the proposed method for ego-motion estimation in different configurations, we use two sequences, the Yosemite sequence¹ and the Fountain sequence². In the Yosemite sequence a flight through a valley is simulated, specified by T = (0, 0.17, 0.98)·34.8 px/frame and R = (1.33, 9.31, 1.62)·10⁻² deg/frame [9]. In the Fountain sequence a curvilinear motion with T = (0.6446, 0.2179, 2.4056) and R = (0.125, 0.20, 0.125) deg/frame is performed. The (virtual) camera employed to gather the images has a vertical field of view of 40 deg and a resolution of 316×252 for the Yosemite sequence and 320×240 for the Fountain sequence. All methods included in our investigation have a statistical bias, which is removed with the technique of MacLean [14]. The iterative method of Pauwels and Van Hulle [18] employs a fixed-point iteration mechanism using a maximal number of 500 iterations and 15 initial values for the translation direction, randomly distributed on the positive hemisphere [18].

    Numerical stability. To show numerical stability, we use the scenic depth of the Fountain sequence (5th frame) at a quarter of the full resolution to test different ego-motions. These ego-motions are uniformly distributed in the range of 40 deg azimuth and elevation in the positive hemisphere. Rotational components for pitch and yaw are calculated by fixating the central point and compensating translation by rotation. An additional roll component of 1 deg/frame is superimposed. With the scenic depth values and ego-motion given, optical flow is calculated by Equation 1. This optical flow is systematically manipulated by applying two different noise models: a Gaussian and an outlier noise model. The Gaussian noise model was specified earlier, in Section 3. In the outlier noise model, a percentage ρ of all flow vectors is replaced by randomly constructed vectors. Each component of such a vector is drawn from a uniformly distributed random variable whose interval is defined by the negative and positive of the mean length of all flow vectors. Outlier noise models sparsely distributed gross errors, e.g., caused by incorrectly estimated correspondences. Applying a noise model to the input flow, the estimation of ego-motion becomes erroneous. These errors are reported by: (i) the angular difference between the estimated and the ground-truth 3-d translation vectors, and (ii) the absolute value of the difference for each rotational component, again calculated between estimate and ground truth. Figure 1 shows the errors of ego-motion estimation under the Gaussian and the outlier noise model. All methods show numerical stability, with a mean translational error lower than approximately 6 deg for both noise models. The method of Pauwels and Van Hulle performs best compared to the other methods; its better performance is presumably achieved by the numerical fixed-point iteration with different initial values randomly chosen within the search space.

¹ Available via anonymous ftp from ftp.csd.uwo.ca in the directory pub/vision.
² Provided at http://www.informatik.uni-ulm.de/ni/mitarbeiter/FRaudies.


[Fig. 1, four panels a)-d): a) mean of angular error [deg] over Gaussian noise [%]; b) std of angular error [deg] over Gaussian noise [%]; c) mean of angular error [deg] over outlier noise [%]; d) std of angular error [deg] over outlier noise [%]. Legend: proposed method; Kanatani, 1993; Pauwels & Van Hulle, 2006.]

Fig. 1. All methods employed show numerical stability in the presence of noise, with small translational and rotational errors (the latter not shown). In detail, a) shows the mean angular error for Gaussian noise and c) for outlier noise. Graphs in b) and d) show the corresponding standard deviations. The noise parameter is specified with respect to the image height. Mean and standard deviation are calculated over 50 trials.

Table 1. Errors for estimated and ground-truth input flow of the proposed method. In case of the Yosemite sequence, which contains the independently moving cloudy sky, the RANSAC paradigm is employed, which improves ego-motion estimates (50 trials, mean and standard deviation shown). ∠(T_est, T_gt) denotes the angle calculated between the estimated and ground-truth 3-d translational vectors.

                             translation             rotation
sequence                     ∠(T_est, T_gt) [deg]    |Δr_x| [deg]   |Δr_y| [deg]   |Δr_z| [deg]

estimated optical flow; Brox et al. [2]; 100% density
Fountain                     4.395                   0.001645       0.0286         0.02101
Yosemite                     4.893                   0.02012        0.1187         0.1153

estimated optical flow; Farnebaeck [6]; 100% density
Fountain                     6.841                   0.01521        0.05089        0.025
Yosemite                     4.834                   0.03922        0.00393        0.07636

estimated optical flow; Farnebaeck [6]; 25% density
Fountain                     1.542                   0.0008952      0.01349        0.003637
Yosemite                     1.208                   0.007888       0.01178        0.02633
Yosemite (RANSAC), mean      1.134                   0.01261        0.008485       0.02849
Yosemite (RANSAC), std       0.2618                  0.002088       0.002389       0.003714

ground-truth optical flow; 25% of full resolution
Fountain                     0.0676                  0.000259       8.624e-006     0.0007189
Yosemite                     5.625                   0.02613        0.1092         0.06062
Yosemite (RANSAC), mean      1.116                   0.01075        0.004865       0.02256
Yosemite (RANSAC), std       1.119                   0.01021        0.006396       0.009565

Estimated optical flow as input. We test our method on the basis of input optical flow estimated by two different methods.

First, we utilize the tensor-based method of Farnebaeck together with an affine motion model [6] to estimate optical flow. The spatio-temporal tensor is constructed by projecting the input signal onto a set of base polynomials of finite Gaussian support (σ = 1.6 px and length l = 11 px, with a weighting parameter of 1/256). Spatial averaging of the resulting components of the tensor is performed with a Gaussian filter (σ = 6.5 px and l = 41 px).

Second, optical flow is estimated with the affine warping technique of Brox et al. [2]. Here, we implemented the 2-d version of the algorithm and used the following parameter values in the notation of [2]: α = 200, γ = 100, ε = 0.001, and 0.5 and 0.95 for the pre-smoothing and downsampling scale parameters; 77 outer fix-point iterations and 10 inner fix-point iterations are performed. To solve the partial differential equations, the numerical method of Successive Over-Relaxation (SOR) with relaxation parameter ω = 1.8 and 5 iterations is applied.
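As an illustration of the last step, a textbook SOR sweep for a generic linear system Ax = b with ω = 1.8 is sketched below; this is the basic relaxation scheme referred to, not the authors' particular discretisation of the Euler-Lagrange equations.

import numpy as np

def sor_solve(A, b, omega=1.8, iterations=5):
    """Successive Over-Relaxation for A x = b (Gauss-Seidel sweep with
    relaxation factor omega), starting from x = 0."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(iterations):
        for i in range(len(b)):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (1.0 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
    return x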

Errors of optical flow estimation are reported by the 3-d mean angular error defined by Barron and Fleet [1]. According to this angular error, optical flow is estimated for frame pair 8-9 (starting to count from index 0) of the Yosemite sequence with 5.41 deg accuracy for the method of Farnebaeck and with 3.54 deg for the method of Brox. In case of frame pair 5-6 of the Fountain sequence the mean angular error is 2.49 deg when estimating flow with Farnebaeck's method and 2.54 deg for the method of Brox. All errors refer to a density of 100% for the optical flow data.

Table 1 lists errors of ego-motion estimation for different scenarios. Comparing the first two parts of the table, we conclude that a high accuracy of optical flow estimates does not necessarily yield a high accuracy in the estimation of ego-motion. In detail, the error of ego-motion estimation depends on the error characteristic (spatial distribution and magnitude of errors) within the estimated optical flow field. However, this characteristic is not expressed by the mean angular error. One way to reduce the dependency on the error characteristic is to reduce the data set, leaving out the most erroneous data points. Generally, this requires (i) an appropriate confidence measure to evaluate the validity or reliability of data points, and (ii) a strategy to avoid linear dependency in the resulting data w.r.t. ego-motion estimation. Farnebaeck describes how to calculate a confidence value in his thesis [6]. Here, this confidence is used to thin out flow estimates; we retain 25% of all estimates, enough to avoid linear dependency for our configurations. Errors of ego-motion estimation are then reduced, as can be observed in the third part of Table 1. In case of the Yosemite sequence, sparsification has a helpful side effect. The cloud motion is estimated by the method of Farnebaeck with low accuracy and confidence; thus, no estimates originating from the cloudy sky are contained in the data set for the estimation of ego-motion. In the last part of Table 1, ground-truth optical flow is utilized to estimate ego-motion. In this case, the cloudy sky is present in the data set and thus deflects estimates of ego-motion; e.g., the translational angular error amounts to 5.6 deg. To handle IMOs we use the RANSAC algorithm [7]. In a nutshell, the idea of the algorithm is to base the estimate on non-erroneous data points only. Therefore, initial estimates are computed on different randomly selected subsets of all data points, which are then enlarged by further non-erroneous data points. The algorithm stops if an estimate is found that is supported by a data set of a certain cardinality. For the ground-truth flow of the Yosemite sequence, this method is successful in estimating ego-motion; the translational angular error now amounts to 1.116 deg (mean value).
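A generic RANSAC skeleton in the spirit of Fischler and Bolles [7] is sketched below; estimate_fn and residual_fn are placeholders for an ego-motion estimator and its per-flow-vector residual, and the stopping rule mirrors the cardinality criterion described above.

import numpy as np

def ransac(data, estimate_fn, residual_fn, sample_size, threshold,
           min_consensus, max_trials=100, rng=None):
    """Generic RANSAC: fit models on random minimal subsets, grow a
    consensus set of inliers, stop once it is large enough."""
    rng = rng or np.random.default_rng(0)
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(max_trials):
        sample = data[rng.choice(len(data), size=sample_size, replace=False)]
        model = estimate_fn(sample)
        inliers = residual_fn(model, data) < threshold
        if inliers.sum() >= min_consensus:
            return estimate_fn(data[inliers]), inliers  # refit on consensus set
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    return best_model, best_inliers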


    5 Discussion

A novel linear optimization method was derived to solve the segregation of the translational and rotational component, one of the main problems in computational ego-motion estimation [3,9,13].

Related work. A well-known linear method for ego-motion estimation is the subspace method [9]. Unlike our method, a subspace independent of the rotational part was used by Heeger and Jepson for the estimation of translation, using only m − 6 of m constraints. In the method proposed here, all constraints are used, which leads to more robust estimates. Zhuang et al. formulated a linear method for the segregation of translation and rotation employing the instantaneous motion model together with the epipolar constraint [21]. They introduced auxiliary variables, as a superposition of translation and rotation, then optimized w.r.t. these variables and translation. In a last step they reconstructed rotation from the auxiliary variables. Unlike their method, we used the bilinear constraint for optimization, defined auxiliary variables differently, split up the optimization for rotation and translation, and finally had to solve only a 3×3 eigenvalue problem for translation estimation, instead of the 9×9 eigenvalue problem of Zhuang's approach. Moreover, this different optimization strategy allowed us to incorporate the method of MacLean to remove a statistical bias, which is not possible for the method of Zhuang.

Complexity. To achieve real-time capability in applications, a low computational complexity is vital. Existing methods for the estimation of ego-motion have a higher complexity than our method (compare with Table 2). For example, [9] employs a singular value decomposition of an m × 6 matrix, or iterative methods are employed to solve nonlinear optimization problems [4,18,20]. Comparable to our method in terms of computational complexity is the method of Kanatani [11]. Unlike our approach, this method is based on the epipolar constraint.

Numerical stability. We showed that the optimization method is robust against noise, compared to other ego-motion algorithms [11,18]. Furthermore, the technique of pre-whitening is applied to our method to remove a statistical bias

Table 2. Average (1000 trials) computing times [msec] of methods estimating ego-motion, tested with a C++ implementation on a Windows XP platform, Intel Core 2 Duo T9300. (*) This algorithm employs a maximal number of 500 iterations and 15 initial values.

method                                      number of vectors
                                            25      225     2025    20164    80089

new proposed method (unbiased)              0.05    0.06    0.34    4.56     22.16
Kanatani (unbiased)                         0.03    0.11    0.78    7.56     29.20
Heeger & Jepson (unbiased)                  0.08    2.44    399.20  n.a.     n.a.
Pauwels & Van Hulle, 2006 (unbiased) (*)    0.16    0.81    6.90    66.87    272.95


as well. This technique was proposed by MacLean [14] for bias removal in the subspace algorithm of Heeger and Jepson [9], and by Pauwels and Van Hulle for the fix-point iteration, which iterates between coupled estimates for translation and rotation of ego-motion [17]. Unlike other unbiasing techniques, MacLean's technique needs neither an estimate of the noise characteristic nor an iterative mechanism. With the statistical bias removed, methods are consistent in the sense of Zhang and Tomasi's definition of consistency [20].

Outlier detection. To detect outliers in ego-motion estimation, in particular IMOs, several methods have been suggested, namely frameworks employing the EM algorithm [15,5], the collinear point constraint [12], and the RANSAC algorithm [19]. In accordance with the conclusion of Torr's thesis [19], which found that the RANSAC algorithm performs best in motion segmentation and outlier detection, we chose RANSAC to achieve robust ego-motion estimation.

    6 Conclusion

In summary, we have introduced a novel method for the separation of translation and rotation in the computation of ego-motion. Due to its simplicity the method has a very low computational complexity and is thus faster than existing estimation techniques (Table 2). First, we tested our method with a computed optical flow field, where ego-motion can be estimated exactly. Under noisy conditions, results show the numerical stability of the optimization method and its comparability with existing methods for the estimation of ego-motion. In more realistic scenarios utilizing estimated optical flow, ego-motion can be estimated with high accuracy. Future work will employ temporal integration of ego-motion estimates within the processing of image sequences. This should stabilize ego-motion and optical flow estimation by exploiting the spatio-temporal coherence of the visually observable world.

    Acknowledgements

Stefan Ringbauer kindly provided a computer graphics ray-tracer utilized to generate images and ground-truth flow for the Fountain sequence. This research has been supported by a scholarship given to F.R. from the Graduate School of Mathematical Analysis of Evolution, Information and Complexity at Ulm University.

    References

1. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. Int. J. of Comp. Vis. 12(1), 43–77 (1994)
2. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
3. Bruss, A.R., Horn, B.K.P.: Passive navigation. Comp. Vis., Graph., and Im. Proc. 21, 3–20 (1983)
4. Chiuso, A., Brockett, R., Soatto, S.: Optimal structure from motion: Local ambiguities and global estimates. Int. J. of Comp. Vis. 39(3), 195–228 (2000)
5. Clauss, M., Bayerl, P., Neumann, H.: Segmentation of independently moving objects using a maximum-likelihood principle. In: Lafrenz, R., Avrutin, V., Levi, P., Schanz, M. (eds.) Autonome Mobile Systeme 2005, Informatik Aktuell, pp. 81–87. Springer, Berlin (2005)
6. Farnebaeck, G.: Polynomial expansion for orientation and motion estimation. PhD thesis, Dept. of Electrical Engineering, Linkoepings universitet (2002)
7. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24(6), 381–395 (1981)
8. Gibson, J.J.: The Perception of the Visual World. Houghton Mifflin, Boston (1950)
9. Heeger, D.J., Jepson, A.D.: Subspace methods for recovering rigid motion I: Algorithm and implementation. Int. J. of Comp. Vis. 7(2), 95–117 (1992)
10. Helmholtz, H.: Treatise on Physiological Optics. Southall, J.P. (ed.) (1925)
11. Kanatani, K.: 3-d interpretation of optical-flow by renormalization. Int. J. of Comp. Vis. 11(3), 267–282 (1993)
12. Lobo, N.V., Tsotsos, J.K.: Computing ego-motion and detecting independent motion from image motion using collinear points. Comp. Vis. and Img. Underst. 64(1), 21–52 (1996)
13. Longuet-Higgins, H.C., Prazdny, K.: The interpretation of a moving retinal image. Proc. of the Royal Soc. of London. Series B, Biol. Sci. 208(1173), 385–397 (1980)
14. MacLean, W.J.: Removal of translation bias when using subspace methods. IEEE Int. Conf. on Comp. Vis. 2, 753–758 (1999)
15. MacLean, W.J., Jepson, A.D., Frecker, R.C.: Recovery of egomotion and segmentation of independent object motion using the EM algorithm. Brit. Mach. Vis. Conf. 1, 175–184 (1994)
16. Pauwels, K., Van Hulle, M.M.: Segmenting independently moving objects from ego-motion flow fields. In: Proc. of the Early Cognitive Vision Workshop (ECOVISION 2004), Isle of Skye, Scotland (2004)
17. Pauwels, K., Van Hulle, M.M.: Robust instantaneous rigid motion estimation. Proc. of Comp. Vis. and Pat. Rec. 2, 980–985 (2005)
18. Pauwels, K., Van Hulle, M.M.: Optimal instantaneous rigid motion estimation insensitive to local minima. Comp. Vis. and Im. Underst. 104(1), 77–86 (2006)
19. Torr, P.H.S.: Outlier Detection and Motion Segmentation. PhD thesis, Engineering Dept., University of Oxford (1995)
20. Zhang, T., Tomasi, C.: Fast, robust, and consistent camera motion estimation. Proc. of Comp. Vis. and Pat. Rec. 1, 164–170 (1999)
21. Zhuang, X., Huang, T.S., Ahuja, N., Haralick, R.M.: A simplified linear optic flow-motion algorithm. Comp. Graph. and Img. Proc. 42, 334–344 (1988)

Localised Mixture Models in Region-Based Tracking

Christian Schmaltz¹, Bodo Rosenhahn², Thomas Brox³, and Joachim Weickert¹

¹ Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Building E1 1, Saarland University, 66041 Saarbrücken, Germany
{schmaltz,weickert}@mia.uni-saarland.de
² Leibniz Universität Hannover, 30167 Hannover, Germany
[email protected]
³ University of California, Berkeley, CA 94720, USA
[email protected]

Abstract. An important problem in many computer vision tasks is the separation of an object from its background. One common strategy is to estimate appearance models of the object and background region. However, if the appearance is spatially varying, simple homogeneous models are often inaccurate. Gaussian mixture models can take multimodal distributions into account, yet they still neglect the positional information. In this paper, we propose localised mixture models (LMMs) and evaluate this idea in the scope of model-based tracking by automatically partitioning the fore- and background into several subregions. In contrast to background subtraction methods, this approach also allows for moving backgrounds. Experiments with a rigid object and the HumanEva-II benchmark show that tracking is remarkably stabilised by the new model.

    1 Introduction

In many image processing tasks such as object segmentation or tracking, it is necessary to distinguish between the region of interest (foreground) and its background. Common approaches, such as MRFs or active contours, build appearance models of both regions with their parameters being learnt either from a-priori data or from the images [1,2,3]. Various types of features can be used to build the appearance model. Most common are brightness and colour, but any dense feature set such as texture descriptors [4] or motion [5] can be part of the model.

Apart from the considered features, the statistical model of the region is of large interest. In simple cases, one assumes a Gaussian distribution in each region. However, since object regions usually change their appearance locally, such a Gaussian model is too inaccurate. A typical example is the black and white stripes of a zebra, which lead to a Gaussian distribution with a grayish mean

We gratefully acknowledge funding by the German Research Foundation (DFG) under the project We 2602/5-1.




Fig. 1. Left: Illustrative examples of situations where object (to be further specified by a shape prior) and background region are not well modelled by identically distributed pixels. In (a), red points are more likely in the background. Thus, the hooves of the giraffe will not be classified correctly. In (b), the dark hair and parts of the body are more likely to belong to the background. Localised distributions can model these cases more accurately. Right: Object model used by the tracker in one of our experiments (c) and decomposition of the object model into three different components (d), as proposed by the automatic splitting algorithm from [6]. There are 22 joint angles in the model, resulting in a total of 28 parameters that must be estimated.

that describes neither the black nor the white part very well. In order to deal with such cases, Gaussian mixture models or kernel density models have been proposed. These models are much more general, yet still impose the assumption of identically distributed pixels in each region, i.e., they ignore positional information. The left part of Fig. 1 shows two examples where this is insufficient.

In contrast, a model which is sensitive to the location in the image was proposed in [7]. The region statistics are estimated for each point separately, thereby considering only information from the local neighbourhood. Consequently, the distribution varies smoothly within a region. A similar local statistical model was used in [8]. A drawback of this model is that it blurs across discontinuities inside the region. As the support of the neighbourhood needs to be sufficiently large to reliably estimate the parameters of local distributions, this blurring can be quite significant. This is especially true when using local kernel density models, which require more data than a local Gaussian model.

The basic idea in the present paper is to segment the regions into subregions inside which a statistical model can be estimated. Similar to the above local region statistics, the distribution model integrates positional information. The support for estimating the distribution parameters is usually much larger, though, as it considers all pixels from the subregion. Splitting the background into subregions and employing a kernel density estimator in each of those allows for a very precise region model relying on enough data for parameter estimation.

Related to this concept are Gaussian mixture models in the context of background subtraction. Here, the mixture parameters are not estimated in a spatial neighbourhood but from data along the temporal axis. This leads to models which include very accurate positional information [9]. In [10], an overview of several possible background models ranging from very simple to complex models is given. The learned statistics from such models can also be combined with a conventional spatially global model as proposed in [11]. For background subtraction, however, the parameters are learned in advance, i.e., a background image or images with little motion and without the object must be available. Such limitations are not present in our approach. In fact, our experiments show that background subtraction and the proposed localised mixture model (LMM) are in some sense complementary and can be combined to improve results in tracking. Also note that, in contrast to image labelling approaches that also split the background into different regions, such as [12], no learning step is necessary.

A general problem that arises when making statistical models more and more precise is the increasing number of local optima in the corresponding cost functions. In Fig. 1 there is actually no reason to assign the red hooves to the giraffe region or the black hair to the person. A shape prior and/or a close initialisation of the contour is required to properly define the object segmentation problem. For this reason we focus in this paper on the field of model-based tracking, where both a shape model and a good initial separation into foreground and background can be derived from the previous frame. In particular, we evaluated the model in silhouette-based 3-D pose tracking, where pose and deformation parameters of a 3-D object model are estimated such that the image is optimally split into object and background [13,6]. The model is generally applicable to any other contour-based tracking method as well. Another possible field of application is semi-supervised segmentation, where the user can incrementally improve the segmentation by manually specifying some parts of the image as foreground or background [1]. This can resolve the above ambiguities as well.

Our paper is organised as follows: We first review the pose tracking approach used for evaluation. We then explain the localised mixture model (LMM) in Section 3. While the basic approach only works with static background images, we remove this restriction later in a more general approach. After presentation of our experimental data in Section 4, the paper is concluded in Section 5.

2 Foreground-Background Separation in Region-Based Pose Tracking

In this paper, we focus on tracking an articulated free-form surface consisting of rigid parts interconnected by predefined joints. The state vector consists of the global pose parameters (3-D shift and rotation) as well as n joint angles, similar to [14]. The surface model is divided into l different (not necessarily connected) components M_i, i = 1, . . . , l, as illustrated in Fig. 1. The components are chosen such that each component has a uniform appearance that differs from other components, as proposed in [6]. There are many more tracking approaches than the one presented here. We refer to the surveys [15,16] for an overview.

Given an initial pose, the primary goal is to adapt the state vector such that the projections of the object parts lead to maximally homogeneous regions in the image. This is stated by the following cost function, which is sought to be minimised in each frame:



Fig. 2. Example of a background segmentation. From left to right: (a) Background image. (b,c) K-means clustering with three and six clusters. (d,e) Level set segmentation with two different parameter settings.

E(ξ) = − Σ_{i=0}^{l} ∫_Ω v_i(ξ, x) P_{i,ξ}(x) log p_{i,ξ}(x) dx,   (1)

where Ω denotes the image domain. The appearance of each component i and of the background (i = 0) is modelled by a probability density function (PDF) p_i, i ∈ {0, . . . , l}. The PDFs of the object parts are modelled as kernel densities, whereas we will use the LMM for modelling the background, as explained later.

P_{i,ξ} is the indicator function for the projection of the i-th component M_i, i.e., P_{i,ξ}(x) is 1 if and only if a part of the object with pose ξ is projected to the image point x. In order to take occlusion into account, v_i(ξ, x) : R^{6+n} × Ω → {0, 1} is a visibility function that is 1 if and only if the i-th object part is not occluded by another part of the object in the given pose. Visibility can be computed efficiently using OpenGL.
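On a discrete pixel grid the integral in (1) becomes a sum over pixels. A minimal numpy sketch, assuming the visibility maps, projection indicators, and per-component PDF values have already been rasterised into image-sized arrays (the array names are ours):

import numpy as np

def tracking_energy(v, P, p, eps=1e-12):
    """Discrete version of the cost function (1).

    v, P, p : arrays of shape (l+1, H, W) holding, for each component i,
    the visibility map v_i, the projection indicator P_i, and the value
    p_i of the pixel's feature under component i's appearance model."""
    return -np.sum(v * P * np.log(p + eps))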

The cost function can be minimised locally by a modified gradient descent. The PDFs are evaluated at silhouette points x_i of each projected model component. These points x_i are then moved along the normal direction of the projected object, either towards or away from the components, depending on which region's PDF fits better at that particular point. The point motion is transferred to the corresponding change of the state vector by using a point-based pose estimation algorithm as described, e.g., in [7].

    3 Localised Mixture Models

In the above approach, the object region is very accurately described by the object model, which is split into various parts that are similar in their appearance. Hence, the local change of appearance within the object region is taken well into account. The background region, however, consists of a single appearance model, and positional changes of this appearance have so far been neglected.

Consider a red-haired person who is facing the camera while standing on a red carpet. Then, only a very small part of the person is red, compared to a large part of the background. As a larger percentage of pixels lying outside the person are red, red pixels will be classified as belonging to the outside region. Thus, the hair will be considered as not being part of the object, which deteriorates tracking. This happens despite the fact that the carpet is far away from the hair.


The idea to circumvent this problem is to separate the background into multiple subregions, each of which is modelled by its own PDF. This can be regarded as a mixture of PDFs, yet the mixture components exploit the positional information telling where the separate mixture components are to be applied.

    3.1 Case I: Static Background Image Available

If a static background image is available, segmenting the background is quite simple. In contrast to the top-level task of object-background separation, the regions need not necessarily correspond to objects in the scene. Hence, virtually any multi-region segmentation technique can be applied for this. We tested a very simple one, the K-means algorithm [17,18], and a more sophisticated level set based segmentation, which considers multiple scales and includes a smoothness prior on the contour [19]. In the K-means algorithm the number of clusters is fixed, whereas the level set approach optimises the number of regions by a homogeneity criterion, which is steered by a tuning parameter. Thus, the number of subregions can vary.
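As a simple illustration of the first option, a compact Lloyd-style K-means over the colours of a background image is sketched below (a minimal version of our own, not the cited implementations [17,18]); it returns the label image L(x, y) used in the following.

import numpy as np

def kmeans_labels(image, k=3, iters=20, rng=None):
    """Cluster the pixel colours of a background image (H, W, 3) into k
    subregions and return the (H, W) label image L(x, y)."""
    rng = rng or np.random.default_rng(0)
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    centres = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest cluster centre.
        d = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Re-estimate the centres as the mean colour of their members.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(image.shape[:2])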

Fig. 2 compares the segmentation output of these two methods for two different parameter settings. The results with the level set method are much smoother due to the boundary length constraint. In contrast, the regions computed with K-means have fuzzier boundaries. This can be disadvantageous, particularly when the localisation of the model is not precise due to a moving background, as considered in the next section.

After splitting the background image into subregions, a localised PDF can be assembled from the PDFs estimated in each subregion Ω_j. Let L(x, y) denote the labelling obtained by the segmentation; we obtain the density

p(x, y, s) = p_{L(x,y)}(s),   (2)

where s is any feature used for tracking. It makes most sense to use the same density model for the subregions as used in the segmentation method. In case of K-means this means that we have a Gaussian distribution with fixed variance:

p_j^{kmeans}(s) ∝ exp(−(s − μ_j)² / (2σ²)),   (3)

where μ_j is the cluster centre of cluster j. The level set based segmentation method is built upon a kernel density estimator

p_j^{levelset}(s) = (1 / |Ω_j|) · K_σ ∗ Σ_{(x,y)∈Ω_j} δ(s − I(x, y)),   (4)

where δ is the Dirac delta distribution and K_σ is a Gaussian kernel with standard deviation σ. Here, we use σ = √30. The PDF in (2) can simply be plugged into the energy in (1). Note that this PDF needs to be estimated only once for the background image and then stays fixed, whereas the PDFs of the object parts are reestimated in each frame to account for the changing appearance.
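A sketch of how (2) and (4) fit together for a scalar feature s (e.g., a single colour channel), with σ = √30 as above; the function and variable names are ours.

import numpy as np

def build_region_kdes(feature, labels, sigma=np.sqrt(30.0)):
    """One Gaussian kernel density (4) per subregion of the labelling.
    feature : (H, W) array of the scalar feature value per pixel."""
    kdes = {}
    for j in np.unique(labels):
        samples = feature[labels == j].astype(float)
        def pdf(s, samples=samples):
            k = np.exp(-0.5 * ((s - samples) / sigma) ** 2)
            return k.sum() / (len(samples) * sigma * np.sqrt(2.0 * np.pi))
        kdes[j] = pdf
    return kdes

def localised_pdf(x, y, s, labels, kdes):
    """Equation (2): evaluate the density of the subregion containing (x, y)."""
    return kdes[labels[y, x]](s)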


    3.2 Case II: Potentially Varying Background

For some scenarios, generating a static background image is not possible. In outdoor scenarios, for example, the background usually changes due to moving plants or people passing by. Even inside buildings, the lighting conditions, and thus the background, typically vary. Furthermore, the background could vary due to camera motion. In fact, varying backgrounds can appear in many applications and render background subtraction methods impossible.

In general, the background changes only slowly between two consecutive frames. This can be exploited to extend the described approach to non-static images or to images where the object is already present. Excluding the current object region from the image domain, the remainder of the image can be segmented as before. This is shown in Fig. 5. To further deal with slow changes in the background, the segmentation can also be recomputed in each new frame. This takes changes in the localisation or in the statistics into account.

A subtle difficulty appearing in this case is that there may be parts of the background not available in the density model because these areas were occluded by the object in the previous frame. When reestimating the pose parameters of the object model, a previously occluded part can appear and needs some treatment. In such a case we choose the nearest available neighbour and use the probability density of the corresponding subregion. That is, if Ω_j is the j-th subregion as computed by the segmentation step, the local mixture density is:

p(x, y, s) = p_{ĵ(x,y)}(s)   with   ĵ(x, y) = argmin_j dist((x, y), Ω_j).   (5)
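Equation (5) can be realised efficiently with a distance transform on the label image. The scipy sketch below assigns every formerly occluded pixel (marked with a negative label, an assumed convention) the label of the nearest segmented background pixel.

import numpy as np
from scipy import ndimage

def fill_occluded_labels(labels):
    """Equation (5): give each unlabelled pixel (label < 0) the label of
    the nearest pixel belonging to a segmented background subregion."""
    unknown = labels < 0
    # Coordinates of the nearest labelled pixel for every position.
    _, (iy, ix) = ndimage.distance_transform_edt(unknown, return_indices=True)
    return labels[iy, ix]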

    4 Experiments

We evaluated the described region statistics on sequence S4 of the HumanEva-II benchmark [20]. For this sequence, a total of four views as well as static background images are available. Thus, this sequence allows us to compare the variant that uses a static background image to the version without the need for such an image. The sequence shows a man walking in a circle for approximately 370 frames, followed by a jogging part from frame 370 to 780, and finally a balancing part until frame 1200. Ground-truth marker data is available for this sequence, and tracking errors can be evaluated via an online interface provided by Brown University. Note that the ground-truth data between frames 299 and 334 is not available; thus this part is ignored in the evaluation. In the figures, we plotted a linear interpolation between frames 298 and 335.

Table 1 shows some statistics over tracking results with different models. The first line in the table shows an experiment in which background subtraction was used to find approximate silhouettes of the person to be tracked. These silhouette images are used as additional features, i.e., in addition to the three channels of the CIELAB colour space, for computing the PDFs of the different regions. This approach corresponds to the one in [6]. Results are improved when using the LMM based on level set segmentation. This can be seen by comparing the first



Fig. 3. PDFs estimated for the CIELAB colour channels of the subregions shown in Fi