
ISSN 1054-6618, Pattern Recognition and Image Analysis, 2009, Vol. 19, No. 1, pp. 165–171. © Pleiades Publishing, Ltd., 2009.

Solving the Multi Object Occlusion Problem in a Multiple Camera Tracking System

M. Mozerov^{a,b}, A. Amato^a, X. Roca^a, and J. González^a

^a Computer Vision Center and Department d'Informática, Universitat Autónoma de Barcelona, 08193 Cerdanyola, Spain
e-mail: [email protected]

^b Digital Optics Laboratory, Institute for Problems of Information Transmission, Russian Academy of Sciences, per. Bol'shoi Karetnyi 19, Moscow, 101447 Russia
e-mail: [email protected]

Abstract—An efficient method to overcome the adverse effects of occlusion upon object tracking is presented. The method is based on matching paths of objects in time and solves a complex occlusion-caused problem of merging separate segments of the same path.

DOI: 10.1134/S105466180901026X

Received February 27, 2008

1. INTRODUCTION

Movement analysis of humans in 3D scenes still presents a great challenge to computer vision systems [1, 2], and person tracking plays an important part in such analysis. Different approaches to solving this problem can be found in the literature [3–5]. Tracking can be significantly complicated by multiple uncontrollable factors, and breaking of paths in time is especially hard to foresee. For example, a cognitive system designed to analyze human behavior will not be able to accurately interpret visual events unless the movement of each person is continuously followed. In other words, every person must have its own track. It is well known that the problem of tracking can be solved more accurately when the scene is observed from several vantage points. In this paper we show how ambiguities caused by occlusion can be resolved by employing a multi-camera visual system.

Handling occlusions is intricate, and this problem is considered in many papers [6–8]. Most often it is treated as a problem of classification. Zhou and Aggarwal reduce the problem of tracking to that of identifying and matching elliptic object segments in time [9]. This requires determining the pose of the video cameras. Conventional approaches to camera calibration mostly rely on linear methods [10–12]. In this work, we find it more suitable to calibrate cameras using just three points, which affords greater flexibility in the case of camera image rotation and scaling [8]. Therefore, camera calibration can be accomplished in two independent steps: (i) determining the coordinates of three points in the reference system of the camera to be calibrated; and (ii) computing the rotation matrix R and the translation vector T.

This paper is concerned primarily with long occlusions, which are often handled by resorting to pattern recognition and other complex methods of object identification [14]. For a multi-camera system, it can be naturally assumed that a moving object is viewed by at least one camera. Therefore, there is no need to employ computationally intricate recognition methods. Another advantage of this approach is that the process of tracking is given a formal description, which can be applied to an arbitrary scene. The problem of tracking is stated as the optimization of a clearly defined criterion function expressed in terms of distances between path points in the scene plane.

2. LOCAL TRACKING AND COLOR SEGMENTATION

The first step of tracking is image segmentation. This is accomplished using stable background subtraction, described in detail elsewhere [15]. The result of segmentation is pictured in Fig. 1b. After the segmentation step, the position and the size of an ellipse fitted to the segmented object are computed. In this work, the bottom-most point of such an ellipse is called a ground point. For example, Fig. 1b shows two ground points with image coordinates u_1 and u_2. If a segment has no history (i.e., no close ellipse exists in the previous frame), the parameter values of the new object represented by an ellipse are initialized by referring to the traits of the corresponding segment. All previously created ellipse-objects are matched against the newly obtained segments even if this gives rise to partially or completely occluded objects. This approach allows circumventing the ambiguity caused by associating one segment with several objects. Suppose that T frames t ∈ {1, …, T} are considered and let N(t), n ∈ {1, …, N(t)}, be the number of ground points found in each frame. In this case the local (in time) algorithm creates a set of points u_n(t), and each point must be assigned to one of the paths.


Note that the points u_n(t) are projections of real-world points and, to be able to measure separations between these points, the obtained image points have to be inversely projected onto the scene plane in the real world.
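As an illustration of this step, here is a small Python/NumPy sketch (our own construction, not the paper's code; the moment-based ellipse fit and the function name are assumptions) that fits an ellipse to one segmented blob and returns the bottom-most point of that ellipse as the ground point.

```python
import numpy as np

def ground_point_from_mask(mask):
    """Fit an ellipse to a binary blob via its second-order moments and
    return the bottom-most point of the ellipse as the ground point (u, v).
    Image rows grow downwards, so "bottom-most" means maximal row index."""
    vs, us = np.nonzero(mask)                       # pixel coordinates of the blob
    pts = np.stack([us, vs], axis=1).astype(float)
    center = pts.mean(axis=0)

    # Second central moments give the ellipse orientation and axis lengths.
    cov = np.cov((pts - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    a, b = 2.0 * np.sqrt(eigvals)                   # semi-axes (minor a, major b)
    phi = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])  # angle of the major axis

    # Lowest ellipse point: maximize v(t) = v_c + b*cos(t)*sin(phi) + a*sin(t)*cos(phi).
    A, B = b * np.sin(phi), a * np.cos(phi)
    r = np.hypot(A, B)
    u_bottom = center[0] + (b**2 - a**2) * np.sin(phi) * np.cos(phi) / max(r, 1e-9)
    v_bottom = center[1] + r
    return np.array([u_bottom, v_bottom])
```

Applying such a routine to every labeled foreground component of frame t would yield the points u_n(t) referred to above; matching them against the ellipses of the previous frame gives each point its object history.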

3. DATA TRANSFORMATION

In order to be able to transform coordinates in images captured from different vantage points to a global reference frame associated with the scene plane, the cameras have to be calibrated. In our geometric inferences, we use a perspective camera model explained in Fig. 2. The reference frame of the camera is represented by (p_0, X_c, Y_c, Z_c), where p_0 is its origin (coinciding with the optical center of the video camera); the Z_c axis is aligned with the camera optical axis but is directed backwards. The subscripts c, w, and n are interpreted as follows: "c" and "w" indicate that the vector is given in the camera reference frame and the reference frame of the 3D scene, respectively; and n denotes the index of a vector. The superscript t in this section of the paper designates matrix transposition, and T denotes a displacement vector. A point p_n of a scene can be represented, in the camera reference frame, as p_n = [x_n, y_n, z_n]^t and, in the world reference frame (O_w, X_w, Y_w, Z_w), as p'_n = [x'_n, y'_n, z'_n]^t. The projection of a 3D point onto a discrete image plane can be represented as u_n = [ϕi_n, ϕj_n, 1]^t, where ϕ = 1/f_pix is an intrinsic camera parameter reciprocal to the focal length expressed in pixels. Pixel grid indices i and j belong to the domain I × J: i ∈ {−I/2, …, 0, …, I/2}, j ∈ {−J/2, …, 0, …, J/2}, where I and J are the vertical and horizontal dimensions of the CCD matrix. The 3D and image coordinates of a point, p_n and u_n, are related by

p_n = Z_n u_n.   (1)

The value of the camera parameter ϕ in this relationship is supposed to be known in advance or previously estimated. Therefore, according to (1), to obtain the 3D coordinates of all visible points, it suffices to determine the depth Z_n of each point. On the other hand, the matrix of transformation from the camera reference frame to that of the scene can be obtained once the 3D coordinates of three noncollinear points are known in the scene reference frame. Therefore, camera calibration can be accomplished in two steps: (1) determining the depths of the vertices of the calibration triangle; and (2) calculating the rotation matrix R and the translation vector T that relate the reference frames of the camera and the 3D scene. The depth values z_1, z_2, z_3 of the calibration triangle are determined in this work by the method described in [13]. Once the depth values are uniquely determined, the matrix R and the vector T can be found. Given a rotation matrix and a translation vector, the transformation between coordinates associated with the camera and the visual scene can be written as

p_n = R p'_n + T.   (2)

First of all, let us introduce a new reference frame for the visual scene. Let its origin coincide with one of the apexes of our triangle, O_w = p_1, and let one of its sides be the X axis of the new coordinate system. Let the triangle lie in the plane defined by z = 0. Then the basis of this reference frame is determined by the equations

X_w = (p_3 − p_1)/|p_3 − p_1|;
Z_w = X_w × (p_2 − p_1)/|X_w × (p_2 − p_1)|;   (3)
Y_w = Z_w × X_w.


Fig. 1. (a) The source image and (b) the result of its segmentation with ground points u_1 and u_2.


Fig. 2. A perspective camera model.


Now the rotation matrix and the translation vector can be expressed in terms of the new basis as

R = [X_w Y_w Z_w];   T = p_1.   (4)

Let (i, j) be a point of interest identified by the tracking algorithm in an image captured by one of the cameras (in our case, it is a ground point of a tracked object). In order to be able to match it to other points of interest found in images shot by other cameras, this point has to be projected back onto the plane of the visual scene. This procedure is often called the inverse perspective projection. In other words, we need a function p'(i, j) that projects an arbitrary point (i, j) in the image plane onto the real plane of the scene. First, we find the coordinates of the optical center of the camera in the scene reference frame:

p'_0 = R^t(p_0 − T) = −R^t T.   (5)

A pure rotation can be written as

ũ(i, j) = R^t u(i, j) = R^t [ϕi, ϕj, −1]^t,   (6)

and so the function we seek is given by

p'(i, j) = p'_0 − ũ(i, j) z_0/z_ij,   (7)

where z_0 and z_ij denote the z components of p'_0 and ũ(i, j), respectively.

We suppose that the observed scene is a plane. Therefore, inverse projections of the corresponding points in images of two cameras must align. The result of inverse projection is shown in Figs. 3a and 3b. As a test of calibration accuracy, the images taken by two different cameras were projected onto the scene plane (Figs. 1a, 3b) and superimposed to find out how well they matched. One can see from Fig. 3c that details of both images are indeed aligned for an arbitrarily chosen boundary.
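To make the geometry of Eqs. (3)–(7) concrete, the following Python/NumPy sketch (our own illustration; the function and variable names are assumptions, not code from the paper) builds the world basis and the pair (R, T) from the three calibration points given in the camera frame and then back-projects a pixel (i, j) onto the z = 0 scene plane.

```python
import numpy as np

def world_frame_from_triangle(p1, p2, p3):
    """Eqs. (3)-(4): build the rotation matrix R and translation vector T
    from three non-collinear scene points expressed in the camera frame."""
    x_w = (p3 - p1) / np.linalg.norm(p3 - p1)
    z_w = np.cross(x_w, p2 - p1)
    z_w /= np.linalg.norm(z_w)
    y_w = np.cross(z_w, x_w)
    R = np.column_stack([x_w, y_w, z_w])   # R = [X_w Y_w Z_w], Eq. (4)
    T = p1                                  # T = p_1, Eq. (4)
    return R, T

def inverse_projection(i, j, R, T, phi):
    """Eqs. (5)-(7): project pixel indices (i, j) onto the z = 0 scene plane."""
    p0_w = -R.T @ T                                   # camera center in world frame, Eq. (5)
    u_w = R.T @ np.array([phi * i, phi * j, -1.0])    # rotated viewing ray, Eq. (6)
    return p0_w - u_w * (p0_w[2] / u_w[2])            # intersection with z = 0, Eq. (7)
```

In a hypothetical use, ground points from two cameras, once mapped by inverse_projection with each camera's own R, T, and ϕ, share one ground-plane coordinate system and can be compared directly.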

4. FORMING PATHS OF OBJECTS

Consider a scenario involving the movement of two objects, as illustrated in Fig. 4. We are now ready to formulate the two main issues of occlusion disambiguation: (i) which of the two possible paths, [A, A'] or [A, B'], belongs to one object? and (ii) which of the two objects, A or B, is visible during occlusion over the path segment [C, D]?

The information contained in image sequences captured by just one camera (Fig. 4a) is not sufficient to answer these questions. Indeed, when an object is occluded by another object (segment [C, D]), both points (of objects A and B) become virtual, which means they are logically inseparable and their distance in the image plane is zero. However, information present in images taken by the second camera (Fig. 4b) provides the needed clue. Indeed, there remain just two cases when the points of the two objects cannot be distinguished: (a) when both pairs of points of the two interacting objects are virtual; and (b) when one pair of points is virtual in one image and, in the next image, another pair of points becomes virtual. Even for systems with just two cameras, such situations are exceptional, and so the data missing for the segment [C', C] can be recovered. It is clear that, by virtue of having the second camera, we are able to furnish an answer to the second question: by referring to the coordinates of the visible pair of points (camera 2 in Fig. 4b), it can be computed which object is closer to camera 1 and is, thereby, the occluder.

Let the system consist of k ∈ {1, …, K} cameras O_k and suppose that sequences of T images I_{t,k} captured by each of the cameras over the period t ∈ {1, …, T} are analyzed. The local tracking module produces a set of ground points

Q_0 = ⋃_{t,k,n ∈ Ω_0} p_{k,n}^t;   (8)


Fig. 3. (a) Inverse projection of an image captured by camera 1. (b) Inverse projection of the image captured by camera 2. (c) A composite image obtained as a superposition of the two inverse projections.


where p is a point of the scene, and its subscript and superscript domain Ω_0 is defined as follows: k ∈ {1, …, K}, t ∈ {1, …, T}, and n ∈ {1, …, N_0(k, t)}, where N_0(k, t) is the number of ground points obtained for the corresponding image taken at time t.
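For concreteness, one simple way to hold this set in code is a mapping keyed by the triple (t, k, n); the minimal Python sketch below is our own illustration (the names and structure are assumptions) and matches the layout assumed by the path-forming sketch at the end of this section.

```python
from typing import Dict, Tuple
import numpy as np

# Q0[(t, k, n)]: scene-plane ground point produced by the local tracker
# for frame t, camera k, and detection index n (the domain Omega_0 above).
GroundPointSet = Dict[Tuple[int, int, int], np.ndarray]

def points_at_frame(Q0: GroundPointSet, t: int) -> GroundPointSet:
    """Return the subset of Q0 observed at frame t, over all cameras."""
    return {key: p for key, p in Q0.items() if key[0] == t}
```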

Because not all parts of the scene are necessarily visible to all cameras, the number of points obtained at time t is not a direct function of the overall number of cameras K in the system and the number of observed objects. Therefore it would be convenient to supplement the already obtained set of ground points with virtual ones in order to make the total number proportional to the number of cameras:

Q = ⋃_{t,k,n ∈ Ω} p_{k,n}^t;   (9)

where the definition of the domain Ω is similar to that of Ω_0 except that the number N_0(k, t) is augmented by the number Ñ(k, t) of the added virtual points. The local algorithm determines almost all virtual points (by way of example, refer to point p̃_{1,3} in Fig. 5) except those outside the visual field of one of the cameras. In this case we use a simple algorithm and define the position of the virtual point as the average of the nearest visible points. For example, the virtual point p̃_{3,3} in Fig. 5 is generated as the average of the two points p_{1,2} and p_{2,3} seen by their respective cameras. Now the overall number of points in hand becomes proportional to the constant K of the system,

N(t) = N_0(k, t) + Ñ(k, t).   (10)

It follows that the set of points Q^t at a time t consists of K × N(t) points p^t,

Q^t = ⋃_{k ∈ K, n ∈ N(t)} p_{k,n}^t.   (11)

We are going to define the operation of selection represented by an arbitrary matrix a_s^t, taking on one of S(t) possible values depending on its subscript s ∈ {1, …, S(t)}. The size of this matrix at time t equals K × N(t), and it is used to retrieve K-point sets P_s^t from the set Q^t consisting of K × N(t) points:

P_s^t = a_s^t ∘ Q^t,   (12)

where the symbol [∘] denotes the selection operation

a ∘ Q = ⋃_{k ∈ K, n ∈ N} (a_{k,n} ∩ p_{k,n}).   (13)

Elements a_{k,n} of the selection matrix are 0 or 1 (TRUE or FALSE), but in every row there is just one nonzero value. It follows that the total number of all such matrices a_s^t is N(t)^K. One consequence of this definition is that any selected set P_s^t consists of K points with different indices k.



Fig. 4. (a) Path points as seen by one camera; (b) superposition of points of the same path obtained from different cameras.


Fig. 5. Unidentified points.


Moreover, if the separation between the points of P_s^t is minimal over all possible selections, then it would be reasonable to assume that this selection represents path points of one object. Therefore the path of an object O_s can be represented as a sequence of selections

O_s = ⋃_{t ∈ T} P_s^t.   (14)

In order to work out a path generation procedure, we introduce a composite distance consisting of two terms: an interframe distance

Δ_s^t = |P_s^t − P_s^{t−1}| = Σ_{k,l ∈ K} Σ_{n ∈ N(t), m ∈ N(t−1)} a_{k,n,s}^t a_{l,m,s}^{t−1} |p_{k,n}^t − p_{l,m}^{t−1}|,   (15)

and an intraframe distance D_s^t for the given selection P_s^t,

D_s^t = Σ_{k,l ∈ K} Σ_{n,m ∈ N(t)} a_{k,n,s}^t a_{l,m,s}^t |p_{k,n}^t − p_{l,m}^t|.   (16)

It can be reasonably assumed that the sum of the composite distances (both (15) and (16)) over time is minimal when the selected set of points belongs to one object. On these grounds, the following optimization problem can be formulated: find a sequence of selection matrices a_s^t such that the criterion function

⋃_{τ_s^0 ≤ t ≤ τ_s^f} a_s^t = argmin_{a_s^t} Σ_{t ∈ [τ]_s} (D_s^t + Δ_s^t)   (17)

attains a minimum. Here [τ]_s is a shorthand notation for the interval [τ_s^0, τ_s^f] of the path O_s (τ_s^0 and τ_s^f denote, respectively, the first and the last frames of the sequence).

This problem can be solved by the recursion

⋃_{t ≤ τ} a_s^t = (⋃_{t ≤ τ−1} a_s^t) ∪ a_s^τ,   (18)

where

a_s^τ = argmin_{a_s^τ} (D_s^τ + Δ_s^τ).   (19)

The solution is obtained by direct evaluation of all 2N(t)^K possible values taken on by the function D_s^t + Δ_s^t, and the paths are expressed in terms of the local solutions as

O_s = ⋃_{t ∈ [τ]_s} a_s^t ∘ Q^t.   (20)

In this work, a simple rule is used to determine the interval [τ_s^0, τ_s^f]. First, the constant τ_s^0 is determined as the frame number where the number s occurred for the first time. If any of the points obtained at this moment are not assigned to already existing paths, then the index of this frame becomes the first element of the


new path O_{s+1}. The constant τ_s^f for a path O_s is computed by comparing the distance D_s^t with a preset threshold D_thr. If D_thr ≤ D_s^t, then τ_s^f = t.
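Read as pseudocode, the recursion (18)–(19) is a per-frame greedy search: for each path, every admissible selection for the current frame is scored with D + Δ and the minimizer is appended. The Python sketch below is our own simplified illustration (not the authors' implementation); it assumes the candidate points of frame t are given per camera, with virtual points already filled in, and encodes a selection simply as one point index per camera.

```python
import itertools
import numpy as np

def intra_frame_distance(selection, pts_t):
    """Eq. (16): sum of pairwise distances between the K selected points."""
    chosen = [pts_t[k][n] for k, n in enumerate(selection)]
    return sum(np.linalg.norm(p - q) for p in chosen for q in chosen)

def inter_frame_distance(selection, prev_selection, pts_t, pts_prev):
    """Eq. (15): distances between the current and the previous selection."""
    cur = [pts_t[k][n] for k, n in enumerate(selection)]
    prev = [pts_prev[k][n] for k, n in enumerate(prev_selection)]
    return sum(np.linalg.norm(p - q) for p in cur for q in prev)

def extend_path(prev_selection, pts_t, pts_prev):
    """Eq. (19): pick the selection of frame t that minimizes D + Delta.

    pts_t[k] lists the candidate scene-plane points of camera k at frame t;
    a selection assigns one candidate index to every camera."""
    all_selections = itertools.product(*(range(len(c)) for c in pts_t))
    return min(
        all_selections,
        key=lambda sel: intra_frame_distance(sel, pts_t)
        + inter_frame_distance(sel, prev_selection, pts_t, pts_prev),
    )
```

Running extend_path frame by frame from τ_s^0 onwards, and closing the path once D_s^t reaches D_thr as described above, mirrors the greedy evaluation of (18)–(19) in spirit.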

5. RESULTS OF EXPERIMENTS

Our method was tested on two benchmark databases: HERMES and PETS. Two frames of a sequence from HERMES are shown in Fig. 6. This database contains sequences with different object motion scenarios, captured from different vantage points.

To expose the potential of the outlined approach, a typical situation featuring a long occlusion was chosen for testing. In this sequence, both objects remain visible to the first camera (Fig. 6a) throughout the entire tracking interval but, in the field of view of the second camera, one of the objects occludes the other over a certain period, as shown in Fig. 6b. The task is to divide the ground points (black points in Fig. 7c) into two paths. The result of clusterization is shown in Fig. 7c, where points of different shapes represent the two different paths. Comparison of the obtained paths with the ground truth shows good alignment between the visible paths and the ground-truth data.


Fig. 6. Images taken by (a) camera 1 and (b) camera 2.


Another advantage of the algorithm is its ability to reconstruct the invisible part of the path by means of virtual points, as illustrated in Figs. 7d–7f. The obtained selection of path points in the field covered by camera 2 includes one such segment (Fig. 7e). In order to check the actual visibility of points, the algorithm compares the coordinates of points in the visible segment by applying the ordering criterion described in Section 4 of the paper. The outcome of such a comparison is shown in Fig. 7f, where the segment [C, D] is invisible.

Fig. 7. (a–c) Ground-truth paths of two persons: (a) camera 1; (b) camera 2; (c) superposition of paths visible from different vantage points. (d–f) Paths of persons in space-time coordinates: (d) camera 1; (e) camera 2; and (f) superposition of paths obtained from the two cameras.

In order to ensure adequate operation of the algorithm, the value of the threshold D_thr introduced in Section 4 must be correctly chosen. The maximum calibration error was found to be 12 cm, which is smaller than the dynamic error of the algorithm. The maximum segmentation error was 43 cm, which exceeds the variation of the dynamic error of the algorithm, equal to 19 cm; on these grounds, the threshold D_thr was set to 43 cm.

6. CONCLUSION

An algorithm to match paths of observed objects was presented. Results of simulation tests show that object paths can be efficiently tracked even in the presence of occlusion.

ACKNOWLEDGMENTS

This work was supported by the European Community grant no. IST-027110 (HERMES) and by the Spanish Ramon y Cajal program.

REFERENCES

1. L. Wang, W. Hu, and T. Tan, "Recent Developments in Human Motion Analysis," Pattern Recognition 36, 585 (2003).

2. T. B. Moeslund, A. Hilton, and W. Krüger, "A Survey of Advances in Vision-Based Human Motion Capture and Analysis," Int. J. of Computer Vision and Image Understanding 104, 90 (2006).

3. T. I. Haritaoglu, D. Harwood, and L. S. Davies, "W4: Real-Time Surveillance of People and Their Activities," IEEE Trans. PAMI 22, 809 (2000).

4. C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. PAMI 22, 747 (2000).

5. S. Park and J. K. Aggarwal, "A Hierarchical Bayesian Network for Event Recognition of Human Actions and Interactions," ACM Journal on Multimedia Systems 10, 164 (2004).

6. W. Hu, T. Tan, L. Wang, and S. Maybank, "A Survey on Visual Surveillance of Object Motion and Behaviors," IEEE Trans. SMC 34, 334 (2004).

7. P. Guha, A. Mukerjee, and K. S. Venkatesh, "Efficient Occlusion Handling for Multiple Agent Tracking by Reasoning with Surveillance Primitives," in Proceedings of the Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, October 2005, pp. 49–56 (2005).



8. P. Peursum, K. S. Venkatesh, and G. West, "Observation-Switching Linear Dynamic Systems for Tracking Humans through Unexpected Partial Occlusions by Scene Objects," in Proc. ICPR, 2006, Vol. 6, pp. 929–934.

9. Q. Zhou and J. K. Aggarwal, "Object Tracking in an Outdoor Environment by Using Fusion of Features and Cameras," Image and Vision Computing 24, 1244 (2006).

10. R. Tsai, "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses," IEEE J. Robot. Autom. 3, 323 (1987).

11. Z. Zhang, "A Flexible New Technique for Camera Calibration," IEEE Trans. PAMI 22, 1330 (2000).

12. L. Lee, R. Romano, and G. Stein, "Monitoring Activities from Multiple Video Streams: Establishing a Common Coordinate Frame," IEEE Trans. PAMI 22, 758 (2000).

13. R. Haralick, C. Lee, K. Ottenberg, and M. Nolle, "Review and Analysis of Solutions of the Three-Point Perspective Pose Estimation Problem," Int. J. of Computer Vision 13, 331 (1994).

14. T. Yang, S. Z. Li, Q. Pan, and J. Li, "Real-Time Multiple Objects Tracking with Occlusion Handling in Dynamic Scenes," in Proc. IEEE CVPR 2004, 2005, pp. 970–975.

15. I. Huerta, D. Rowe, M. Mozerov, and J. González, "Improving Background Subtraction Based on Casuistry of Colour-Motion Segmentation Problems," in Proc. 3rd Int. Conf. IbPRIA, Gerona, Spain, May 2007, pp. 475–482.

Mikhail Mozerov. Graduated from Moscow State University in 1982 and received the Candidate degree (Eng.) from the Institute of Information Transmission Problems, Russian Academy of Sciences, in 1995. Currently a Project Director with the Computer Vision Center of the Universitat Autónoma de Barcelona (UAB), Spain. Scientific interests: signal and image processing, pattern recognition, and digital holography.

Ariel Amato. Received the Electronic Engineer degree from Universidad Tecnológica Nacional (UTN-FRC), Argentina, and the MS degree in Computer Vision from the Universitat Autónoma de Barcelona (UAB), Spain, in 2007. Currently a PhD student at the Computer Vision Center of UAB. Scientific interests: active camera control, segmentation, tracking, and human motion understanding.

Jordi González. Received the PhD degree from the Universitat Autónoma de Barcelona (UAB), Spain, in 2004. Currently holds the position of a Juan de la Cierva postdoctoral researcher at the Institut de Robótica i Informática Industrial (UPC-CSIC). Scientific interests: cognitive evaluation of human behaviors in image sequences. Author of more than 70 papers on active camera control, segmentation, tracking, human motion understanding (interpretation and reasoning), natural language text generation, and automatic behavioral animation. Participated as a Workpackage Leader in the European projects HERMES and VIDI-Video. He has co-founded the Image Sequence Evaluation research group at the Computer Vision Center in Barcelona.

Xavier Roca. Graduated from the Universitat Autónoma de Barcelona (UAB) in 1990 and was awarded the PhD degree in Computer Sciences by the same university in 1998. Currently Director of the Computer Sciences Department at UAB. Author of 30 papers published in the proceedings of national and international conferences and in international journals.