
* Corresponding author. Tel.: +49-511-762-5308; fax: +49-511-762-5333; e-mail: mech@tnt.uni-hannover.de.

Signal Processing 66 (1998) 203–217

A noise robust method for 2D shape estimation of moving objects in video sequences considering a moving camera

Roland Mech*, Michael Wollborn

Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, Universität Hannover, Appelstraße 9A, 30167 Hannover, Germany

Received 29 April 1997; revised 28 September 1997

Abstract

For the coding of image sequences at very low bit rates, it has been shown that the coding efficiency can be increased by taking into account the shape of moving objects in the scene. Moreover, the upcoming ISO/MPEG-4 standard considers the 2D shape of moving objects not only for reasons of coding efficiency, but also to provide the user with so-called content-based functionalities. However, the perfect automatic segmentation of moving objects in image sequences is still an unsolved problem. In this paper, an algorithm for automatic, noise robust 2D shape estimation of moving objects in video sequences is presented, which considers a moving camera. The algorithm consists of four main steps. In the first step, a possibly apparent camera motion is estimated and compensated. By the second step, a possibly apparent scene cut is detected, and if necessary the segmentation algorithm is reset. In the third step, a change detection mask is estimated by a relaxation technique, using local thresholds which consider the state of neighbouring pels. By the fourth step, regions where the background has been uncovered by a moving object are estimated, using motion information from a displacement vector field. Finally, the resulting object mask is adapted to luminance edges in the corresponding image, in order to improve the shape accuracy. Furthermore, the temporal coherency of the object mask is improved by applying a memory. The proposed algorithm is compared to an approach for 2D shape estimation of moving objects from the literature, which does not use any edge adaptation or object memory and which uses only one global threshold. The experimental results show that the resulting object shapes from the proposed algorithm look subjectively much better than those from the reference algorithm. © 1998 Elsevier Science B.V. All rights reserved.


Keywords: Very low bit-rate coding; Object-based image coding; Content-based; Videophone; MPEG-4; Segmentation; Shape estimation

1. Introduction

For the coding of image sequences at very low bit rates, standardized block-based hybrid coding techniques [5,12] transmit only texture and motion information in order to minimize the required transmission bit rate at a specific image quality. However, within the area of object-based analysis-synthesis coding [9,10,23,24] it has been shown that shape information with respect to arbitrarily shaped moving objects in the captured scene can be used to improve the coding efficiency, too. Moreover, the upcoming ISO/MPEG-4 standard [21,26] considers such 2D shape information not only for the coding efficiency but also to allow for the so-called content-based functionalities. Therefore, no longer only rectangular frames are transmitted, but the image sequence can consist of one or more 2D Video Object Planes (VOPs) [6,26]. These VOPs may be of arbitrary shape, so shape information has to be considered in the coding and decoding algorithms. Due to this VOP-based coding, specific parts of the scene can be addressed by the user and can be treated differently from other parts. This allows the realization of a couple of functionalities like object-based spatial and temporal scalability [30], user interaction with the image content, etc. However, it is not specified in the standard how the VOPs are generated, i.e. where the segmentation information comes from. Since this task is independent of the standard, it is and will be open for the user. This makes it possible to use, e.g., colour-keying techniques, interactive segmentation or fully automatic algorithms, depending on the requirements given by the respective application.

Up to now, the automatic segmentation of moving objects in image sequences is an unsolved problem. Current approaches are either based on motion information, on texture information or on both, i.e. on motion as well as on texture information. In [15,25], an estimated displacement vector field is clustered using an affine motion model, resulting in a number of regions with similar motion. These regions are subsequently smoothed by a relaxation technique. However, since the estimation of the displacement vector field leads to errors especially at object boundaries, the segmentation is not accurate enough. Furthermore, this technique leads in general to oversegmentation, i.e. the number of regions is larger than the number of moving objects in the scene. More accurate results with respect to the estimation of object boundaries are achieved by segmentation techniques which consider motion and texture information, as proposed e.g. in [3,8,29]. In [8], e.g., an image is first segmented using its textural information, resulting in regions with similar texture. Then, these regions are merged depending on the similarity of their motion. As motion information, either a dense displacement vector field or an affine motion model for each region is used. Due to the successive application of texture and motion analysis, these techniques still have some problems especially at object boundaries. Therefore, the approach in [27,28] proposes a simultaneous estimation of the texture segmentation and a dense displacement vector field by minimizing the displaced frame difference of two successive images. One disadvantage of this technique is its high complexity due to the estimation of a dense displacement vector field. Furthermore, in general, this approach also results in an oversegmentation with respect to the number of moving objects in the scene. An approach which uses mainly the luminance difference of two successive images is presented in [11]. Here, after thresholding the difference image using a global threshold, motion information from an estimated displacement vector field is used to eliminate regions of background being uncovered. This technique avoids the problem of oversegmentation, since it distinguishes only moving foreground and static background. Drawbacks are, first, the use of a global threshold, which makes the algorithm very sensitive to noise. Second, since no texture or colour information is used directly for segmentation, the resulting object boundaries are not accurate enough.

In this paper, an automatic, noise robust segmentation technique for 2D shape estimation of moving objects in video sequences considering a moving camera is presented, which overcomes the problems of inaccurate object boundaries and noise sensitivity. In order to avoid the problem of oversegmentation, the presented approach is based on the method described in [11], which is therefore used as a reference. Objects are assumed to be non-overlapping. Furthermore, parts of the work are based on a former segmentation algorithm which has been developed in the framework of object-based analysis-synthesis coding [10] and has later on been adapted to the Simulation Model for Object-based Coding of the COST 211ter Simulation Subgroup (SIMOC) [7]. The proposed algorithm first estimates and compensates a possibly apparent camera motion. For that purpose, the eight-parameter motion model described in [11] is used. Then, a possibly apparent scene cut is detected, and if necessary the segmentation algorithm is reset. The presented algorithm is mainly based on a change detection step, using a local thresholding technique. Therefore, the relaxation technique proposed in [1,2] is used, which takes into account local neighbourhoods. Furthermore, in order to get more temporally coherent segmentation results, a memory for the change detection masks is applied. After the change detection, an estimated displacement vector field is used in order to subdivide the changed area into moving object area and area of uncovered background. Finally, the boundaries of the resulting object regions are adapted to luminance edges of the current image in order to improve the accuracy of the estimated object shape. In order to cope with different kinds of sequences, the parameters used in the different steps adapt automatically to the current sequence, starting with appropriate initial values.


Fig. 1. Block diagram of the segmentation algorithm.

The proposed algorithm is still under investigation within the framework of an MPEG-4 core experiment on automatic segmentation. Results have been presented at the MPEG meetings in Firenze [16], Tampere [17], Chicago [18] and Maceió [19] in 1996, and in Sevilla [20] and Bristol [22] in 1997. A previous version of this algorithm has been presented at ICASSP'97 [13]. This version has been extended by consideration of a moving camera and has been presented at WIAMIS'97 [14].

The paper is organized as follows. In Section 2, the principle of the algorithm is described. In Sections 3–6, the estimation and compensation of camera motion, the scene cut detection, and the estimation of the change detection mask and of the object mask, which represents the 2D shapes of the moving objects in the scene, are described, respectively. In Section 7, experimental results are given. Section 8 concludes with the main results of the paper.

2. Principle of the algorithm

A segmentation of each frame of an image sequence into non-overlapping moving objects (denoted as foreground) and static and uncovered background (denoted as background) is performed, considering a moving camera. Fig. 1 gives an overview of the proposed segmentation method.

The proposed segmentation method can be subdivided into the following four steps: by the first step, an apparent camera motion is estimated and compensated using an eight-parameter motion model [11].

In the second step, an apparent scene cut or strong camera pan is detected by evaluating the MSE between the two successive frames, considering only background regions of the previous frame. In case of a scene cut, the algorithm is reset.

By the third step, an initial change detection mask (CDMi) between two successive frames is generated. In this mask, pels whose image luminance changed due to a moving object are marked. After that, the boundaries of changed image areas are smoothed by a relaxation technique [1,2], resulting in a mask called CDMs. Thereby, the algorithm adapts frame-wise automatically to the camera noise. While in [1] the variance of the camera noise and the variance of the difference frame within object regions are both measured based on the CDMi, here the previous mask of object regions (OM) is additionally used, in which pels are marked which belong to a moving object. In order to get temporally stable object regions, a memory for change detection masks (CDM) is applied, denoted as MEM. The temporal depth of MEM adapts automatically to the sequence. Finally, the CDMs is simplified by usage of a morphological closing-operator and elimination of small regions, resulting in the final CDM.

In the fourth step, an initial object mask (OMi) is calculated by eliminating the uncovered background areas from the CDM as in [11]. Therefore, displacement information for pels within the changed regions is used. The displacement is estimated by a hierarchical blockmatcher (HBM) [4]. For a higher accuracy of the calculated displacement vector field (DVF), the CDM from the first step is considered by the HBM. Finally, the boundaries of the OMi are adapted to luminance edges in the corresponding image in order to improve the shape accuracy. The result is the final object mask (OM).


3. Camera motion estimation and compensation

Given two successive frames I^(k) and I^(k+1) of an image sequence, an apparent camera motion is estimated and compensated. For this purpose, a method is used which estimates the motion parameters of a rigid plane object within a 3D space [11]. The method is based on an affine motion model. Its eight parameters a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8 can reflect any kind of motion, especially zoom and pan. For every pel (X_0, Y_0) in frame I^(k) the corresponding pel (X_1, Y_1) in frame I^(k+1) is given by

$$X_1 = \frac{a_1 X_0 + a_2 Y_0 + a_3}{a_7 X_0 + a_8 Y_0 + 1}, \qquad (1)$$

$$Y_1 = \frac{a_4 X_0 + a_5 Y_0 + a_6}{a_7 X_0 + a_8 Y_0 + 1}. \qquad (2)$$

The camera motion is estimated by regression, considering only pels within background regions of the previous frame. In the case of the first frame, or after a scene cut has been detected, the background regions are not known. Therefore, pels whose distance from the left or right border is less than 10 pels are used as observation points, assuming that there is no moving object near the left and right image border.

After the estimation of the motion parameters, the displacement vector for every pel is known. Then, a postprocessing step accounts for model failures in background regions due to the assumption of a rigid plane. By this postprocessing step, the estimated displacement vector of every background pel is improved by performing a full search within an area of limited size. Of course, only small model failures can be handled by this postprocessing step. If there are larger deviations from the model of a rigid plane background, the estimation fails. It is planned for future work to model the background by more than one rigid plane, and to use a more robust optimization method for the parameter estimation.

For camera motion compensation, the bilinear interpolating function is used. The camera motion is only compensated if a moving camera has been detected. This is the case if the absolute value of one of the estimated motion parameters (a_1 − 1), a_2, a_3, a_4, (a_5 − 1), a_6, a_7, a_8 is greater than 2.5.
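For illustration, the coordinate mapping of Eqs. (1) and (2) and the moving-camera test described above could be sketched as follows in Python; the function names and parameter conventions are assumptions made here for the sketch and are not part of the method as published:

```python
import numpy as np

def warp_coordinates(X0, Y0, a):
    # Eight-parameter model of Eqs. (1) and (2): maps a pel (X0, Y0)
    # in frame I(k) to its position (X1, Y1) in frame I(k+1).
    # a = (a1, ..., a8), stored here as a[0]..a[7].
    denom = a[6] * X0 + a[7] * Y0 + 1.0
    X1 = (a[0] * X0 + a[1] * Y0 + a[2]) / denom
    Y1 = (a[3] * X0 + a[4] * Y0 + a[5]) / denom
    return X1, Y1

def camera_is_moving(a, threshold=2.5):
    # A moving camera is assumed if the absolute value of one of
    # (a1 - 1), a2, a3, a4, (a5 - 1), a6, a7, a8 exceeds the threshold.
    identity = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=np.float64)
    return bool(np.any(np.abs(np.asarray(a, dtype=np.float64) - identity) > threshold))
```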

4. Scene cut detection

In order to allow the segmentation of sequences with a strong camera pan or a scene cut, a scene cut detector evaluates whether or not the difference between the current original frame I^(k+1) and the camera motion compensated previous frame I_CMC^(k+1) exceeds a given threshold t_SC. In detail, the mean-square error of the frame difference is calculated in the background regions of the previous frame. A scene cut or strong camera pan is detected by applying the following rule:

$$\frac{1}{N_{BG}} \sum_{\{p \,\mid\, OM^{(k)}(p)=0\}} \left(I^{(k+1)}(p) - I_{CMC}^{(k+1)}(p)\right)^2 \;\begin{cases} > t_{SC} & \Rightarrow \text{scene cut}, \\ \leqslant t_{SC} & \Rightarrow \text{no scene cut}. \end{cases} \qquad (3)$$

In Eq. (3), N_BG denotes the number of pels set to '0' in the previous OM (background pels). If a scene cut or strong camera pan is detected, the segmentation algorithm is reset, i.e. all parameters are set to their initial value.
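A compact sketch of this test, under the assumption that the frames and the object mask are given as NumPy arrays, could look as follows (the function name and the handling of an empty background are illustrative choices, not taken from the paper):

```python
import numpy as np

def detect_scene_cut(I_next, I_cmc, OM_prev, t_sc=250.0):
    # Eq. (3): mean-square error between the current frame and the
    # camera motion compensated previous frame, evaluated only over
    # background pels (OM_prev == 0). t_sc = 250 as in Table 1.
    background = (OM_prev == 0)
    n_bg = np.count_nonzero(background)
    if n_bg == 0:
        return False  # assumption: no decision without background pels
    diff = I_next.astype(np.float64) - I_cmc.astype(np.float64)
    return float(np.sum(diff[background] ** 2)) / n_bg > t_sc
```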

5. Estimation of the change detection mask

The algorithm for the estimation of the CDM can be subdivided into several steps, as shown in Fig. 2. In the following, these steps are described.

5.1. Computation of the initial CDM

Given two successive frames I^(k) and I^(k+1) of an image sequence, first the initial CDM for the camera motion compensated first frame I_CMC^(k+1) and the original second frame I^(k+1) is computed by a threshold operation on the squared luminance difference image, as described in [1]. The threshold is calculated by performing a significance test. Therefore, the luminance difference image D is modelled as Gaussian camera noise with a variance σ² equal to twice the variance of the camera noise. This means that a luminance difference d_i for pel i is assumed to correspond to camera noise (null hypothesis H_0) and not to moving objects (hypothesis H_1), with the following probability:

$$p(d_i \mid H_0) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{d_i^2}{2\sigma^2}\right\}. \qquad (4)$$

Fig. 2. Block diagram of the algorithm for calculating the CDM.

Fig. 3. Significance test for calculating the threshold t_α.

Assuming that neighbouring pels are statistically independent, the sum of pels in the squared difference image within a window of size (n×n), normalized by twice the variance of the camera noise,

$$T = \frac{1}{\sigma^2} \sum_{i=1}^{n \cdot n} d_i^2 \qquad (5)$$

is χ² distributed with n² degrees of freedom. All pels for which the sum T exceeds a certain threshold t_α are marked as changed, the others are marked as unchanged. The threshold t_α is chosen in the following way: assuming the pel has not changed caused by moving objects, the probability that a calculated sum T is greater than this threshold equals a given value α (Eq. (6)) (see Fig. 3):

$$\alpha = p(T > t_\alpha \mid H_0). \qquad (6)$$

5.2. Relaxation of the initial CDM for spatial homogeneity

Because real objects have smooth contours, spatial homogeneity of the CDM is required. Spatial homogeneity is achieved by relaxation of the CDMi, using a method proposed in [1]. There, the CDMi is processed iteratively. In each iteration step, a decision is made for every border pel in the CDMi, whether it belongs to the changed or to the unchanged area. Thus, for every border pel of changed image areas a local threshold is calculated, taking into account the local neighbourhood of the pel. For the decision whether a pel has changed by moving objects or not, a maximum a posteriori detector (MAP detector) is used:

$$\frac{P(Q_c \mid D)}{P(Q_u \mid D)} \;\underset{u}{\overset{c}{\gtrless}}\; 1 \quad\Longleftrightarrow\quad \frac{p(D \mid Q_c)}{p(D \mid Q_u)} \;\underset{u}{\overset{c}{\gtrless}}\; \frac{P(Q_u)}{P(Q_c)}. \qquad (7)$$

Here, Q_c means that a pel is marked changed in the CDM, while Q_u means that this pel is marked unchanged in the CDM. D denotes the given difference image between frame I_CMC^(k+1) and I^(k+1). The probability densities in Eq. (7) are modelled as follows.

The probability density for a luminance difference d_i caused by camera noise (p(D | Q_u)) is described by the same Gaussian distribution as given in Eq. (4). The corresponding density (p(D | Q_c)) of a luminance difference d_i caused by moving objects is modelled as a Gaussian distribution, too, but the variance of this distribution is the variance σ_c² of the difference image D measured in foreground regions:

$$p(D \mid Q_u) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{d_i^2}{2\sigma^2}\right\}, \qquad (8)$$

$$p(D \mid Q_c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left\{-\frac{d_i^2}{2\sigma_c^2}\right\}. \qquad (9)$$


Fig. 4. Neighbourhood of pel x and potentials B and C of the neighbours.

The two a priori probabilities in Eq. (7) are modelled by a Markov random field, considering a (3×3)-neighbourhood of a pel as shown in Fig. 4:

$$P(Q_u) = \frac{1}{Z} \exp\{-E(Q_u)\}, \qquad (10)$$

$$P(Q_c) = \frac{1}{Z} \exp\{-E(Q_c)\}. \qquad (11)$$

The constant Z is for normalization, while E denotes an energy term defined as

$$E(Q_{u/c}) = v_B(u/c)\,B + v_C(u/c)\,C. \qquad (12)$$

The terms v_B(q_k) and v_C(q_k), q_k ∈ {u, c}, are a measure for the inhomogeneity of the neighbourhood of pel k, which is separated into horizontal/vertical neighbours (potential B) and diagonal neighbours (potential C), as shown in Fig. 4. So, the term v_B(q_k) denotes the number of horizontal and vertical neighbours of pel k with the opposite label to q_k. In the same way, the term v_C(q_k) denotes the number of diagonal neighbours of pel k with the opposite label to q_k.

From Eqs. (7)-(12) the following decision rule can be derived [1], where d_i² denotes the squared luminance difference of pel i:

$$d_i^2 \;\underset{u}{\overset{c}{\gtrless}}\; \frac{2\sigma_c^2\sigma^2}{\sigma_c^2 - \sigma^2}\left(\ln\frac{\sigma_c}{\sigma} + \big(v_B(c) - v_B(u)\big)B + \big(v_C(c) - v_C(u)\big)C\right). \qquad (13)$$

The label of every border pel in the CDMi is decided by using this decision rule. The rule should be read as follows: if d_i² exceeds the threshold term on the right-hand side of Eq. (13), the pel is set to changed (CDMs^(k+1)(i) := 1), otherwise it is set to unchanged (CDMs^(k+1)(i) := 0). The relaxation is processed iteratively, until only a small number of pels are changed by the relaxation or the maximal number of iteration steps N is reached. The mask so obtained is denoted as CDMs.

For calculating the CDM, the variances of the static background σ² and within object regions σ_c² are needed. While in [1] always the CDMi is used for calculating the variances σ_c² and σ², it is more accurate to calculate the variance σ² within the unchanged regions of the CDM and the variance σ_c² within the object regions of the OM. Thus, the variance σ_c² does not include uncovered background. Since the positions of static background, uncovered background and moving objects are still unknown for the current frame, in this approach the OM and CDM from the previous frame are used. In order to improve the accuracy of the measured variances, the values of the last three images are averaged.

5.3. Temporal coherency of the object shapes

Fig. 5. An example for elimination of small regions by using a ternary mask: (a) CDM with a large region neighboured by smaller regions caused by noise; (b) CDM after elimination of small regions without using a ternary mask; (c) ternary mask; (d) CDM after elimination of small regions using a ternary mask.

In order to finally get temporally stable object regions, the CDMs is connected with the previous OM, i.e. in the CDMs additionally all pels are set to changed which belong to the OM of the previous frame. This is based on the assumption that all pels which belonged to the previous OM should belong to the current change detection mask. However, in order to avoid infinite error propagation, a pel belonging to the previous OM is only labelled as changed in the CDMs if it was also labelled as changed in the CDMs of one of the last L frames, too. For this purpose, a storage for every pel is applied, building a memory (MEM). The value of this storage indicates if the respective pel was set to changed in one of the L previous CDMs; the value L denotes the depth of the memory. Considering that CDMs, OM and MEM are two-dimensional fields and that MEM is the zero-matrix for the first frame, the update of MEM can be formulated as

$$\mathrm{MEM}^{(k+1)}(x, y) = \begin{cases} L & \text{if } \mathrm{CDMs}^{(k+1)}(x, y) = 1, \\ \max\big(0, \mathrm{MEM}^{(k)}(x, y) - 1\big) & \text{if } \mathrm{CDMs}^{(k+1)}(x, y) = 0. \end{cases} \qquad (14)$$

The current CDMs is then updated by a logical OR operation between the CDMs and the previous OM, taking into account the memory MEM:

$$\mathrm{CDMu}^{(k+1)}(x, y) = \mathrm{CDMs}^{(k+1)}(x, y) \,\vee\, \begin{cases} \mathrm{OM}^{(k)}(x, y) & \text{if } \mathrm{MEM}^{(k+1)}(x, y) > 0, \\ 0 & \text{if } \mathrm{MEM}^{(k+1)}(x, y) = 0. \end{cases} \qquad (15)$$

It is reasonable to set the depth of the memory L to a large value if the motion of the moving objects in the scene is small, and vice versa. Thus, L shall be decreased with the average amplitude D of the displacement vectors measured within object regions. Further, it is reasonable to increase the depth of the memory L with the size S of the moving objects. Thus, the value of L is automatically adapted throughout the sequence, using the following heuristic adaptation rule:

$$L = \max\left(1,\; L_{\max} - \frac{D^4}{c_{L1}} - c_{L2}\exp\left(-\frac{10^{-3}\,S}{c_{L3}}\right)\right). \qquad (16)$$

As can be seen in Eq. (16), the values of L are within the interval [1, L_max]. L is decreased by the two terms D⁴/c_L1 and c_L2 exp(−10⁻³S/c_L3). The first term increases with the fourth power of the average amplitude of the displacement vectors within object regions. The second term decreases exponentially with the size of the object regions. While the constants c_L1 and c_L3 are for normalization to different image formats, c_L2 is a weighting factor.
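The adaptation rule of Eq. (16) is a one-liner; in the sketch below the default constants are the QCIF values from Table 1:

```python
import numpy as np

def adapt_memory_depth(D, S, L_max=34.0, c_L1=1.0, c_L2=80.0, c_L3=1.0):
    # Eq. (16): D is the average displacement amplitude within object
    # regions, S the size of the moving objects.
    return max(1.0, L_max - D**4 / c_L1 - c_L2 * np.exp(-1e-3 * S / c_L3))
```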

5.4. Simplification of the CDM

In order to simplify the regions in the CDM, the morphological closing-operator is applied. Then, for the elimination of small regions, a ternary mask is generated, in which pels are labelled as changed if they are set to '1' in the CDM before and after the closing operation. Pels which are set to '0' in the CDM after the closing operation are set to unchanged in the ternary mask, while those pels which were set to '0' before and which are set to '1' after the closing operation are set to a third label. Within this ternary mask, small changed and unchanged regions with a size below a certain threshold are eliminated. In order to prevent very small regions, which are mostly due to noise spots, from growing together with larger regions when the closing is performed, the third label is treated in a special way, considering the label of the neighbouring regions. An example for this method is given in Fig. 5. There, in Fig. 5(a), a CDM is shown containing a large region which is neighboured by many small regions caused by noise. Fig. 5(b) presents the CDM after closing is performed. It can be seen that the noisy areas have grown together with the large region. In order to avoid this, the ternary mask is generated as described above, as shown for the example in Fig. 5(c). The CDM after elimination of small regions by using the ternary mask is presented in Fig. 5(d). At this point, the final CDM is obtained.

Fig. 6. Example for the separation of changed areas into foreground and background.
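A simplified sketch of this Section 5.4 step is given below; the minimum region size and the structuring element are assumptions, and the special treatment of the third label is only approximated by requiring that a surviving region contains enough pels which were already changed before the closing:

```python
import numpy as np
from scipy.ndimage import binary_closing, label

def simplify_cdm(cdm, min_size=50):
    closed = binary_closing(cdm.astype(bool), structure=np.ones((3, 3)))
    # Ternary mask: 1 = changed before and after closing, 0 = unchanged
    # after closing, 2 = set to changed only by the closing operation.
    ternary = np.zeros(cdm.shape, dtype=np.uint8)
    ternary[closed & (cdm == 1)] = 1
    ternary[closed & (cdm == 0)] = 2
    # Eliminate small regions: keep a connected region only if it holds
    # at least min_size pels that carry the 'changed' label (label 1).
    labels, n = label(closed)
    out = np.zeros_like(cdm, dtype=np.uint8)
    for r in range(1, n + 1):
        region = (labels == r)
        if np.count_nonzero(ternary[region] == 1) >= min_size:
            out[region] = 1
    return out
```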

6. Estimation of the object mask

For estimating the OM, two steps are performed. First, the uncovered background is detected and eliminated from the CDM, resulting in the OMi. Then, the OMi is adapted to luminance edges of the corresponding frame, resulting in the final OM. Both steps are described in the following.


Table 1
Parameter values used for the presented simulation results

Parameter                                            Symbol    QCIF   CIF
Threshold for scene cut detection                    t_SC      250    250
Threshold for estimation of the CDMi                 t_α       220    165
Potentials used in the energy term (Eq. (12))        B         5      5
                                                     C         2.5    2.5
Maximal number of iteration steps                    N         20     20
Adaptation radius                                    R         6      12
Maximum depth of MEM                                 L_max     34     34
Weighting factors used for calculation of            c_L1      1      15
the memory length                                    c_L2      80     80
                                                     c_L3      1      4

Fig. 7. An exemplary result of the edge adaptation algorithm.

6.1. Detection and elimination of uncovered background

Based on the estimated CDM, the OMi is calculated by detecting and eliminating uncovered background, taking into account the displacement for pels within the calculated CDM. Therefore, a DVF for the changed image areas is generated by hierarchical blockmatching [4]. In order to improve the DVF at object boundaries, the matching criterion is only evaluated for changed regions, leading to a so-called polygon-matching for boundary blocks [21].

Now, pels are set to foreground if the foot- and the top-point of the corresponding displacement vector are both inside the changed area in the current CDM [11]. If not, these pels are set to background. An example is given in Fig. 6, where an object moves from the left to the right.
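A sketch of this test follows; it assumes that the DVF stores, for every changed pel of the current frame, a vector (dy, dx) pointing to the corresponding position in the other frame, which is a convention chosen here purely for illustration:

```python
import numpy as np

def eliminate_uncovered_background(cdm, dvf):
    # A changed pel stays foreground only if the foot point (y, x) and
    # the top point (y + dy, x + dx) of its displacement vector both
    # lie inside the changed area of the current CDM [11].
    h, w = cdm.shape
    omi = np.zeros_like(cdm)
    for y, x in zip(*np.nonzero(cdm)):
        dy, dx = dvf[y, x]
        y2, x2 = int(round(y + dy)), int(round(x + dx))
        if 0 <= y2 < h and 0 <= x2 < w and cdm[y2, x2] == 1:
            omi[y, x] = 1  # both end points inside the changed area
    return omi
```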

6.2. Adaptation of the OM to luminance edges

Fig. 8. Results of both segmentation methods for QCIF-sequences at 10 Hz: (a) Mother-daughter (frame 30) segmented by the method described in [11]; (b) Mother-daughter (frame 30) segmented by the proposed method; (c) Hall-monitor (frame 30) segmented by the method described in [11]; (d) Hall-monitor (frame 30) segmented by the proposed method.

The OMi is adapted to luminance edges in the current frame, resulting in the final OM. The adaptation must not exceed the following limits: the outer limit of the area for adaptation is given by the contour of the OMi after performing morphological dilation ('blowing') twice, the inner limit is set to 6 pel.

Within these limits, the luminance edges of the current image I^(k+1) are calculated using a Sobel operator. The edge adaptation technique warps every border pel of the OMi to the nearest luminance edge of frame I^(k+1), if such an edge exists within an adaptation radius R around that pel. In Fig. 7, an example for edge adaptation is given: on the left side, a part of the border of a moving object in an OM is shown before edge adaptation is processed; additionally, the found luminance edges of the current image are shown. On the right side, the corresponding part of the border after edge adaptation is presented.
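The following sketch approximates this step with a crude nearest-edge snap; the Sobel threshold, the border-pel test and the omission of the paper's inner and outer adaptation limits are simplifications made here, so this is not the authors' exact contour warping:

```python
import numpy as np
from scipy.ndimage import sobel

def adapt_to_edges(om, I_next, R=6, edge_thresh=100.0):
    img = I_next.astype(np.float64)
    # Luminance edges of the current image via Sobel gradient magnitude.
    edges = np.hypot(sobel(img, axis=0), sobel(img, axis=1)) > edge_thresh
    out = om.copy()
    h, w = om.shape
    for y, x in zip(*np.nonzero(om)):
        nb = om[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
        if nb.min() == nb.max():
            continue  # interior object pel, not a border pel
        # Move the border pel to the nearest edge pel within radius R.
        best = None
        for dy in range(-R, R + 1):
            for dx in range(-R, R + 1):
                y2, x2 = y + dy, x + dx
                if 0 <= y2 < h and 0 <= x2 < w and edges[y2, x2]:
                    d = dy * dy + dx * dx
                    if best is None or d < best[0]:
                        best = (d, y2, x2)
        if best is not None:
            out[y, x] = 0
            out[best[1], best[2]] = 1
    return out
```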

In some situations it may happen that the OM is adapted to a luminance edge which belongs to the background and not to the foreground. This is the case if a moving object is separated from the background only by very small luminance gradients, while the luminance gradients in the background are quite large. However, the values of the parameters which determine the area for adaptation are chosen in such a way that this problem is negligible for all investigated test sequences.

7. Experimental results

Fig. 9. Results of both segmentation methods for CIF-sequences at 10 Hz: (a) Mother-daughter (frame 30) segmented by the method described in [11]; (b) Mother-daughter (frame 30) segmented by the proposed method; (c) Hall-monitor (frame 30) segmented by the method described in [11]; (d) Hall-monitor (frame 30) segmented by the proposed method.

The proposed algorithm was applied to typical videoconference sequences like mother-daughter and akiyo, to sequences containing objects with straightforward motion like hall-monitor and container-ship, and to sequences with a moving camera like coastguard and table-tennis. All sequences have a frame rate of 10 Hz and are tested in QCIF and CIF format. For all test sequences of a given image format, the same parameter values have been used, which then are partly adapted automatically to the current sequence. The resulting object masks are evaluated subjectively, because ground-truth masks are unknown.

The reference for the proposed segmentation method is the algorithm proposed in [11]. There, a change detection mask is calculated by using a constant threshold and a non-adaptive memory for change detection masks, i.e. a memory with depth 1. The object mask is generated by usage of a DVF calculated by the hierarchical blockmatcher. As this algorithm assumes a static camera, no segmentation results for the sequences coastguard and table-tennis can be shown as reference. For the presented simulation results, the parameter values shown in Table 1 have been used.

As can be seen from the results in Fig. 8 for QCIF resolution and in Fig. 9 for CIF resolution, the resulting object masks correspond more obviously to real objects than those from the reference algorithm. This is due to the relaxation step, which makes the algorithm more robust against noise, to the evaluation of the ternary mask in the elimination of small objects, and to the adaptation to luminance edges. The temporal coherency, which cannot be demonstrated in a printed paper, is also improved due to the application of an object mask memory. In Fig. 10, the results for a sequence with a moving camera are shown in QCIF and CIF resolution.

Fig. 10. Results of the proposed segmentation method for frame 30 at 10 Hz: (a) Table-tennis (QCIF format), (b) Table-tennis (CIF format). It is not possible to present results of the reference method due to the moving camera.

Both algorithms, the reference method and the proposed method, have roughly the same computational effort, because the module for hierarchical blockmatching, which is used in both methods, has by far the largest computational complexity compared to all other modules used. With respect to the execution time, about 5-10 s are required for segmenting one QCIF frame of an image sequence on a Sun Sparc Ultra (200 MHz) workstation. As all modules used have constant memory requirements, the additional memory needed by the proposed method compared to the reference method is a constant term.

8. Conclusions

An automatic, noise robust segmentation algorithm for shape estimation of moving objects in video sequences considering a moving camera has been presented. In this algorithm, the 2D shapes of moving objects in the captured scene, denoted as object masks, are estimated in four steps. In the first step, a possibly apparent camera motion is estimated and compensated, using an eight-parameter motion model. By the second step, a possibly apparent scene cut is detected. In the third step, a change detection mask is estimated by a local thresholding technique, using a special relaxation algorithm which makes the estimation more robust against noise. In addition, the resulting mask is smoothed by the relaxation. The temporal coherency of the resulting object masks throughout the video sequence is improved by applying a memory for change detection masks. The length of the memory adapts automatically to the amount of motion and to the size of the moving objects in the scene. In the fourth step, an estimated displacement vector field is used to subdivide the change detection mask into parts belonging to the moving objects and parts of uncovered background. Finally, an adaptation of the resulting object mask to luminance edges in the current image is performed in order to improve the accuracy of the estimated object shapes.

Simulations with the proposed algorithm, using different kinds of test sequences, have shown subjectively large improvements compared to the reference technique, which only uses a global thresholding technique for change detection without relaxation or luminance edge adaptation. The resulting masks look much better when the proposed algorithm is applied, i.e. they correspond more obviously to the real objects in the scene. Further, in comparison to the reference technique, the presented technique can also successfully be applied to sequences with a moving camera. Finally, the temporal coherence of the object masks is improved, although some jittering effects can still be observed. This is firstly due to the fact that the object masks are estimated for each image independently; the applied object memory improves the coherency, but only on a global level and not in detail at the object boundaries. Secondly, the object mask is only estimated with pel accuracy; however, the real boundaries will in general not lie exactly on pel positions but also in between two pels. Therefore, future work will deal with subpel 2D shape estimation of moving objects in image sequences.

References

[1] T. Aach, A. Kaup, R. Mester, Statistical model-based change detection in moving video, Signal Processing 31 (2) (March 1993) 165–180.

[2] T. Aach, A. Kaup, R. Mester, Change detection in image sequences using Gibbs random fields: a Bayesian approach, Proc. Internat. Workshop on Intelligent Signal Processing and Communication Systems, Sendai, Japan, October 1993, pp. 56–61.

[3] J. Benois, L. Wu, Joint contour-based and motion-based image sequences segmentation for TV image coding at low bit-rate, Visual Communications and Image Processing, Chicago, Illinois, September 1994.

[4] M. Bierling, Displacement estimation by hierarchical blockmatching, 3rd SPIE Symposium on Visual Communications and Image Processing, Cambridge, USA, November 1988, pp. 942–951.

[5] CCITT, Draft revision of recommendation H.261: video codec for audio visual services at p×64 kbit/s, Study Group XV, WP/1/Q4, Specialist group on coding for visual telephony, Doc. No. 584, November 1989.

[6] I. Corset, Proposal for a structure allowing wide range of core experiments, contribution to the MPEG-4 ad-hoc group on definition of VMs for content-based video representation, 20 December 1995.

[7] COST 211ter simulation subgroup, SIMOC I, Doc. SIM(94)61, October 1994.

[8] C. Gu, T. Ebrahimi, M. Kunt, Morphological spatio-temporal segmentation for content-based video coding, Internat. Workshop on Coding Techniques for Very Low Bit-rate Video, Tokyo, Japan, November 1995.

[9] P. Gerken, Object-based analysis-synthesis coding of image sequences at very low bit rates, IEEE Trans. Circuits Systems Video Technol. (Special Issue on Very Low Bit Rate Video Coding) 4 (3) (June 1994) 228–235.

[10] M. Hötter, Object-oriented analysis-synthesis coding based on moving two-dimensional objects, Signal Processing: Image Communication 2 (4) (December 1990) 409–428.

[11] M. Hötter, R. Thoma, Image segmentation based on object oriented mapping parameter estimation, Signal Processing 15 (3) (October 1988) 315–334.

[12] ITU-T, Video coding for narrow telecommunication channels at <64 kbit/s, Draft Recommendation H.263, January 1995.

[13] R. Mech, M. Wollborn, A noise robust method for segmentation of moving objects in video sequences, Internat. Conf. on Acoustics, Speech and Signal Processing, Munich, Germany, April 1997.

[14] R. Mech, M. Wollborn, A noise robust method for 2D shape estimation of moving objects in video sequences considering a moving camera, Workshop on Image Analysis for Multimedia Interactive Services, Louvain-la-Neuve, Belgium, June 1997.

[15] MPEG-4 Video Group, Technical description of the video encoder (proposal for MPEG-4 tests), Doc. ISO/IEC JTC1/SC29/WG11 MPEG95/504, December 1995.

[16] MPEG-4 Video Group, MPEG-4 automatic segmentation of moving objects (core experiment N2), Doc. ISO/IEC JTC1/SC29/WG11 MPEG96/841, March 1996.

[17] MPEG-4 Video Group, MPEG-4 automatic segmentation of moving objects (core experiment N2), Doc. ISO/IEC JTC1/SC29/WG11 MPEG96/989, June 1996.

[18] MPEG-4 Video Group, MPEG-4 automatic segmentation of moving objects (core experiment N2), Doc. ISO/IEC JTC1/SC29/WG11 MPEG96/1188, September 1996.

[19] MPEG-4 Video Group, MPEG-4 automatic segmentation of moving objects (core experiment N2), Doc. ISO/IEC JTC1/SC29/WG11 MPEG96/1549, November 1996.

[20] MPEG-4 Video Group, MPEG-4 automatic segmentation of moving objects (core experiment N2), Doc. ISO/IEC JTC1/SC29/WG11 MPEG97/1831, February 1997.

[21] MPEG-4 Video Group, MPEG-4 Video Verification Model Version 7.0, Doc. ISO/IEC JTC1/SC29/WG11 N1642, April 1997.

[22] MPEG-4 Video Group, MPEG-4 automatic segmentation of moving objects (core experiment N2), Doc. ISO/IEC JTC1/SC29/WG11 MPEG97/1949, April 1997.

[23] H.G. Musmann, M. Hötter, J. Ostermann, Object-oriented analysis-synthesis coding of moving images, Signal Processing: Image Communication 1 (2) (October 1989) 117–138.

[24] J. Ostermann, Object-oriented analysis-synthesis coding (OOASC) based on the source model of moving flexible 3D objects, IEEE Trans. Image Process. 3 (5) (September 1994).

[25] F. Pedersini, A. Sarti, S. Tubaro, Combined motion and edge analysis for a layer-based representation of image sequences, IEEE Internat. Conf. on Image Processing, Lausanne, Switzerland, September 1996.

[26] F. Pereira, MPEG-4: a new challenge for the representation of audio-visual information, keynote speech at Picture Coding Symp. PCS'96, Melbourne, March 1996.

[27] C. Stiller, Object-oriented video coding employing dense motion fields, Internat. Conf. on Acoustics, Speech and Signal Processing, Adelaide, South Australia, April 1994.

[28] C. Stiller, Object-based estimation of dense motion fields, IEEE Trans. Image Process. 6 (2) (February 1997) 234–250.

[29] N.T. Watsuji, H. Katata, T. Aono, Morphological segmentation with motion based feature extraction, Internat. Workshop on Coding Techniques for Very Low Bit-rate Video, Tokyo, Japan, November 1995.

[30] M. Wollborn, M. Kampmann, R. Mech, Content-based coding of videophone sequences using automatic face detection, Picture Coding Symposium, Berlin, Germany, September 1997.
