

SIGNAL PROCESSING: IMAGE COMMUNICATION

Signal Processing: Image Communication 9 (1997) 175-199

Shape estimation of articulated 3D objects for object-based analysis-synthesis coding (OBASC)

Geovanni Martinez

Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, Universität Hannover, Appelstraße 9A, D-30167 Hannover, Germany

Abstract

This paper investigates shape estimation of articulated 3D objects for object-based analysis-synthesis coding based on the source model of 'moving articulated 3D objects'. For shape estimation three steps are applied: shape-initialization, object-articulation and shape-adaptation. Here, a new algorithm for object-articulation is introduced. Object-articulation subdivides a rigid model object represented by a mesh of triangles into flexibly connected model object-components. For object-articulation, neighboring triangles which exhibit similar 3D motion during the image sequence are clustered into patches. These patches are considered to be the model object-components. For 3D motion estimation of a single triangle, a more reliable algorithm is proposed. The reliability is measured by the probability of convergence to correct parameters. To improve the reliability, both a more robust technique is applied and the triangle together with its neighborhood is evaluated by the estimation algorithm. For clustering, a frame-to-frame clustering method which considers clustering results obtained in previous frames is presented. The developed algorithm for object-articulation is incorporated in OBASC. Typical videophone test sequences were applied. Compared to OBASC based on the source model of 'moving rigid 3D objects', the transmission rate decreases from 63.5 to 53 kbit/s at a fixed image quality measured by SNR. Furthermore, a realistic object articulation of a model object 'body' into the object-components 'head' and 'shoulders' can be achieved without a priori knowledge about the scene content. © 1997 Elsevier Science B.V.

Keywords: Object-oriented image coding; Object-based analysis-synthesis coding; Videophone; Image analysis; Rigid 3D objects; Articulated 3D objects; Object articulation; 3D shape estimation; Global and local 3D motion estimation; Segmentation; Clustering

1. Introduction

For coding of moving images at low data rates in the range of 8 to 64 kbit/s, object-based analysis-synthesis coding (OBASC) [22, 25] is investigated. An OBASC scheme subdivides each image of an image sequence into uniformly moving objects and describes each object in terms of three sets of parameters defining its motion, shape and color. Color parameters denote the luminance as well as the chrominance reflectance on the object surface. The nature of the sets of parameters depends on the source model being applied; the parameters are estimated automatically by image analysis. Assuming diffuse illumination, image regions where motion compensation fails because motion and shape parameters could not be estimated successfully are called Model Failure objects (MF-objects). The color



parameters must be coded and transmitted for MF-objects only. Since the transmission of color parameters is expensive in terms of data rate, the total size of all MF-objects should be kept as small as possible. In order to reduce the total size of MF-objects, the source model of 'moving articulated 3D objects' is used instead of the source model of 'moving rigid 3D objects'. In this contribution, an algorithm for shape estimation of articulated 3D objects is presented.

A first complete implementation of an object-based analysis-synthesis coder based on the source model of 'flexible 2D objects with 2D motion' (OBASC_F2D) has been presented by Hötter [11, 12]. According to this source model, real objects are flexible, planar and moving in the image plane.

Ostermann [23] proposes an OBASC scheme based on the source model of 'moving rigid 3D objects' (OBASC_R3D). According to this source model, real objects are rigid with 3D shape and moving in the 3D space. The motion is defined by a set of 6 parameters which describes the translation and rotation of the object in the 3D space. The 3D shape is represented by a mesh of triangles which is put up by vertices denoted as control points. The color parameters are taken by projection of a real image onto the surface of the mesh of triangles. Objects may be articulated, i.e. consist of two or more flexibly connected rigid 3D object-components. Each object-component has its own set of motion, shape and color parameters. Since the shape of each object-component is defined by its control points, object-components are connected by those triangles having control points belonging to different object-components. Due to these connecting triangles, object-components are flexibly connected. Ostermann [24] showed that it is important to use a source model of 3D objects with flexible shape instead of rigid shape.

For motion estimation of articulated objects, a spatially hierarchical coarse-to-fine approach was proposed by Koch [19] and is used by Ostermann [23] and Kampmann [15]. The shapes of the object-components are assumed to be rigid and known, but no spatial constraints are considered, i.e. the connecting triangles do not enforce a constraint on the spatial location of the object-components.

In [9, 21] spatial constraints are applied in order to improve motion estimation of articulated objects.

For shape estimation of articulated 3D objects, the image analysis described by Ostermann [23] applies the following three steps: shape-initialization, object-articulation and shape-adaptation.

Shape-initialization carries out a change detection to distinguish between temporally changed and unchanged regions of the first two images of the sequence. It is assumed that each changed region represents one moving object together with background uncovered due to object motion. For each changed region one rigid 3D shape represented by a mesh of triangles is then generated assuming ellipsoidal shape.

Since shape-initialization describes flexibly connected object-components by only one rigid 3D model object, motion compensation will fail when the object-components move differently. In order to subdivide each model object into flexibly connected model object-components, object-articulation is carried out. For object-articulation, Ostermann [23] applies the algorithm proposed by Busch [3]. This algorithm considers only results obtained in the current frame in order to articulate an object and is applied to each frame of the image sequence after shape-initialization. For articulation, neighboring triangles of the model object which exhibit similar 2D motion parameters are clustered into patches. A patch is detected as an object-component if it improves 3D motion compensation. If an object-component is detected, it is considered as flexibly connected to the residual model object-components and is described using three new individual sets of parameters. However, due to the 2D motion description of a triangle instead of a 3D motion description, the clustering fails especially during rotation of real object-components. Since motion estimation of small triangles is unreliable, only parts of real object-components are detected. Thus, the motion compensation will fail.

Shape adaptation is used to update the shape parameters of each object or object-component during the sequence.

In this contribution, an algorithm for improving the object-articulation is presented. In order to cluster neighboring triangles, 3D motion information


instead of 2D motion information is applied. In order to improve the reliability of 3D motion estimation, both a robust estimation algorithm is applied and the triangle together with its neighborhood is evaluated by the estimation algorithm. In addition, clustering results obtained in previous frames are taken into account for the current clustering. Hence, algorithms for 3D motion estimation and for clustering have to be developed.

The developed algorithm will be incorporated in the image analysis of an OBASC_R3D scheme according to [23] in order to evaluate the coding efficiency. The coding efficiency is measured by the reduction of the bit-rate at a fixed image quality measured by SNR. This paper is organized as follows. In Section 2, the principle of object-based analysis-synthesis coding is described. Furthermore, the source model of 'moving rigid 3D objects' is defined. The proposed algorithm for 3D motion estimation of a triangle is presented in Section 3. Section 4 explains the algorithm for object-articulation and gives the details of the proposed algorithm for clustering of triangles. Parameter coding is described in Section 5. In Section 6, experimental results for real image sequences are given. A final discussion is presented in Section 7.

2. Object-based analysis-synthesis coding

OBASC [22] subdivides each image of a sequence into uniformly moving objects. Each object m is then described by three sets of parameters A^(m), M^(m) and S^(m) defining its motion, shape and color, respectively. Motion parameters define the position and motion of the object and shape parameters define its shape. Color parameters denote the luminance and chrominance reflectance of the object surface. Using Fig. 1, the concept and structure of an OBASC scheme is described. OBASC requires in coder and decoder a memory for parameters in order to store the last coded and transmitted object parameters A'^(m), M'^(m) and S'^(m). The stored parameters are used by the image synthesis to synthesize a model image s'_k, which is displayed at the decoder. The parameter sets of the memory and the current image s_{k+1} are used as input to image analysis.

The image analysis estimates for each object m the new parameter sets A^(m)_{k+1}, M^(m)_{k+1} and S^(m)_{k+1}. For moving objects, new motion and shape parameters are estimated in order to reuse most of the already transmitted and stored color parameters. Objects for which motion and shape parameters can be estimated successfully are denoted as MC-objects (Model Compliance). Image areas which cannot be


Fig. 1. Block diagram of an object-based analysis-synthesis coding (OBASC) scheme.


represented by MC-objects applying the transmitted and stored color parameters and the new motion and shape parameters A^(m)_{k+1} and M^MC_{k+1}, respectively, are called model failures and are represented by MF-objects. The MF-objects are described by 2D shape and color parameters only. Since geometrical distortions, i.e. small position and shape errors of the MC-objects, do not disturb the subjective image quality, MF-objects are reduced to those image regions with significant differences between the motion and shape compensated prediction image and the current image s_{k+1}.

For coding of the sets of parameters, predictive coding is used. The motion and shape parameters are encoded and transmitted for MC-objects, whereas shape and color parameters are encoded and transmitted for MF-objects. Since the transmission of color parameters is expensive in terms of data rate, the total area of MF-objects should not be larger than 4% of the image area, assuming a transmission rate of 64 kbit/s and a spatial resolution CIF (Common Intermediate Format) with a reduced frame frequency of 10 Hz.

At the decoder, the parameter sets transmitted for each object class MC/MF are decoded. In the memory for parameters, position and shape of MC-objects are updated. In areas of model failures, color parameters of MC-objects are substituted by the color parameters of the transmitted MF-objects.

2.1. The source model of 'moving rigid 3D objects'

By means of the source model of 'moving rigid 3D objects', changes between two successive frames of an image sequence are described. Therefore, the 3D real world is modelled by a 3D model world. The real world consists of a real scene, a real illumination and a real camera. The real scene consists of real objects and their relationships. While a real image is taken by a real camera, a model image is synthesized using a model camera looking into the model world.

According to the source model of 'moving rigid 3D objects', the real illumination is modelled by spatially and temporally constant diffuse illumination. Furthermore, it is assumed that real objects have diffuse reflecting surfaces. A perspective camera model is used whose target is the model image. By this camera model, an arbitrary point P^(i) = (P_x^(i), P_y^(i), P_z^(i))^T of the scene is projected onto the point p^(i) = (p_X^(i), p_Y^(i))^T of the model image using the following equations:

p_X^(i) = F · P_x^(i) / P_z^(i),   p_Y^(i) = F · P_y^(i) / P_z^(i),   (1)

where F is the focal length of the camera, (X, Y) represent the image coordinates and (x, y, z) the model world coordinates, respectively.
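As a minimal illustration of Eq. (1), the following Python sketch projects a model-world point onto the image plane. The focal length value and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

F = 1000.0  # focal length in pel; an assumed value, not from the paper

def project(P, F=F):
    """Perspective projection of a model-world point P = (Px, Py, Pz)
    onto image coordinates (pX, pY) according to Eq. (1)."""
    Px, Py, Pz = P
    return np.array([F * Px / Pz, F * Py / Pz])

# example: a point two length units in front of the camera
p = project(np.array([0.1, 0.05, 2.0]))
```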

In the model world, each moving real object is represented by a model object. A model object is described by three sets of parameters defining its 3D motion, 3D shape and color.

The 3D shape Mcm’ of a model object m is de- scribed by a mesh of triangles which is put up by vertices referred to as control points P& (a: 1 . . . Npt). The appearance of the model object sur- face is described by the color parameters S’“‘, which define the luminance and chrominance reflectance. A model object may be articulated, i.e. consists of two or more rigid object-components. Each object- component has its own set of motion, shape and color parameters. Object-components are flexibly connected to each other but no spatial constraints are considered. The BACKGROUND is also con- sidered as a part of the scene and is described as a non-moving rigid 2D plane defined by color para- meters only. Fig. 2 shows a scene with the objects BACKGROUND and CLAIRE. The model object CLAIRE consists of the two flexibly connected object-components HEAD and SHOULDER.

The 3D motion of a rigid object or rigid object-component m is described by the parameters A^(m) = (T_x^(m), T_y^(m), T_z^(m), R_x^(m), R_y^(m), R_z^(m))^T defining its translation and rotation in the 3D space. An arbitrary point P^(i) on the surface of an object m is moved to its new position P'^(i) according to

P'^(i) = [R^(m)] · (P^(i) − C^(m)) + C^(m) + T^(m),   (2)

with the translation vector T^(m) = (T_x^(m), T_y^(m), T_z^(m))^T, the rotation angles R^(m) = (R_x^(m), R_y^(m), R_z^(m)), the object center

C^(m) = (C_x^(m), C_y^(m), C_z^(m))^T = (1/N_CP) · Σ_{a=1}^{N_CP} P_C^(a),   (3)



Fig. 2. Model scene and articulated model object CLAIRE: (a) scene with the objects BACKGROUND and CLAIRE; (b) flexibly connected object-components of the articulated model object CLAIRE. The dark colored triangles represent the connecting triangles between object-components.

and the rotation matrix

[R^(m)] =
[  cos R_y cos R_z    sin R_x sin R_y cos R_z − cos R_x sin R_z    cos R_x sin R_y cos R_z + sin R_x sin R_z ]
[  cos R_y sin R_z    sin R_x sin R_y sin R_z + cos R_x cos R_z    cos R_x sin R_y sin R_z − sin R_x cos R_z ]
[ −sin R_y            sin R_x cos R_y                              cos R_x cos R_y                           ],   (4)

which defines the rotation in the mathematically positive direction around the x-, y- and z-axes with the rotation center C^(m).
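The rigid motion of Eqs. (2)-(4) can be sketched as follows; rotation_matrix implements the matrix of Eq. (4) as reconstructed above, and move_points applies Eq. (2) to an array of surface points (all names are illustrative, not from the paper).

```python
import numpy as np

def rotation_matrix(Rx, Ry, Rz):
    """Rotation matrix of Eq. (4): rotations in the mathematically
    positive direction around the x-, y- and z-axes."""
    cx, sx = np.cos(Rx), np.sin(Rx)
    cy, sy = np.cos(Ry), np.sin(Ry)
    cz, sz = np.cos(Rz), np.sin(Rz)
    return np.array([
        [cy * cz, sx * sy * cz - cx * sz, cx * sy * cz + sx * sz],
        [cy * sz, sx * sy * sz + cx * cz, cx * sy * sz - sx * cz],
        [-sy,     sx * cy,                cx * cy],
    ])

def move_points(P, T, R_angles, C):
    """Rigid 3D motion of Eq. (2): P' = [R](P - C) + C + T.
    P is an (N, 3) array of surface points, C the object center (Eq. (3))."""
    R = rotation_matrix(*R_angles)
    return (P - C) @ R.T + C + T
```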

2.1.1. Shape-initialization

At the beginning of an image sequence, the model world has to be initialized by the image analysis. Model world initialization consists of three parts: camera, illumination and object-initialization. Object-initialization is carried out in two steps: shape-initialization and color-initialization. Motion does not have to be initialized.

Shape-initialization generates a rigid 3D shape for each moving real object in the scene. Therefore, it carries out a change detection [13, 23, 24, 28] to distinguish between temporally changed and unchanged regions of the first two images of the sequence. Each changed region marks the silhouette of a moving real object together with the background uncovered due to the object motion. For each changed region, one rigid 3D shape represented by a mesh of triangles is generated using a distance transform [23]. Fig. 3(a)-(c) shows the principal steps for generation of a 3D shape from a given


silhouette. By the distance transformation, an ellipse is used as a generating function giving the z-coordinate of the shape for each point of the silhouette as a function of its distance to the boundary of the silhouette. The resulting 3D shape is approximated by contour lines (Fig. 3(b)). Contour lines are located on the 3D shape. The distance along the 3D surface between two contours is constant. This enables the algorithm to compute a homogeneous mesh of triangles. Contour lines are then approximated by polygons [10]. The resulting polygon points are the control points P_C^(a) of a mesh of triangles (Fig. 3(c)). The model object defined by this mesh of triangles is then placed in the model world. Since the 3D shape of an object is completely described by its 2D silhouette, during shape-initialization the shape parameters of an object represent a 2D binary mask which defines the silhouette of the model object in the image plane.
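A minimal sketch of the generating-function idea follows, assuming SciPy's Euclidean distance transform; the elliptical profile and the depth scale z_max are assumptions, and the contour-line and polygon-approximation steps [10] are omitted.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def ellipsoid_depth(silhouette, z_max=50.0):
    """Generate the z-coordinate of the 3D shape for each pel of a binary
    silhouette mask, using an ellipse as generating function of the
    distance to the silhouette boundary (cf. Fig. 3(a)-(b))."""
    d = distance_transform_edt(silhouette)      # distance to boundary
    d_max = d.max() if d.max() > 0 else 1.0     # deepest interior point
    # one plausible elliptical profile: z/z_max = sqrt(1 - (1 - d/d_max)^2)
    z = z_max * np.sqrt(1.0 - (1.0 - d / d_max) ** 2)
    return np.where(silhouette, z, 0.0)
```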

By color-initialization, the real image s_0 is then projected into the model scene using the geometry of the model camera in order to define the initial color parameters of the new model objects and of the BACKGROUND.



Fig. 3. Processing steps from object silhouette to articulated object: (a) object silhouette; (b) contour lines approximating the 3D object shape which was generated by a distance transform assuming a 3D ellipsoidal shape. The contour lines are approximated by polygons; (c) rigid model object represented by a mesh of triangles using polygon points as vertices; (d) subdivision of the rigid model object into flexibly connected model object-components.

2.2. Image analysis

Fig. 4 shows the structure of image analysis that is carried out for each image s_{k+1}, k > 0, of the image sequence. Input to image analysis are the current real image s_{k+1} and the model world W'_k represented by the parameters A'^(m)_k, M'^(m)_k and S'^(m)_k for each model object m. Image analysis consists of seven parts: image synthesis, change detection, object-articulation, 3D motion estimation, detection of object silhouettes, shape-adaptation and model failure detection.

First, a model image s'_k of the current model world is computed by means of image synthesis [23]. A change detection mask B_{k+1} is then calculated using the images s'_k and s_{k+1}. This mask B_{k+1} marks the moving objects and the background uncovered due to object motion.

Each rigid model object, which was generated during shape-initialization to describe a real moving articulated object, is subdivided into flexibly connected model object-components by object-articulation (Fig. 3(c)-(d)). For object-articulation, neighboring triangles of the model object whose 3D motion during the image sequence can be described by similar motion parameters are clustered into patches, which are considered to be the object-components. The subdivision does not change the topology of the wire frame but assigns the control points of those triangles to the new object-component. Object-components remain flexibly connected to each other and are described using individual sets of parameters. In order to detect an object-component, object-articulation requires the analysis of several consecutive frames of the image sequence. M^artic_{k+1} represents the articulation parameters; it marks for each new model object-component the connecting triangles between it and the other model object-components.

In order to compensate for real object motion and in order to separate the moving objects from the uncovered background included in mask B_{k+1}, 3D motion parameters are estimated for the model objects [23]. The basic idea for the detection of uncovered background is that the projection of



Fig. 4. Block diagram of image analysis: A'_k, M'_k, S'_k stored motion, shape and color parameters; s_{k+1} real image to be analyzed; s'_k, s* model images; B_{k+1} change detection mask; C_{k+1} object silhouettes; M_{k+1} shape parameters for MC- and MF-objects; M^artic_{k+1} articulation parameters; S_{k+1} color parameters for MC- and MF-objects. Arrows indicate the information used in various parts of image analysis. Semicircles indicate the output of processing steps.

a moving object before and after motion has to lie completely in the changed area [13, 23]. Subtracting the uncovered background from mask B_{k+1} gives the new silhouette C_{k+1} of all model objects.

Shape-adaptation updates the silhouette of each model object-component of a model object m to the real object silhouette C^(m)_{k+1}. The silhouette of a model object-component differs from the silhouette of the real object-component in two ways: first, when differences between the shape of the real and the model object-component become visible during rotation; secondly, when new parts of the real object-component start moving for the first time. For instance, due to the little motion of Claire [6] at the beginning of the sequence, only parts of the shoulders are detected during the first frames.

To compensate the differences between the silhouettes of the model object-components and C_{k+1} as a result of rotations, shape-adaptation shifts the control points close to the silhouette boundary perpendicular to the surface of each model object-component so that the model object-component gets the required silhouette. These shift vectors give the new shape parameters M^MC_{k+1}.

For solving the problem of incorporating new object parts into the shape of the model object, a new mesh of triangles is generated so that the new model object silhouette consists of both the old silhouette and the new parts indicated by the mask C_{k+1}. In case the old mesh of triangles had already been articulated, the new object shape is articulated as well. For articulation, the new silhouette of each model object-component is considered to be the old model object-component silhouette plus the corresponding neighboring new parts indicated by the mask C_{k+1}. The new silhouettes of the object-components and the articulation parameters give the new shape parameters M^MC_{k+1}.

For the detection of model failures, a model image s* is synthesized using the stored parameters A'_k, M'_k, S'_k and the current motion and shape parameters A_{k+1} and M^MC_{k+1}, respectively. The differences between the images s* and s_{k+1} are evaluated for determining the areas of model failure. They are represented by MF-objects. These MF-objects are described by 2D shape parameters M^MF_{k+1} and color parameters S^MF_{k+1} only.

2.3. Motion estimation

2.3.1. Motion estimation of rigid objects

In order to estimate the 3D motion parameters of each rigid 3D object, an algorithm has been proposed in [16, 17]. For motion estimation, it is


supposed that differences between two consecutive images s_k and s_{k+1} are due to the object motion and that the shape of the objects has already been estimated by shape-initialization. The motion estimation method minimizes the mean square luminance difference between the model image s'_k and the real image s_{k+1}. Therefore, a gradient method is applied which evaluates all visible surface points of each model object m. Each surface point S^(j) = (P^(j), g^(j), I^(j)) at time instant k is located on the model object surface and is described by its position P^(j) = (P_x^(j), P_y^(j), P_z^(j))^T, its luminance value I^(j) and its linear luminance image gradients g^(j) = (g_X^(j), g_Y^(j))^T. The luminance value and the linear gradients are taken from the same image from which the color parameters of the 3D model objects were derived. In case of small displacements, for each visible surface point S^(j), the luminance difference ΔI^(j) between image s'_k and s_{k+1} is related to motion by the following linear luminance signal model:

ΔI^(j) = s_{k+1}(p^(j)_{k+1}) − s'_k(p^(j)_k) = (g_X^(j), g_Y^(j)) · (p^(j)_{k+1} − p^(j)_k),   (5)

where p^(j)_k and p^(j)_{k+1} represent the positions of surface point S^(j) in the image plane at time instants k and k+1, respectively. The positions p^(j)_k and p^(j)_{k+1} can be calculated by Eq. (1). Combining Eqs. (2) and (5) as well as assuming small rotation angles, Eq. (5) can be linearized into the following linear measurement equation [17, 23]:

ΔI^(j) = F · g_X/P_z · T_x^(m)
       + F · g_Y/P_z · T_y^(m)
       − [(P_x·g_X + P_y·g_Y) · F/P_z² + ΔI^(j)/P_z] · T_z^(m)
       − [(P_x·g_X·(P_y − C_y) + P_y·g_Y·(P_y − C_y) + P_z·g_Y·(P_z − C_z)) · F/P_z²
          + ΔI^(j)/P_z · (P_y − C_y)] · R_x^(m)
       + [(P_y·g_Y·(P_x − C_x) + P_x·g_X·(P_x − C_x) + P_z·g_X·(P_z − C_z)) · F/P_z²
          + ΔI^(j)/P_z · (P_x − C_x)] · R_y^(m)
       − [g_X·(P_y − C_y) − g_Y·(P_x − C_x)] · F/P_z · R_z^(m) + δI^(j),   (6)

where A^(m) = (T_x^(m), T_y^(m), T_z^(m), R_x^(m), R_y^(m), R_z^(m))^T are the unknown motion parameters of the model object m and δI^(j) is the unknown luminance error at position p^(j)_k. The luminance error δI^(j) is due to camera noise, motion estimation errors of previous frames and the shape estimation error of the 3D model object m.

Considering that ΔI^(j)/P_z^(j) ≈ 0, expression (6) can be written in the following compact vector form:

ΔI^(j) = [H]_(j) · A^(m) + δI^(j),   (7)

where [H]_(j) is a (1 × 6) output matrix:

[H]_(j) = [ F·g_X/P_z,
            F·g_Y/P_z,
            −(P_x·g_X + P_y·g_Y)·F/P_z²,
            −(P_x·g_X·(P_y − C_y) + P_y·g_Y·(P_y − C_y) + P_z·g_Y·(P_z − C_z))·F/P_z²,
             (P_y·g_Y·(P_x − C_x) + P_x·g_X·(P_x − C_x) + P_z·g_X·(P_z − C_z))·F/P_z²,
            −(g_X·(P_y − C_y) − g_Y·(P_x − C_x))·F/P_z ].   (8)
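The coefficient row of Eq. (8) can be assembled per surface point as sketched below, following the reconstruction of Eqs. (6)-(8) above; the ΔI/P_z terms of Eq. (6) are retained and drop out when dI = 0 is passed, and all names are illustrative.

```python
import numpy as np

def observation_row(P, g, dI, C, F):
    """One row [H]_(j) of the linearized measurement Eq. (7)/(8) for a
    visible surface point at P = (Px, Py, Pz) with image gradients
    g = (gX, gY), luminance difference dI and rotation center C."""
    Px, Py, Pz = P
    Cx, Cy, Cz = C
    gX, gY = g
    hTx = F * gX / Pz
    hTy = F * gY / Pz
    hTz = -((Px * gX + Py * gY) * F / Pz**2 + dI / Pz)
    hRx = -((Px * gX * (Py - Cy) + Py * gY * (Py - Cy)
             + Pz * gY * (Pz - Cz)) * F / Pz**2 + dI / Pz * (Py - Cy))
    hRy = ((Py * gY * (Px - Cx) + Px * gX * (Px - Cx)
            + Pz * gX * (Pz - Cz)) * F / Pz**2 + dI / Pz * (Px - Cx))
    hRz = -(gX * (Py - Cy) - gY * (Px - Cx)) * F / Pz
    return np.array([hTx, hTy, hTz, hRx, hRy, hRz])
```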

In order to get the unknown motion parameters, Eq. (7) is evaluated for each visible surface point of the model object. This yields the following linear system of equations:

ΔI^(1) = [H]_(1) · A^(m) + δI^(1),
ΔI^(2) = [H]_(2) · A^(m) + δI^(2),
ΔI^(3) = [H]_(3) · A^(m) + δI^(3),
ΔI^(4) = [H]_(4) · A^(m) + δI^(4),
⋮
ΔI^(N_SP^(m)) = [H]_(N_SP^(m)) · A^(m) + δI^(N_SP^(m)),   (9)


where N_SP^(m) is the total number of visible surface points of the model object m at time instant k. This linear system of equations can be written in the following compact matrix form:

ΔI = [H] · A^(m) + δI.   (10)

Modelling the luminance error δI^(j) for each surface point S^(j) by a zero-mean Gaussian stationary random process with variance σ²_δI = 1, the linear system of equations (10) can be solved using linear regression:

Â^(m) = ([H]^T · [H])^(−1) · [H]^T · ΔI,   (11)

E^(m) = A^(m) − Â^(m),   (12)

[P]^(m) = E[E^(m) · E^(m)T] = ([H]^T · [H])^(−1),   (13)

where Â^(m) are the estimated motion parameters, E^(m) the motion estimation error and [P]^(m) the covariance matrix of the estimation error.

In order to guarantee the robustness of the motion estimation algorithm, Eq. (7) has to be established only for surface points which do not degrade the performance of the linear regression by the assumed distribution of δI^(j) [7, 8, 14]. Therefore, a robust technique is used which exploits the linear signal model assumptions considered for motion estimation [23], see Eq. (5). Since the linear signal model depends on an accurate measurement of the local gradients, the smaller the gradient components, the higher the risk that these components are generated by noise. Thus, surface points with small local gradients have to be excluded from the linear regression. Secondly, since the used linear signal model is valid only for small displacements, surface points with a large luminance difference ΔI^(j) also have to be excluded from motion estimation. Thus, only surface points where the following threshold decisions apply are taken into account for motion estimation:

|g^(j)| ≫ 0,   (14)

|ΔI^(j)| < σ_ΔI,   (15)

where σ_ΔI represents the standard deviation of all residuals ΔI [13]:

σ_ΔI = sqrt( (1/N_SP^(m)) · Σ_{j=1}^{N_SP^(m)} (ΔI^(j))² ).   (16)

Surface points for which one of the inequalities (14) or (15) is not true are considered as outliers and are consequently removed from motion estimation. Surface points which are taken into account for motion estimation are called observation points and are denoted as O_k^(w), w = 1, …, N_OP^(m).

Finally, due to the linearization, motion parameters have to be estimated iteratively. After every iteration, the model object is moved according to Eq. (2) using the estimated motion parameters. A new set of motion equations is then established, giving new motion parameter updates. In case of convergence the motion parameter updates approach zero during the iterations.
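One iteration of this estimator, Eqs. (11)-(16), might look as follows; it reuses observation_row() from the previous sketch, and the absolute gradient threshold (here 10⁻³) is an assumed value, since the paper only requires |g^(j)| ≫ 0.

```python
import numpy as np

def estimate_motion(points, gradients, dI, C, F):
    """One iteration of the linear-regression motion estimator:
    discard outliers by the threshold tests (14)/(15), then solve
    dI = [H] A + noise in the least-squares sense (Eqs. (11), (13))."""
    H = np.array([observation_row(P, g, d, C, F)
                  for P, g, d in zip(points, gradients, dI)])
    sigma = np.sqrt(np.mean(dI ** 2))                 # Eq. (16)
    grad_mag = np.linalg.norm(gradients, axis=1)
    keep = (grad_mag > 1e-3) & (np.abs(dI) < sigma)   # Eqs. (14), (15)
    H, dI = H[keep], dI[keep]
    A, *_ = np.linalg.lstsq(H, dI, rcond=None)        # Eq. (11)
    P_cov = np.linalg.inv(H.T @ H)                    # Eq. (13)
    return A, P_cov
```

In an outer loop, the model object would be moved according to Eq. (2) with the estimate A after each iteration, until the updates approach zero.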

2.3.2. Motion estimation of articulated objects

For motion estimation of articulated objects, the spatially hierarchical coarse-to-fine approach proposed by Koch [19] is used in this contribution. The shape of the object-components is assumed to be rigid and known, but no spatial constraints between object-components are considered. On the first hierarchy level of this procedure, the object-components are treated as only one rigid model object in order to estimate and compensate the mean motion of the object-components. On the second hierarchy level, the motion of each rigid object-component is estimated using observation points located on the surface of the considered object-component only. This motion tends to be small because the mean object-component motion was estimated and compensated previously. This improves the reliability of the 3D motion estimation of each object-component. The procedure is iterated until the new motion parameter updates approach zero.

Motion estimation of articulated objects can be improved by considering spatial constraints between object-components. In [9, 21], spatial constraints are modelled using joints. For motion estimation, Holt [9] decomposes the object into simple articulated subparts. Each subpart contains a small number of object-components. Components forming a subpart are confined to motion within a plane (coplanar motion) and connected to each other by revolute joints [27]. A revolute joint allows only relative angular rotation between components about the revolute joint axis, which is perpendicular


to the motion plane. Motion estimation determines first the motion of the simplest subpart(s) and then propagates the analysis to the remaining subparts of the object. In [21], more sophisticated spatial constraints are used. Each subpart contains only one single object-component. Object-components are connected by spherical joints [27] instead of revolute joints. A spherical joint allows unrestricted relative angular rotations between two object-components. Motion estimation determines first the motion of the largest object-component without considering spatial constraints. Then, motion analysis is propagated to the remaining object-components taking into account the joints of the articulated object.

3. A new robust algorithm for 3D motion estimation of a small surface patch

In order to subdivide a model object represented by a mesh of triangles into flexibly connected object-components, object-articulation estimates first the 3D motion parameters of each visible triangle of the model object separately. For motion estimation, the algorithm described in Section 2.3.1 can be applied. However, due to the small surface area of a triangle, the probability of convergence to correct parameters will be low. This section presents an algorithm for motion estimation of a small surface patch which tries to achieve a higher probability of convergence than the algorithm described in Section 2.3.1. In order to increase the probability of

convergence, the proposed algorithm applies a more sophisticated luminance error model and a more robust technique. In addition, the surface area evaluated for motion estimation is increased.

3.1. Robust estimation algorithm

Fig. 5 shows a rigid model object with a surface patch s whose 3D motion shall be estimated. The 3D motion of a small surface patch s is described by the parameters A^(s) = (T_x^(s), T_y^(s), T_z^(s), R_x^(s), R_y^(s), R_z^(s))^T defining its translation and rotation in the 3D space. The surface patch s represents a surface of a 3D model object consisting of N_CP^(s) control points P_C^(1), P_C^(2), …, P_C^(N_CP^(s)) and q^(s) triangles. A surface patch may consist of q^(s) = 1 triangle only. An arbitrary point P^(i) on the surface of s is moved to its new position P'^(i) according to the motion equation (2) described in Section 2.3.1:

p’(i) = [Racy,] . (p(i) - c(s)) + c(s) + y-(s), (17)

with the translation vector TcS) = (TF’, Ti”‘, TI”‘)T, the surface patch center

(18)

and the rotation matrix [&,,I which is defined by the rotation angles Rz’, Ry(s), Rp’, around x-, y- and z-axes with the rotation center C(‘) according to

Eq. (4).


Fig. 5. Small surface patch s of a rigid 3D-model object and its 3D-motion.


In order to estimate the motion parameters, the motion estimation method minimizes the mean square luminance difference between a projection of the surface patch's luminance onto the image plane of the model camera and the corresponding luminance of the current image s_{k+1}. Therefore, the same assumptions and gradient method are used as in Section 2.3.1. However, at time instant k only surface points S^(v) = (P^(v), g^(v), I^(v)), v = 1, …, N_SP^(s), located on the surface patch s are considered. For each surface point S^(v) at time instant k projected onto the image plane at position p_k^(v), the luminance difference ΔI^(v) between image s'_k and s_{k+1} is related to motion according to the linearized measurement equation (7):

ΔI^(v) = [H]_(v) · A^(s) + δI^(v),   (19)

where A^(s) are the unknown motion parameters of the surface patch s, [H]_(v) is a (1 × 6) output matrix and δI^(v) the corresponding unknown luminance error at position p_k^(v).

In order to get the unknown motion parameters, Eq. (19) is evaluated for each surface point S^(v) of the surface patch s. This yields a linear system of equations. In this section, the luminance error δI^(v) is modeled by a zero-mean Gaussian stationary random process with variance σ²_{δI,v}. Thus, for each surface point S^(v) the same distribution for δI^(v) is assumed but with a distinct variance σ²_{δI,v}. In order to solve the linear system of equations considering this new luminance error model, a minimum variance estimator [4] has to be applied. For convenience, a minimum variance recursive estimator, i.e. Kalman filtering [4], is used. In this case, the equations of the Kalman filter can be written as follows:

[K]_(v) = [P]_(v−1) · [H]_(v)^T · ([H]_(v) · [P]_(v−1) · [H]_(v)^T + σ²_{δI,v})^(−1),   (20)

[P]_(v) = [P]_(v−1) − [K]_(v) · [H]_(v) · [P]_(v−1),   (21)

Â^(s)_(v) = Â^(s)_(v−1) + [K]_(v) · (ΔI^(v) − [H]_(v) · Â^(s)_(v−1)),   (22)

where Â^(s)_(v) and Â^(s)_(v−1) are the new and old prediction of the unknown motion parameters A^(s) corresponding to the vth and (v−1)th step of the Kalman filtering, respectively. [K]_(v) represents the correction matrix. [P]_(v−1) and [P]_(v) describe the old and the new covariance matrix of the estimation errors E^(s)_(v−1) and E^(s)_(v), respectively:

E^(s)_(v−1) = A^(s) − Â^(s)_(v−1)  →  [P]_(v−1) = E[E^(s)_(v−1) · E^(s)T_(v−1)],   (23)

E^(s)_(v) = A^(s) − Â^(s)_(v)  →  [P]_(v) = E[E^(s)_(v) · E^(s)T_(v)].   (24)

The Kalman filter is spatially applied to each surface point S^(v) of the surface patch one time only. Thus, the first and last iterations of the Kalman filtering correspond to the surface points S^(1) and S^(N_SP^(s)), respectively. In addition, no particular order is taken into account. For the first iteration of the Kalman filtering, Â^(s)_(0) and [P]_(0) are calculated applying Eq. (11) and Eq. (13), respectively. In this case [H] (see Eq. (10)) is obtained evaluating observation points on the surface patch s only.

In order to guarantee the robustness of the motion estimation algorithm, only surface points S^(v) which do not degrade the performance of the Kalman filtering by the assumed distribution of δI^(v) are used. Therefore, the recursive property of the Kalman filtering is exploited. In this new technique, only those observation points where the local luminance error model was found to be valid are used by the Kalman filtering. In order to find out whether the luminance error model is valid for a surface point S^(v), the mean square error before and after the current step of the Kalman filtering, MSE^(v)(Â^(s)_(v−1)) and MSE^(v)(Â^(s)_(v)), respectively, is evaluated. The luminance error model is considered to be valid if the following inequality applies:

MSE^(v)(Â^(s)_(v)) < MSE^(v)(Â^(s)_(v−1)).   (25)

If the luminance error model fails, the last prediction Â^(s)_(v−1) and the corresponding covariance matrix [P]_(v−1) are used for the next step v+1 of the Kalman filtering instead of the new prediction Â^(s)_(v) and [P]_(v). Thus, the current surface point S^(v) is excluded from motion estimation. Surface points where the luminance error model was found to be valid are called observation points and are denoted as O_k^(f), f = 1, …, N_OP^(s).
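A sketch of this robust recursion, Eqs. (20)-(25), for one scalar measurement follows; the MSE evaluation is left to a caller-supplied function, and all names are illustrative.

```python
import numpy as np

def kalman_update(A, P, h, dI, var_dI):
    """One step of the recursive minimum-variance estimator, Eqs. (20)-(22),
    for a scalar measurement dI = h.A + noise with noise variance var_dI
    (Section 3.3). h is the (6,) row [H]_(v)."""
    S = h @ P @ h + var_dI            # innovation variance (scalar)
    K = P @ h / S                     # correction (Kalman) gain, Eq. (20)
    A_new = A + K * (dI - h @ A)      # Eq. (22)
    P_new = P - np.outer(K, h) @ P    # Eq. (21)
    return A_new, P_new

def robust_kalman_step(A, P, h, dI, var_dI, mse_before, mse_of):
    """Robust step per Eq. (25): accept the Kalman update only if it
    lowers the mean square error; otherwise keep the old prediction
    (the surface point is then excluded from motion estimation)."""
    A_new, P_new = kalman_update(A, P, h, dI, var_dI)
    mse_after = mse_of(A_new)
    if mse_after < mse_before:        # Eq. (25)
        return A_new, P_new, mse_after
    return A, P, mse_before
```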


3.2. Preselection of surface points

Since the evaluation of the MSE is a very time consuming task, applying the test of Eq. (25) to all surface points of the surface patch increases the computational time. In order to solve this problem, Eq. (25) and the robust technique described in Section 2.3.1 are combined. Thus, Eq. (25) has to be evaluated only for those surface points which satisfy the following inequalities:

|g^(v)| ≫ 0,   (27)

|ΔI^(v)| < σ^(s)_ΔI,   (28)

where σ^(s)_ΔI represents the standard deviation of all residuals ΔI^(v):

σ^(s)_ΔI = sqrt( (1/N_SP^(s)) · Σ_{v=1}^{N_SP^(s)} (ΔI^(v))² ).   (29)

In case Eq. (27) or (28) does not apply, the surface point is excluded from motion estimation without evaluating Eq. (25). This reduces the computational time required by the proposed robust technique.

3.3. Luminance error model

In order to estimate the 3D motion parameters A^(s) of a surface patch s using the measurement equation (19), the luminance error δI^(v) is modelled by a random process. In Section 2.3.1, for each surface point S^(j), δI^(j) is modeled by a zero-mean Gaussian stationary random process with common variance σ²_δI = 1. In Section 3.1, for each surface point S^(v) the same distribution is assumed but with a distinct variance σ²_{δI,v}. This section describes a method to calculate the value of σ²_{δI,v} for an arbitrary surface point S^(v).

For each surface point S^(v) = (P^(v), g^(v), I^(v)) the luminance error model considers both the luminance error due to the shape estimation error ΔP^(v) of the 3D model object at position P^(v) and the luminance error due to camera noise. The shape error and the camera noise are supposed to be statistically independent. As a first approach, the shape error ΔP^(v) is modelled by a Gaussian stationary random process describing the shape error of the model object at position P^(v) in the x, y and z directions. These errors are assumed to be uncorrelated and stationary, with mean zero and variance σ²_{ΔP,v} = σ²_ΔP. The resulting covariance matrix can be written as follows:

[C_ΔP] = σ²_ΔP · [1 0 0; 0 1 0; 0 0 1],   σ²_ΔP = const.   (30)

In order to compute the luminance error due to a shape error ΔP^(v), only its projection Δp_g^(v) in the direction of the luminance gradient on the image plane is considered [1] (Fig. 6). Therefore, the shape error ΔP^(v) is mapped to a vector Δp^(v) on the image plane by using the following linear transformation of the model camera:

Δp^(v) = [T]^(v) · ΔP^(v),   [T]^(v) = (F/P_z^(v)) · [1  0  −P_x^(v)/P_z^(v);  0  1  −P_y^(v)/P_z^(v)].   (31)


Fig. 6. Projection of the shape estimation error ΔP^(v) of an observation point S^(v) in the direction of its luminance gradient g^(v) on the image plane. The surface point S^(v) is located on the 3D model object surface at the position P^(v) = (P_x^(v), P_y^(v), P_z^(v)).


Then, the vector Δp^(v) is projected onto the unit luminance gradient vector g^(v)T/|g^(v)| measured at point p_k^(v) in order to get Δp_g^(v):

Δp_g^(v) = (g^(v)T/|g^(v)|) · Δp^(v) = [M]^(v) · ΔP^(v),   [M]^(v) = (g^(v)T/|g^(v)|) · [T]^(v).   (32)

The resulting luminance error variance σ²_{ΔI_M,v} can be written as

σ²_{ΔI_M,v} = |g^(v)|² · [M]^(v) · [C_ΔP] · [M]^(v)T   (33)

            = (F²/(P_z^(v))⁴) · σ²_ΔP · [ (g_X^(v))² · ((P_z^(v))² + (P_x^(v))²)
              + 2 · g_X^(v) · g_Y^(v) · P_x^(v) · P_y^(v)
              + (g_Y^(v))² · ((P_z^(v))² + (P_y^(v))²) ].   (34)

The camera noise is supposed to be Gaussian, uncorrelated and zero-mean with variance σ²_C. Finally, the variance of the luminance error model is

σ²_{δI,v} = σ²_{ΔI_M,v} + σ²_C.   (35)
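Under the reconstruction of Eq. (34) above, the per-point variance reduces to a closed-form expression, sketched below; σ²_ΔP and σ²_C are model constants whose values the paper leaves to be chosen.

```python
def luminance_error_variance(P, g, F, var_shape, var_camera):
    """Variance of the luminance error at a surface point, Eqs. (33)-(35):
    shape-error term of Eq. (34) plus the camera-noise variance."""
    Px, Py, Pz = P
    gX, gY = g
    var_shape_term = (F**2 / Pz**4) * var_shape * (
        gX**2 * (Pz**2 + Px**2)
        + 2.0 * gX * gY * Px * Py
        + gY**2 * (Pz**2 + Py**2))        # Eq. (34)
    return var_shape_term + var_camera    # Eq. (35)
```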

3.4. Selected neighborhood for motion estimation

In order to improve the reliability of the proposed algorithm for 3D motion estimation of a small surface patch, the surface patch and its neighboring triangles are taken into account for 3D motion estimation, see Fig. 7. However, this will improve the reliability only if the surface points of the neighboring triangles and the surface points of the surface patch itself exhibit the same motion, i.e. belong to the same object-component. In order to minimize the probability that neighboring triangles from different object-components are chosen, for each surface patch the neighboring triangles are selected by a coarse 2D segmentation of a displacement vector field in the image plane. The regions found by the coarse 2D segmentation represent approximately the silhouettes of the object-components. Then, in order to estimate the 3D motion parameters of a surface patch, a neighboring triangle will be used only if both the surface patch and the neighboring triangle belong to the same region found by the coarse 2D segmentation. In this contribution, the coarse 2D segmentation is based on a 2D segmentation of a pel-wise displacement vector field (DVF) into regions of homogeneous displacement. The 2D segmentation is carried out only inside the silhouette of the 3D model object. The DVF is estimated by hierarchical block matching [2] using the current image s_{k+1} and the previous image s_k. For 2D segmentation, a Maximum Likelihood Thresholding method based on population mixture models [18, 20, 26] is used. The resulting regions are then improved by considering linear gradients near their boundaries.

Fig. 7. Selection of a neighborhood around a small surface patch, i.e. triangle, using a coarse 2D segmentation. The coarse 2D segmentation is based on a 2D segmentation of a pel-wise displacement vector field (DVF) into regions of homogeneous displacement, carried out only inside the silhouette of the 3D model object, using a Maximum Likelihood Thresholding method based on population mixture models. The selected neighborhood is used to increase the reliability of the estimates.
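The thresholding step can be sketched as a two-population minimum-cost search over candidate thresholds; this is a common formulation of maximum-likelihood thresholding on Gaussian mixture populations and only approximates the exact criterion of [18, 20, 26].

```python
import numpy as np

def ml_threshold(values, n_bins=64):
    """Maximum-likelihood-style threshold splitting a set of displacement
    magnitudes into two Gaussian populations (a sketch, not the paper's
    exact criterion)."""
    v = np.asarray(values, dtype=float).ravel()
    candidates = np.linspace(v.min(), v.max(), n_bins)[1:-1]
    best_t, best_cost = None, np.inf
    for t in candidates:
        lo, hi = v[v <= t], v[v > t]
        if lo.size < 2 or hi.size < 2:
            continue
        p1, p2 = lo.size / v.size, hi.size / v.size
        s1, s2 = lo.std() + 1e-9, hi.std() + 1e-9
        # negative log-likelihood of the two-population mixture model
        cost = (p1 * np.log(s1) + p2 * np.log(s2)
                - p1 * np.log(p1) - p2 * np.log(p2))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```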


4. Object-articulation

As was pointed out earlier, 3D objects in the real world may be articulated, i.e. may consist of several flexibly connected rigid object-components. For coding, the parameter sets describing the shape of real articulated objects have to be estimated. For shape estimation, the shape of each real articulated object is first modelled by only one rigid model object represented by a mesh of triangles. Then, the rigid 3D model object is subdivided into the flexibly connected rigid model object-components by object-articulation.

For object-articulation, the rigidity constraint imposed on each rigid object-component is exploited. The rigidity constraint states that the distance between any pair of particles of a rigid object-component remains constant at all times and configurations [27]. According to this constraint, the motion of a rigid model object-component represented by a rigid mesh of triangles can be completely described by 6 motion parameters, see Section 2.3.1. In case of a rigid 3D model object representing a real articulated object, the triangles of its wireframe covering the visible surface of one unknown object-component have to exhibit similar 3D motion parameters. Therefore, for object-articulation neighboring triangles which exhibit similar 3D motion parameters are clustered into patches. In an ideal case, these patches will represent the complete visible surface of the moving object-components of the articulated object. However, due to the unreliability of motion estimates for single triangles, only parts can be found. In addition, if an object-component was not moving in the time interval considered for motion estimation, it will not be articulated. Therefore, in order to find all object-components completely, a frame-to-frame clustering method which considers clustering results of previous frames is proposed and presented in this section.

4.1. Frame-to-frame clustering method

The frame-to-frame clustering method consists of four steps which are applied to each frame of the image sequence. In the first step, 3D motion estimation and compensation for the 3D model object is carried out. In the second step, the 3D motion parameters of each visible triangle of the 3D model object are estimated. The 3D motion parameters of neighboring triangles are then compared and triangles with similar 3D motion are clustered into patches, see Fig. 8. Two neighboring triangles are clustered into the same patch if the following criterion applies:

|W_c^(s1) − W_c^(s2)| / |W_c^(s1) + W_c^(s2)| ≤ th,   ∀ W = T, R and c = x, y, z,   (36)

where T_x^(si), T_y^(si), T_z^(si), R_x^(si), R_y^(si) and R_z^(si) are the estimated motion parameters A^(si) of the triangles i = 1, 2. The motion parameters are first referred to a common point before the criterion of Eq. (36) is evaluated. The threshold th is considered constant during the sequence; its heuristic value for test sequences with spatial resolution CIF (10 Hz) is 0.4.


Fig. 8. Clustering of neighboring triangles which exhibit similar 3D motion parameters into a patch. This corresponds to the second step of the frame-to-frame clustering method. Patches are indicated by different grey levels.


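The join criterion of Eq. (36) translates directly into code; the small epsilon guarding against a zero denominator is an implementation assumption.

```python
import numpy as np

def similar_motion(A1, A2, th=0.4):
    """Join criterion of Eq. (36): two neighboring triangles with motion
    parameters A1, A2 = (Tx, Ty, Tz, Rx, Ry, Rz), referred to a common
    point, are clustered if every normalized parameter difference is
    below the threshold th (0.4 for CIF, 10 Hz)."""
    A1, A2 = np.asarray(A1), np.asarray(A2)
    eps = 1e-12  # guard against a zero denominator
    ratio = np.abs(A1 - A2) / (np.abs(A1 + A2) + eps)
    return bool(np.all(ratio <= th))
```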

The third step tries to cluster neighboring patches found by the second step into larger patches. Because motion estimation for large patches is more reliable, the 3D motion of each patch is estimated, and the 3D motions of neighboring patches are then compared in order to determine whether they can be joined to form a new larger patch (Fig. 9). For 3D motion estimation the method described in Section 3 is used without neighborhood. In order to compare neighboring patches, the same criterion (36) with th = 0.3 is used. The patches found by the third step at time instant k+1 are denoted as Patch^(u)_{k+1} (u = 1, 2, …).

Fig. 9. Clustering of patches into a larger patch. This corresponds to the third step of the frame-to-frame clustering method.

In the fourth step, clustering results obtained in previous frames are updated considering the clustering results obtained by the current analysis. Therefore, a patch-membership-memory is attached to the triangles of the wire frame, i.e. the triangle's membership to a patch is stored with each triangle. The patches obtained by the third step of the current frame are used either to define a new patch in the patch-memory or to update patches stored already in the patch-memory. The patches in memory are denoted as Patch^(h)_mem (h = 1, 2, …). At the beginning of an image sequence the patch-memory is empty.

A patch Patch^(u)_{k+1} found by the current analysis is stored as a new patch in the patch-memory if it does not share triangles with any of the patches stored already in memory and if it improves motion compensation. In order to determine if a patch Patch^(u)_{k+1} improves motion compensation, a criterion based on the evaluation of the MSE after motion compensation of the following patches is used:
1. the patch Patch^(u)_{k+1};
2. the patch Patch_a consisting of all visible triangles of the model object which do not belong to Patch^(u)_{k+1};
3. the patch Patch_b consisting of all visible triangles of the model object.

For each patch, the 3D motion parameters are first estimated. For motion estimation the algorithm proposed in Section 3 is used without neighborhood. Each patch is then temporarily motion compensated and the MSE between a projection of its luminance onto the image plane of the model camera and the corresponding luminance of the current image s_{k+1} is measured. After measuring, the patch is moved back to its old position. A patch Patch^(u)_{k+1} improves motion compensation if the following inequality applies:

MSE^u(Â^(u)) + MSE^a(Â^(a)) < MSE^b(Â^(b)),   (37)

Σ_{∀S^(j) ∈ Patch^(u)_{k+1}, Â^(u)} (ΔI^(j))² + Σ_{∀S^(j) ∈ Patch_a, Â^(a)} (ΔI^(j))²
  ≤ Σ_{∀S^(j) ∈ Patch_b, Â^(b)} (ΔI^(j))²,   (38)


where Â^(u), Â^(a) and Â^(b) are the estimated motion parameters of Patch^(u)_{k+1}, Patch_a and Patch_b, respectively, and MSE^u(Â^(u)), MSE^a(Â^(a)) and MSE^b(Â^(b)) the corresponding mean square errors measured after motion compensation.
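The acceptance test of Eqs. (37)/(38) compares sums of squared luminance residuals after temporary motion compensation; a minimal sketch:

```python
import numpy as np

def patch_mse(dI_residuals):
    """Mean square luminance residual of a temporarily motion-compensated
    patch (cf. Eq. (38), which compares sums of squared residuals)."""
    dI = np.asarray(dI_residuals, dtype=float)
    return float(np.mean(dI ** 2))

def improves_compensation(mse_u, mse_a, mse_b):
    """Eq. (37): a candidate patch is stored in the patch-memory if
    compensating it separately (mse_u) together with the rest of the
    object (mse_a) beats compensating the whole object as one rigid
    patch (mse_b)."""
    return mse_u + mse_a < mse_b
```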

4.2. Patch-updating

For updating of the patches in the patch-memory, a correlation-like measure for each patch Patch^(u)_{k+1} found by the current analysis with respect to each patch in memory is evaluated. If a patch Patch^(u)_{k+1} and a patch in memory Patch^(h)_mem are highly correlated, the patch Patch^(u)_{k+1} is accepted for patch-updating. Patch-updating clusters triangles of the patch Patch^(u)_{k+1} to the patch in memory Patch^(h)_mem, see Fig. 10.

The correlation of two patches is measured by their similarity in terms of position and size. Two patches have a similar position if they share common triangles. Because the mesh of triangles is homogeneous (see Section 2.1.1), the size of a patch is considered to be the number of triangles forming its surface. For measuring the correlation, two cases are distinguished (see Fig. 11):

1. If the size N_cur of the patch Patch^(u)_{k+1} is larger than the size N_mem of the patch in memory Patch^(h)_mem, the correlation of both patches is considered to be high if the following inequalities are satisfied:

|N_mem − N̄_shared| / N_mem > th1,   (39)

|N_mem − N_shared| / N_mem < th2.   (40)

2. If N_cur ≤ N_mem, the patches Patch^(u)_{k+1} and Patch^(h)_mem are considered to be highly correlated if the following inequality applies:

|N_mem − N_shared| / N_mem < th3.   (41)

N_shared and N̄_shared are the numbers of shared and nonshared triangles of Patch^(u)_{k+1} with respect to Patch^(h)_mem, where N_cur = N_shared + N̄_shared, see Fig. 11. The heuristic values of th1, th2 and th3 for test sequences with spatial resolution CIF (10 Hz) are 0.7, 0.2 and 0.4, respectively.
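The correlation test of Eqs. (39)-(41), as reconstructed above, in a compact sketch (sizes are triangle counts):

```python
def highly_correlated(n_cur, n_mem, n_shared, th1=0.7, th2=0.2, th3=0.4):
    """Correlation test of Eqs. (39)-(41). n_cur: size of the patch found
    by the current analysis; n_mem: size of the patch in memory;
    n_shared: number of triangles they share. Thresholds are the
    heuristic CIF (10 Hz) values given in the text."""
    n_nonshared = n_cur - n_shared
    if n_cur > n_mem:
        return (abs(n_mem - n_nonshared) / n_mem > th1    # Eq. (39)
                and abs(n_mem - n_shared) / n_mem < th2)  # Eq. (40)
    return abs(n_mem - n_shared) / n_mem < th3            # Eq. (41)
```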

For patch-updating, the proximity of the noncommon triangles of an accepted patch Patch^(u)_{k+1} to the patch in memory Patch^(h)_mem is taken into account (see Figs. 12(a) and (b)).


Fig. 10. Updating of a patch in memory Patch^(h)_mem using a patch Patch^(u)_{k+1} found by the current frame k+1. In this example, all noncommon triangles of the patch Patch^(u)_{k+1} were clustered to the patch in memory.



Fig. 11. Surfaces evaluated by the acceptance criterion for updating a patch in memory Patch^(h)_mem using a patch Patch^(u)_{k+1} found by the current frame k+1: the patch in memory (size N_mem), the patch found by the current analysis (size N_cur), their common surface (size N_shared) and the surface of Patch^(u)_{k+1} outside the patch in memory (size N̄_shared).

For patch-updating, the proximity of the noncommon triangles of an accepted patch Patch^{(u)}_{k+1} to the patch in memory Patch^{(v)}_{mem} is taken into account (see Figs. 12(a) and (b)). First, noncommon triangles close to, i.e. adjacent to, Patch^{(v)}_{mem} and the triangles of Patch^{(v)}_{mem} are considered to belong to the same object-component, see Fig. 12(c). Therefore, they are immediately clustered to Patch^{(v)}_{mem}. Secondly, noncommon triangles which are far from, i.e. not adjacent to, Patch^{(v)}_{mem} are considered to possibly belong to different object-components. Therefore, they are first evaluated in order to determine whether they should be clustered too. For evaluation, they are clustered into uncertain patches denoted as Patch^{(r)}_{unc}, r = 1, ..., N_unc, see Fig. 12(d).

Fig. 12. Detection and handling of uncertain patches: (a) patch in memory Patch^{(v)}_{mem}; (b) patch found by the current analysis Patch^{(u)}_{k+1}; (c) adjacent triangles; (d) uncertain patch Patch^{(r)}_{unc}; (e) Patch_c; (f) Patch_e.

An uncertain patch Patch^{(r)}_{unc} is excluded from patch-updating if it has at least one triangle belonging to a different object-component. In order to determine whether an uncertain patch Patch^{(r)}_{unc} contains triangles belonging to different object-components, a criterion is used which evaluates the MSE after motion compensation of the following patches:

1. The patch Patch_c, consisting of the triangles of the patch in memory Patch^{(v)}_{mem} and the triangles of the patch Patch^{(u)}_{k+1} adjacent to Patch^{(v)}_{mem}, see Fig. 12(e).



2. The patch Patch_d, consisting of all visible triangles of the model object which do not belong to Patch_c.

3. The patch Patch_e, consisting of the triangles of the patch Patch_c and the triangles of the uncertain patch Patch^{(r)}_{unc}, see Fig. 12(f).

4. The patch Patch_f, consisting of all visible triangles of the model object which do not belong to Patch_e.

For each patch, the 3D motion parameters are first estimated applying the algorithm proposed in Section 3 without neighborhood. Each patch is then temporarily motion compensated, and the MSE between a projection of its luminance onto the image plane of the model camera and the corresponding luminance of the current image s_{k+1} is measured. After measuring, the patch is moved back to its old position. Let A^{(c)}, A^{(d)}, A^{(e)}, A^{(f)} be the estimated motion parameters and MSE^c(A^{(c)}), MSE^d(A^{(d)}), MSE^e(A^{(e)}) and MSE^f(A^{(f)}) be the measured mean square errors after motion compensation of the patches Patch_c, Patch_d, Patch_e and Patch_f, respectively. Thus, all triangles of an uncertain patch Patch^{(r)}_{unc} are

also clustered to the patch in memory Patch^{(v)}_{mem} if the following inequality applies:

MSE^e(A^{(e)}) + MSE^f(A^{(f)}) \le MSE^c(A^{(c)}) + MSE^d(A^{(d)}),   (42)

\sum_{\forall S^{(j)} \in Patch_e} (\Delta I^{(j)})^2 \big|_{A^{(e)}} + \sum_{\forall S^{(j)} \in Patch_f} (\Delta I^{(j)})^2 \big|_{A^{(f)}} \le \sum_{\forall S^{(j)} \in Patch_c} (\Delta I^{(j)})^2 \big|_{A^{(c)}} + \sum_{\forall S^{(j)} \in Patch_d} (\Delta I^{(j)})^2 \big|_{A^{(d)}}.   (43)

Otherwise, the uncertain patch Patch^{(r)}_{unc} is not considered for patch-updating.

4.3. Detection of object-components from the patch-memory and articulation

As soon as a patch in the patch-memory Patch^{(v)}_{mem} is not changed during more than n successive updates, it is detected as an object-component if it improves 3D motion compensation, see Fig. 13.

Fig. 13. Detection of an object-component from the patch-memory. Since the patch in memory Patch^{(v)}_{mem} was not changed during the successive updates k+4 and k+5, it was detected as an object-component at time instant k+6.



The heuristic value of n for test sequences with spatial resolution CIF (10 Hz) is 2. In order to determine whether a patch Patch^{(v)}_{mem} improves motion compensation, Eq. (37) is applied.

The 3D model object is then articulated into two model object-components. The articulation does not change the topology of the wire frame, but assigns the corresponding control points to the new model object-component. The model object-components remain flexibly connected to each other. Each object-component is described using three individual sets of parameters defining its motion, shape and color. In the following frames, the model object-components may be articulated into further object-components.

5. Parameter coding

The task of parameter coding is the efficient coding of the parameter sets motion, shape and color provided by image analysis. For MC-objects, motion and shape parameters are coded as in [23]. The articulation parameters are losslessly coded. Since they have to be transmitted only once for each articulation, the additional bit-rate required is negligible. For MF-objects, color parameters and 2D shape parameters are also coded as in [23]. Motion parameters of all MC-objects are transmitted first. Then, shape parameters of MC-objects are transmitted. Finally, the shape and color parameters of MF-objects are transmitted.

6. Experimental results

6.1. Reliability of 3D motion estimation

6.1.1. Using a synthetically generated image sequence

To examine the reliability, i.e. the probability of convergence, and the accuracy of the developed algorithm for 3D motion estimation of a small surface patch, a synthetically generated image sequence is used. Each frame of this sequence was generated by moving a 3D model object m one pel in both the x and y directions and projecting its color parameters onto the image plane of a model camera. Here, a single triangle builds a small surface patch.

To evaluate the probability of convergence, the maximum norm of the error of the estimated translation parameters \|e_T^{(w)}\| and the maximum norm of the error of the estimated rotation parameters \|e_R^{(w)}\| are used:

\|e_T^{(w)}\| = \max\{ |T_x^{(m)} - \hat{T}_x^{(w)}|, |T_y^{(m)} - \hat{T}_y^{(w)}|, |T_z^{(m)} - \hat{T}_z^{(w)}| \},   (44)

\|e_R^{(w)}\| = \max\{ |R_x^{(m)} - \hat{R}_x^{(w)}|, |R_y^{(m)} - \hat{R}_y^{(w)}|, |R_z^{(m)} - \hat{R}_z^{(w)}| \}.   (45)

Here, A^{(m)} = (T_x^{(m)}, T_y^{(m)}, T_z^{(m)}, R_x^{(m)}, R_y^{(m)}, R_z^{(m)})^T are the true motion parameters which were used for sequence generation, and \hat{A}^{(w)} = (\hat{T}_x^{(w)}, \hat{T}_y^{(w)}, \hat{T}_z^{(w)}, \hat{R}_x^{(w)}, \hat{R}_y^{(w)}, \hat{R}_z^{(w)})^T are the estimated motion parameters of a visible triangle TRI^{(w)}, w = 1, ..., 208, of the model object m. In order to examine the probability of convergence, it is assumed that by motion estimation of an arbitrary triangle TRI^{(w)} the range of convergence is reached if the values of \|e_T^{(w)}\| and \|e_R^{(w)}\| are smaller than the heuristic thresholds th_T = 0.5 pel and th_R = 0.5 degree, respectively.

For the experiment, only the first two frames of the synthetically generated image sequence are used. In addition, the camera noise is considered to be zero; thus, the luminance error results from the shape error only. Then, for each visible triangle TRI^{(w)} of the model object, the 3D motion parameters \hat{A}^{(w)} are estimated. For motion estimation, both the proposed algorithm described in Section 3 (algorithm 1) and the algorithm described in Section 2.3.1 (algorithm 2) are applied separately.

Let N_1 and N_2 be the numbers of triangles for which the range of convergence was reached by algorithm 1 and algorithm 2, respectively. Considering that the model object consists of 208 visible triangles, the probability of convergence can be measured using the following equations:

P_1 = N_1 / 208   and   P_2 = N_2 / 208,

where P_1 applies to algorithm 1 and P_2 to algorithm 2.



Experimental results show that the probability of convergence is P_1 = 0.8076 for algorithm 1 and P_2 = 0.1057 for algorithm 2.

The proposed algorithm for motion estimation fails to reach the range of convergence particularly for triangles with weak texture, because in this case both the spatial luminance gradients and the reliability of the estimates are low. In addition, in the case of small objects, i.e. objects which cover small image regions, the probability of convergence to correct motion parameters of the proposed algorithm becomes low.

To evaluate the accuracy, the estimation error variance of each estimate is used. Experimental results for each parameter are shown in Table 1. The average estimation error variance of the translation parameters was found to be 2.7694 pel^2 for algorithm 1 and 25.8495 pel^2 for algorithm 2. The average estimation error variance of the rotation parameters was found to be 0.8416 degree^2 for algorithm 1 and 3.7033 degree^2 for algorithm 2.

Since the criterion for estimating 3D motion is based on the minimization of the mean square luminance difference (MSE), the improvement in probability of convergence and in accuracy can also be evaluated by comparing the MSE after motion compensation for each surface patch of the 3D model object:

G = -10 \log_{10} ( MSE1 / MSE2 ).   (46)

MSE1 applies to algorithm 1 and MSE2 to algorithm 2; G represents the gain in MSE achieved by algorithm 1. Experiments using the first two images of the synthetic sequence show an average gain G over all visible triangles of the model object of 10.2710 dB. Fig. 14 shows MSE1 and MSE2 for all 208 triangles of the 3D model object.

6.1.2. Using a real image sequence

Results using the 1st and the 2nd images of the real test sequence 'Claire' with spatial resolution CIF and a frame rate of 10 Hz show an average gain G of 4.4734 dB in MSE as a result of using algorithm 1 instead of algorithm 2. Here, the 3D model object was automatically generated using the algorithm described in Section 2.1.1. Fig. 15 shows MSE1 and MSE2 for all 149 triangles of the 3D model object. Using the first 10 frames of the same image sequence, the average gain G was found to be 3.531 dB.

6.2. OBASC coding efficiency

In this section, OBASC_R3D and OBASC_R3D extended with the developed algorithm for object-articulation (OBASC*_R3D) are applied to the test sequences Claire [6] and Miss America [5] with spatial resolution CIF and a reduced frame rate of 10 Hz. OBASC_R3D and OBASC*_R3D differ only with respect to object-articulation. For parameter coding, a data rate of approximately 64 kbit/s is used.

Table 1
Mean of the estimates and the estimation error variances of the proposed algorithm (algorithm 1) and the algorithm described in [23] (algorithm 2), using the first two images of the synthetically generated sequence. Rotation parameters are given in degree and degree^2, translation parameters in pel and pel^2

Parameter                                   R_x      R_y      R_z      T_x      T_y      T_z
True value                                  0.01     0.01     0.01     1.0      1.0      0.01
Mean of the estimate, algorithm 1           0.0123   0.0056   0.0123   0.9539   1.0485   0.4171
Estimation error variance, algorithm 1      0.8745   1.5620   0.0884   1.8972   1.3036   5.1075
Mean of the estimate, algorithm 2           0.4163   0.4227   0.0465   0.9099   2.2862   16.2321
Estimation error variance, algorithm 2      5.4063   4.1201   1.5836   5.1500   6.4887   65.9098




Fig. 14. Mean square luminance error (MSE) after motion compensation for each triangle of the 3D model object using the first two frames of the synthetically generated image sequence. For 3D motion estimation, the proposed algorithm (algorithm 1) and the algorithm described in [23] (algorithm 2) were applied. MSE1 applies to algorithm 1 and MSE2 to algorithm 2.


Fig. 15. Mean square luminance error (MSE) after motion compensation for each triangle of the 3D model object using the first two frames of the real image sequence 'Claire' (CIF, 10 Hz). The 3D model object was automatically generated using the algorithm described in Section 2.1.1. For 3D motion estimation, the proposed algorithm (algorithm 1) and the algorithm described in [23] (algorithm 2) were applied. MSE1 applies to algorithm 1 and MSE2 to algorithm 2.

In addition, no bit-rate control is implemented. In the experiment, the allowed noise level for detection of model failures was set to 6/255. Color parameters of model failures are coded with a PSNR of 36 dB. In all experiments, the coders are initialized using the first original image of the sequence.

Experimental results show that OBASC*_R3D achieves a realistic articulation of the model object 'body' into flexibly connected model object-components 'head' and 'shoulders' without a priori knowledge about the scene content. For the test sequences 'Claire', 'Akiyo' and 'Miss America', the flexibly connected object-components were found on average after 6 frames, see Fig. 16.

The area of model failures obtained by OBASC_R3D is on average below 4% of the image area [23]. The average area of model failures obtained by OBASC*_R3D is 3% of the image area. This reduction is achieved solely due to the more realistic object-articulation, see Fig. 17.

Table 2 compares the average bit-rates for different parameter sets defining motion, shape and




Fig. 16. Object articulation achieved applying OBASC*_R3D to (a) test sequence 'Claire' [6], (b) test sequence 'Akiyo' (MPEG-4 class A) and (c) test sequence 'Miss America' [5]. The flexibly connected object-components 'head' and 'shoulders' were found on average after 6 images. No knowledge about the scene content was used.


Fig. 17. Area of model failures obtained by OBASC_R3D and OBASC*_R3D in percent of the image area for the test sequence Claire. The total image area is 101376 pel. The average area of model failures is 3.7% of the image area for OBASC_R3D and 2.9% for OBASC*_R3D.

Table 2
Average bit-rate of parameter sets for OBASC_R3D and OBASC*_R3D (all figures in bit/frame). Both OBASC_R3D and OBASC*_R3D use the same algorithm for detection of model failures. The bit-rate for coding of color parameters is 1.2 bit/pel

Coder          Motion   MC shape   MF shape   Sum shape   MF color   Sum
OBASC_R3D      200      500        1150       1650        4500       6350
OBASC*_R3D     200      500        1000       1500        3600       5300

color obtained by OBASC_R3D and OBASC*_R3D. For both OBASC_R3D and OBASC*_R3D, coding of motion parameters requires 200 bit/frame. OBASC*_R3D decreases the bit-rate for coding of the shape parameters of MF-objects. This is due to the smaller size of the MF-objects achieved by OBASC*_R3D. Using 1.2 bit/pel for coding of color parameters, the overall bit-rate is reduced from 6350 bit/frame to 5300 bit/frame. Since the same image quality of the encoded color parameters, measured by SNR = 36 dB, was fixed for both coders, OBASC*_R3D is superior to OBASC_R3D for coding of image sequences at low data rates.

Fig. 18 shows the 33rd decoded frame of the test sequence Claire obtained by OBASC_R3D and



Fig. 18. 33rd decoded frame of the test sequence Claire using a data rate of (a) 64 kbit/s for OBASC_R3D and (b) 53.5 kbit/s for OBASC*_R3D.

OBASC*_R3D. Subjectively, there is no difference between the two decoded image sequences, although OBASC*_R3D requires 16% fewer bits for coding.

6.3. Increase of the computational complexity

In order to evaluate the additional complexity introduced by the new shape estimation algorithm for articulated objects over the previous work [3], the computation time is evaluated. Experimental results applying the test sequence Claire with spatial resolution CIF and a reduced frame rate of 10 Hz show that the computation time increases on average by less than 50% per frame. A Sun SPARCstation 20 was used for the experiment. This increase in computation time is due to the more sophisticated 3D motion estimation technique used for motion estimation of triangles.

7. Conclusions

For transmission of moving images at very low bit-rates, object-based analysis-synthesis coding using the source model of 'moving articulated 3D objects' is investigated. For coding, the parameter sets describing the object-components have to be estimated. In order to estimate the shape of object-components, three steps are applied: shape-initialization, object-articulation and shape-adaptation. In this contribution, a new algorithm for object-articulation has been developed.

The goal of object-articulation is the subdivision of a rigid 3D model object represented by a mesh of triangles into flexibly connected model object-components. Therefore, 3D motion parameters are estimated separately for each triangle of the model object, and neighboring triangles which exhibit similar 3D motion parameters are clustered into patches.

For estimation of the 3D motion parameters of a triangle, it is assumed that the shape of the real object has already been estimated by shape-initialization. For motion estimation, a minimum variance recursive estimation, i.e. Kalman filtering, is carried out using a more sophisticated luminance error model. The luminance error model considers both the luminance error due to the shape estimation error of the 3D model object and the luminance error due to the camera noise. The shape error and the camera noise are assumed to be statistically independent. The shape error is modelled by a Gaussian stationary random process describing



the shape error in the x, y and z directions. The camera noise is assumed to be Gaussian, uncorrelated and zero-mean. For improving the reliability of the 3D motion estimation of a triangle, selected neighboring triangles around it are also evaluated by the estimation algorithm. In addition, a more robust technique is applied, which exploits the recursive property of the Kalman filtering. For each triangle and its selected neighboring triangles, the estimation method minimizes the mean square luminance difference between a projection of their luminance component onto the image plane of a model camera and the corresponding luminance of the current image to be analyzed.

In order to measure the reliability of the 3D motion estimation of a triangle, the probability of convergence to correct estimates was used. Experiments using a synthetically generated image sequence show a probability of convergence of 0.8076. Compared with the basic algorithm described by Ostermann [23], the probability of convergence increases from 0.1057 to 0.8076. Furthermore, the accuracy of both algorithms has been compared. The accuracy of the estimates is measured by the estimation error variance. Experimental results show that the average estimation error variance of the translation parameters improves from 25.8495 to 2.7694 pel^2 and the average estimation error variance of the rotation parameters improves from 3.7033 to 0.8416 degree^2.

For clustering of neighboring triangles, a frame-to-frame clustering method which considers the clustering results of previous frames has been developed. For each frame of the image sequence, this method clusters neighboring triangles which exhibit similar 3D motion into patches. A patch-memory is attached to the triangles of the wire frame, i.e. each triangle stores its membership of a patch. The patches obtained from the current frame are used either to define a new patch in the patch-memory or to update patches already stored in the patch-memory. At the beginning of an image sequence, the patch-memory is cleared. As soon as a patch in the patch-memory is not changed during more than two successive updates, it is detected as an object-component if it improves 3D motion

compensation. The performance of the clustering algorithm has been evaluated in combination with the object-articulation in an OBASC scheme.

The developed algorithm for object-articulation has been incorporated in the image analysis of OBASC_R3D [23]. Typical 'head and shoulders' videophone test sequences with spatial resolution CIF and a reduced frame rate of 10 Hz have been applied. Compared to OBASC_R3D without the developed algorithm for object-articulation, the average area of model failures decreases from 4% to 3%. Maintaining the same picture quality, measured by SNR = 36 dB, this reduction of the average area of model failures corresponds to a reduction of the transmission rate from 63.5 to 53 kbit/s. The simulation results show that a realistic object articulation of the model object 'body' into flexibly connected model object-components 'head' and 'shoulders' can be achieved without a priori knowledge about the scene content. The flexibly connected object-components were found on average after 6 frames.

Until now, the proposed algorithm for shape estimation of articulated objects has been investigated with typical 'head and shoulders' video test sequences, where the object-components are large and no occlusions occur. In the case of small object-components consisting of only a few picture elements, shape estimation can fail, because the probability of convergence to correct motion parameters of the applied estimation algorithm becomes low. Therefore, the proposed algorithm for shape estimation of articulated objects has to be extended to consider small object-components and mutual occlusion.

At present, this extended algorithm is being developed. In order to consider small object-components, the 2D motion parameters T_x, T_y, R_z and the displacement vector D = (D_x, D_y)^T of each triangle are also estimated. For clustering, in addition to the 3D motion parameters of each triangle, the 2D motion parameters and the displacement vector as well as the color of each triangle are taken into account. In case of occlusion, the shape of the object-components in the foreground and the shape of the occluded object-components are represented by separate wireframes.



Acknowledgements

The author wishes to thank Prof. Musmann for encouraging this work. Furthermore, the author wishes to thank Dr.-Ing. J. Ostermann for his software support and Dipl.-Ing. J. Stauder and Dipl.-Ing. M. Kampmann for fruitful discussions on image analysis.


References

[1] R. Berger, Stereoskopische Bewegungsschätzung unter Berücksichtigung einer fehlerbehafteten Objektgeometrie, Diploma Thesis, University of Hannover, Germany, 1993.
[2] M. Bierling, "Displacement estimation by hierarchical blockmatching", Proc. 3rd SPIE Symp. on Visual Communications and Image Processing, Cambridge, November 1988, pp. 942-951.
[3] H. Busch, "Subdividing non rigid 3D objects into quasi rigid parts", Proc. IEE 3rd Internat. Conf. on Image Processing and Applications, IEE Publ. 307, Warwick, UK, July 1989, pp. 14.
[4] K. Brammer and G. Siffling, Kalman-Bucy-Filter - Deterministische Beobachtung und stochastische Filterung, Oldenbourg, München, 1985, Chapter 2, pp. 60ff.
[5] British Telecom Research Lab (BTRL), "Test sequence Miss America, CIF, 10 Hz, 50 frames", Martlesham, Great Britain.
[6] Centre National d'Etudes des Télécommunications (CNET), "Test sequence Claire, CIF, 10 Hz, 156 frames", Paris, France.
[7] D.C. Hoaglin, F. Mosteller and J.W. Tukey, Understanding Robust and Exploratory Data Analysis, Wiley, New York, 1983.
[8] R.V. Hogg, An Introduction to Robust Estimation, Academic Press, New York, 1979.
[9] R. Holt, A. Netravali, T. Huang and R. Qian, "Determining articulated motion from perspective views: A decomposition approach", Proc. IEEE Workshop on Motion of Non-Rigid and Articulated Objects, Austin, Texas, 11-12 November 1994, pp. 126-137.
[10] M. Hötter, "Predictive contour coding for an object-oriented analysis-synthesis coder", IEEE Internat. Symp. on Information Theory, San Diego, CA, January 1990, p. 75.
[11] M. Hötter, "Object-oriented analysis-synthesis coding based on moving two-dimensional objects", Signal Processing: Image Communication, Vol. 2, No. 4, December 1990, pp. 409-428.
[12] M. Hötter, "Optimization and efficiency of an object-oriented analysis-synthesis coder", IEEE Trans. Circuits Systems Video Technol., Vol. 4, No. 2, April 1994.
[13] M. Hötter and R. Thoma, "Image segmentation based on object oriented mapping parameter estimation", Signal Processing, Vol. 15, No. 3, October 1988, pp. 315-334.
[14] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[15] M. Kampmann and J. Ostermann, "Automatic adaptation of a face model in a layered coder with an object-based analysis-synthesis layer and a knowledge-based layer", Signal Processing: Image Communication, Vol. 9, No. 3, March 1997, pp. 201-220.
[16] F. Kappei and G. Heipel, "3D model based image coding", Picture Coding Symposium (PCS '88), Torino, Italy, September 1988, p. 4.2.
[17] F. Kappei and C.-E. Liedtke, "Modelling of a natural 3-D scene consisting of moving objects from a sequence of monocular TV images", SPIE, Vol. 860, Cannes, 1987.
[18] J. Kittler and J. Illingworth, "Minimum error thresholding", Pattern Recognition, Vol. 19, No. 1, 1986, pp. 41-47.
[19] R. Koch, "Dynamic 3-D scene analysis through synthesis feedback control", IEEE Trans. Pattern Anal. Mach. Intell., Vol. 15, No. 6, June 1993, pp. 556-568.
[20] T. Kurita, N. Otsu and N. Abdelmalek, "Maximum likelihood thresholding based on population mixture models", Pattern Recognition, Vol. 25, No. 10, 1992, pp. 1231-1240.
[21] G. Martinez, "3D motion estimation of articulated objects for object-based analysis-synthesis coding (OBASC)", Proc. Internat. Workshop on Coding Techniques for Very Low Bit-rate Video, Tokyo, Japan, 8-10 November 1995.
[22] H.G. Musmann, M. Hötter and J. Ostermann, "Object-oriented analysis-synthesis coding of moving images", Signal Processing: Image Communication, Vol. 1, No. 2, November 1989, pp. 117-138.
[23] J. Ostermann, "Object-based analysis-synthesis coding based on the source model of moving rigid 3D objects", Signal Processing: Image Communication, Vol. 6, No. 2, May 1994, pp. 143-161.
[24] J. Ostermann, "Object-based analysis-synthesis coding (OBASC) based on the source model of moving flexible 3D objects", IEEE Trans. Image Process., Vol. 3, No. 5, September 1994, pp. 705-711.
[25] D. Pearson, "Developments in model-based video coding", Proc. IEEE, Vol. 83, No. 6, June 1995, pp. 892-905.
[26] S.L. Sclove, "Population mixture models and clustering algorithms", Commun. Statist.-Theor. Meth., Vol. A6, No. 5, 1977, pp. 417-434.
[27] A. Shabana, Dynamics of Multibody Systems, Wiley, New York, 1989, pp. 1-116.
[28] R. Thoma and M. Bierling, "Motion compensating interpolation considering covered and uncovered background", Signal Processing: Image Communication, Vol. 1, No. 2, October 1989, pp. 191-212.