

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION

Vol. 2, No. 4, December, pp. 332-344, 1991

3D Structure Extraction Coding of Image Sequences H. MORIKAWA AND H. HARASHIMA

Department of Electrical Engineering, Faculty of Engineering, The University of Tokyo, 7-3-l Hongo Bunkyo-ku, Tokyo 113, Japan

Received March 15, 1991; accepted July 3, 1991

This paper presents a 3D structure extraction coding scheme that first computes the 3D structural properties such as 3D shape, motion, and location of objects and then codes image sequences by utilizing such 3D information. The goal is to achieve efficient and flexible coding while still avoiding visual distortions through the use of 3D scene characteristics inherent in image sequences. To accomplish this, we present two multiframe algorithms for the robust estimation of such 3D structural properties, one from motion and one from stereo. The approach taken in these algorithms is to successively estimate 3D information from a longer sequence for a significant reduction in error. Three variations of 3D structure extraction coding are then presented, namely 3D motion interpolative coding, 3D motion compensation coding, and "viewpoint" compensation stereo image coding, to suggest that the approach can be viable for high-quality visual communications. © 1991 Academic Press, Inc.

1. INTRODUCTION

Significant progress in image coding, image processing, and networking has created a demand for a wide range of new video services. In particular, the compression of image data has attracted considerable interest because of its applications in the transmission and storage of two-dimensional data. To date, most image coding schemes fall into the category of waveform coding. This approach takes advantage of the statistical correlations between neighboring pixels in order to eliminate redundancy across the images.

Although mathematically tractable, the statistical models do not incorporate the structural properties of the 3D scene. More recently, we have seen a new coding approach that uses physically based models to describe the image objects. The variations in the new approach include motion compensation [1], second-generation coding [2, 3], and model-based coding (or analysis-synthesis coding) [4-8]. Motion compensation and second-generation coding techniques are novel in that physical image source models (object-based models) were first introduced in coding, but their assumption of 2D structural models in simple motion has limited the flexibility and efficiency of the coding process. On the other hand, model-based image coding promises a dramatic reduction in bit rates. The requirement of a detailed and specific model, however, has limited its applicability.

In this paper we present a new physically based model to address these limitations. This is called 3D structure extraction coding because of its extensive use of 3D structural properties of objects that are not restricted to any one shape [9]. The idea is to provide compact scene representations by understanding the image formation process, such as the three-dimensional shape, motion parameters, and location of objects, rather than image representations. Under this scheme the encoder contains a computer vision system to estimate 3D structure and motion information. The decoder is equipped with an "image synthesizer" system that reconstructs images using the structural information transmitted from the encoder.

The motivations for considering the physical 3D structural properties are threefold. First, the use of 3D structural properties enables more efficient and higher-quality coding of image sequences. We can remove the temporal redundancy in surface texture information while still avoiding such visible distortions as blocking and mosquito effects. We can also cope with the 3D effects of moving pictures such as occlusion, thereby refining the efficiency of the prediction process around the object boundaries. Second, a flexible coding control that adaptively enhances the picture quality of objects of different saliency, e.g., by allocating more bits to the objects foremost to the viewer, becomes available. Third, describing such 3D structural properties would open up the possibility of providing a creative and communicative video environment, including CG-style video handling (editing), video indexing, and video databases.

To realize such a system, we need to develop algorithms for solving the following two problems: the recovery of 3D structural information from the image sequence and the coding of the image sequences using such 3D structural information.

With regard to the recovery process, there are several visual cues that help us determine 3D structural information. Such cues include motion, stereo, shading, texture, contour, and others [10]. In studying the computation of the "shape from X" process, however, one immediately

1047-3203/91 $3.00 Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved


3D STRUCTURE EXTRACTION CODING 333

faces the problem that most existing approaches are sensitive to noise. For a general coding scheme applicable to natural images, the recovery process should be robust to noise. Thus, the approach taken in our work is to successively estimate 3D structural information by combining the information from multiple frames [11]. The two algorithms described in this paper have been designed along this approach.

With regard to the coding process, 3D structural information can be used to achieve compression of image sequences by applying 3D motion compensated interframe prediction and by encoding only the temporal update information. 3D structural information can also be used to compress stereo image sequences effectively by applying "viewpoint" compensated prediction and by removing the spatial correlation between image pairs. This consideration leads to new coding schemes such as 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation stereo picture coding.

The algorithms for obtaining reliable 3D information are described in Section 2 (shape from motion) and Section 3 (shape from stereo). Each section contains detailed descriptions of the individual stages used in the recovery process as well as experimental results on real image sequences. The application of the extracted 3D information to image sequence coding is then discussed in Section 4.

2. 3D STRUCTURAL PROPERTIES FROM MOTION

In this section we describe how 3D structural properties of rigid and nonrigid objects are robustly estimated from motion information in moving pictures. The basic organization of the recovery process is shown in Fig. 1. The process consists of three components: displacement estimation, 3D structure and motion estimation, and surface reconstruction. The following subsections contain detailed descriptions of the individual modules used in the recovery process.

2.1. Displacement Estimation

It is generally assumed that the brightness of corresponding points in two subsequent images is the same. This is formulated as

dI/dt = 0,  (1)

where I denotes the brightness of a point in the image and t is the time. This can be expanded into the gradient constraint equation

I_x u + I_y v + I_t = 0,  (2)

FIG. 1. Block diagram of the "3D shape from motion" process.

where I_x, I_y, and I_t are the brightness derivatives in the spatial and temporal directions, and (u, v) is the displacement vector [12].

The brightness derivatives in Eq. (2) are estimated from discrete images and will be inaccurate due to noise in the imaging process and to sampling measurement error. To reduce these systematic errors, we first apply a Gaussian filter and then compute derivative values from the smoothed images. Note that the temporal derivatives are also well estimated by applying this smoothing process [13].

We can compute the displacement vectors more reliably by considering only the zero-crossings of the Laplacian of the Gaussian-filtered (smoothed) image, because the derivation of Eq. (2) is based on the assumption that the gradient estimates depend only on the first derivatives of the brightness function, and the second and higher spatial derivatives of the brightness function are nearly zero along zero-crossings.

Displacement vectors are thus computed along contours formed by detecting zero-crossings of the Laplacian of a Gaussian (∇²G)-filtered image. The analytic form of the ∇²G operator [14] is given by

∇²G = ∇²[(1/(2πσ²)) exp(−(x² + y²)/(2σ²))],  (3)

where σ is the space constant of the Gaussian, and we define w (= 2√2σ) as the width of the central excitatory region of the operator.
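As an illustration, the ∇²G operator of Eq. (3) and the detection of its zero-crossings can be sketched in a few lines of numpy. The function names are ours, and the convolution step itself is omitted (any 2D convolution routine can be used):

```python
import numpy as np

def log_kernel(sigma, size=None):
    """Sample the Laplacian-of-Gaussian operator of Eq. (3) on a grid."""
    if size is None:
        size = int(8 * sigma) | 1          # odd support wide enough for the tails
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    s2 = sigma ** 2
    g = np.exp(-(x ** 2 + y ** 2) / (2 * s2)) / (2 * np.pi * s2)
    # analytic Laplacian of the Gaussian
    k = g * ((x ** 2 + y ** 2) - 2 * s2) / s2 ** 2
    return k - k.mean()                    # zero-mean so flat regions give exactly 0

def zero_crossings(f):
    """Boolean mask of pixels where the filtered image changes sign."""
    zc = np.zeros(f.shape, dtype=bool)
    zc[:, :-1] |= (f[:, :-1] * f[:, 1:]) < 0   # horizontal sign changes
    zc[:-1, :] |= (f[:-1, :] * f[1:, :]) < 0   # vertical sign changes
    return zc
```

With w = 2√2σ, the paper's w = 8 corresponds to σ = w/(2√2) ≈ 2.8.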

The gradient constraint equation (2), however, does not by itself provide the means for uniquely calculating the displacement vectors. Additional assumptions about the displacement vectors are necessary to make the solution unique. Here we use the simple assumption that displacement vectors are constant over the zero-crossings within the small neighborhood of an n_c × n_c window.
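Under this constancy assumption, every zero-crossing pixel inside the window contributes one instance of Eq. (2), so (u, v) follows from linear least squares. A minimal numpy sketch (the helper name is ours; a full implementation would visit every zero-crossing contour point):

```python
import numpy as np

def displacement_at(Ix, Iy, It, zc, cy, cx, n=8):
    """Solve Eq. (2) for one (u, v), assuming it is constant over the
    zero-crossings inside an n x n window centered at (cy, cx)."""
    h = n // 2
    ys, xs = np.nonzero(zc[cy - h:cy + h, cx - h:cx + h])
    ys += cy - h; xs += cx - h
    # one gradient-constraint equation Ix*u + Iy*v = -It per zero-crossing pixel
    A = np.stack([Ix[ys, xs], Iy[ys, xs]], axis=1)
    b = -It[ys, xs]
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```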

Figure 2 illustrates the displacement estimation process outlined above. Figure 2a shows a frame of the head and shoulder image sequence "Rotating Head."



FIG. 2. Displacement estimation. (a) The original image, containing 256 x 240 pixels. (b) The resulting zero-crossing contours. (c) The computed displacement fields on the zero-crossing contours.

The contours of zero-crossings corresponding to the sharp, high-contrast edges are illustrated in Fig. 2b. Figure 2c shows the estimated displacement vectors along the zero-crossings. The parameters w = 8, n_c = 8 are used in the example.

2.2. 3D Structure and Motion Estimation

The task of the 3D structure and motion estimation stage is to determine 3D structure and motion from the displacement vectors. In this section we first discuss the smoothness-of-motion constraint and then show how 3D information is incrementally recovered through the use of the smoothness-of-motion constraint.

2.2.1. Smoothness-of-motion constraint. To date, almost all research in structure from motion has adopted the rigidity assumption and attempted to recover the structure and motion from a limited number of views of the scene, typically two or three views [12]. However, the rigidity assumption and the two- or three-view approach can be sensitive to noise and are clearly inappropriate when dealing with deformable objects.

We instead utilize a smoothness-of-motion constraint and attempt to estimate the structure and motion from a longer sequence [15, 16]. The smoothness-of-motion constraint is based on the nature of motion: moving objects exhibit a smooth motion due to inertia and elasticity. The smoothness-of-motion constraint is valid for most physical objects if a frame sequence is acquired at a rate such that no dramatic changes take place between frames.

The advantage of the smoothness-of-motion approach is that it provides a framework for temporal integration of information over a longer sequence and enables the successive refinement of 3D structure and motion. The motivation for using a longer sequence is twofold. First, the multiframe approach reduces the sensitivity to noise by using a larger amount of correlated data. Second, the multiframe approach may have the ability to cope with nonrigid objects, since a longer sequence gives more constraints than a traditional two- or three-view approach. We believe that smoothness of motion is more general than the traditional rigidity assumption.

2.2.2. Formulation. Let X = (X, Y, Z)^T represent a spatial point coordinate, and let x = (x, y)^T represent the corresponding image plane coordinate. We assume orthographic projection such that X = x and Y = y, and the Z-axis is aligned with the optical axis. (The formulation is easily extended to perspective projection but is computationally complex.) Let the relative motion of a rigid object in the scene between time t and t + 1 consist of a rotation ω(t + 1) = (ω_x(t + 1), ω_y(t + 1), ω_z(t + 1))^T and a translation T(t + 1) = (T_x(t + 1), T_y(t + 1), T_z(t + 1))^T.

Given this situation, the point P at location X = (X, Y, Z)^T will move according to

Ẋ = ω(t + 1) × X(t) + T(t + 1),  (4)

where the dot denotes temporal differentiation. The displacement vector (ẋ, ẏ) = (u, v) on the image plane, which is the projection of the 3D motion Ẋ, can then be written as

u = −ω_z(t + 1)y(t) + ω_y(t + 1)Z(t) + T_x(t + 1)  (5)

v = ω_z(t + 1)x(t) − ω_x(t + 1)Z(t) + T_y(t + 1).  (6)

These motion field equations can be established at pixels where the displacement vectors (u, v) are obtained. The task is then to estimate the 3D information: the motion parameters ω(t + 1) and T(t + 1) as well as the depth value Z(t). Note that the motion parameters ω and T are global quantities; i.e., there is one rigid body motion for one object. The depth Z, however, may vary spatially. We compute the 3D information incrementally from a longer sequence based on the smoothness-of-motion constraint.

Suppose that at a given instant we have the estimates ω̂(t) and T̂(t) of the motion parameters and the estimate Ẑ(t) of the depth value. These estimates will in general differ from the true values of motion and depth. We wish to update the estimates to make them closer to the true solution as the new frame t + 1 becomes available. To accomplish this we introduce the update terms Δω(t) = (Δω_x(t), Δω_y(t), Δω_z(t)), ΔT(t) = (ΔT_x(t), ΔT_y(t), ΔT_z(t)), and ΔZ(t), and update the estimates by

ω(t + 1) = ω̂(t) + Δω(t)  (7)

T(t + 1) = T̂(t) + ΔT(t)  (8)

Z(t) = Ẑ(t) + ΔZ(t).  (9)

Thus, given the estimates at time t, we compute the update terms Δω(t), ΔT(t), and ΔZ(t) based on the smoothness-of-motion constraint, instead of estimating ω(t + 1), T(t + 1), and Z(t) directly. To compute the update terms we substitute Eqs. (7), (8), (9) into Eqs. (5), (6) to obtain the motion field equations

u = −(ω̂_z(t) + Δω_z(t))y(t) + (ω̂_y(t) + Δω_y(t))(Ẑ(t) + ΔZ(t)) + T̂_x(t) + ΔT_x(t)

v = (ω̂_z(t) + Δω_z(t))x(t) − (ω̂_x(t) + Δω_x(t))(Ẑ(t) + ΔZ(t)) + T̂_y(t) + ΔT_y(t).  (10)

The smoothness-of-motion constraint means that the update terms Δω(t), ΔT(t), and ΔZ(t) should be small. Consequently, we introduce the functional ||P||² as a measure of the smoothness, and formulate the smoothness-of-motion constraint as a functional ||P||² to be minimized,

||P(t)||² = α||Δω(t)||² + β||ΔT(t)||² + γ Σ_{i=1}^n (ΔZ_i(t))²,  (11)

where α, β, and γ are scale parameters, || · || is the L2 norm, n is the number of points considered in the computation, and Z_i is the depth value of the ith point.

This smoothness-of-motion constraint, i.e., the requirement that the functional ||P||² be as small as possible, can be incorporated into the motion field equation (10). That is, to determine the update terms Δω(t), ΔT(t), and ΔZ(t), we choose the function E to be minimized as a sum of two terms: the first is the error function based on Eq. (10) in the least-squares sense, and the second is the cost function (11),

E(Δω(t), ΔT(t), ΔZ(t)) = Σ_{i=1}^n ((u_i − û_i)² + (v_i − v̂_i)²) + ||P(t)||²,  (12)

where

û_i = −(ω̂_z(t) + Δω_z(t))y_i(t) + (ω̂_y(t) + Δω_y(t))(Ẑ_i(t) + ΔZ_i(t)) + T̂_x(t) + ΔT_x(t)

v̂_i = (ω̂_z(t) + Δω_z(t))x_i(t) − (ω̂_x(t) + Δω_x(t))(Ẑ_i(t) + ΔZ_i(t)) + T̂_y(t) + ΔT_y(t).  (13)

FIG. 3. Estimation of 3D structure information on the zero-crossing contours (frame 43).

After the update terms Δω(t), ΔT(t), and ΔZ_i(t) have been determined by minimizing the functional E, the new 3D information, ω(t + 1), T(t + 1), and Z_i(t), is derived from Eqs. (7), (8), (9).

The operation of the incremental recovery process on a rotating head video sequence is illustrated in Fig. 3. This figure shows the estimates of the depth value along the zero-crossings at frame 43. The estimates are obtained incrementally from the previous estimates ω̂, T̂, and Ẑ at frame 42. The eyes and nose can be seen in the estimated depth information. The parameters used in Eq. (11) are α = 100, β = 1, γ = 3. The value of the parameter α is designed to be almost 10² times larger than the values of β and γ, since Δω is in radians while ΔT and ΔZ are in pixel units. Furthermore, these values are chosen from the trade-off among the speed of convergence, robustness to noise, and the ability to cope with nonrigid objects. Empirical studies show that the scheme works well with noise and nonrigid motion, and that the estimation results are not very sensitive to the parameter selection [15, 16].
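One way to carry out the minimization of Eq. (12) can be sketched as follows. The update terms enter Eq. (10) through products such as Δω_y · ΔZ_i, so a simple option (our simplification, not necessarily the authors' exact procedure) is to drop these second-order products, which turns Eq. (12) into a regularized linear least-squares problem in which the cost (11) acts as a Tikhonov term. T_z is omitted here because it does not appear in the orthographic equations (5), (6):

```python
import numpy as np

def update_terms(x, y, Z, u, v, w_hat, T_hat, alpha=100.0, beta=1.0, gamma=3.0):
    """One incremental update of Eqs. (7)-(9): linearize Eq. (10) by dropping
    second-order products of update terms and minimize Eq. (12) as a
    regularized linear least-squares problem in (dw, dT, dZ_1..dZ_n)."""
    n = len(x)
    wx, wy, wz = w_hat
    Tx, Ty = T_hat          # T_z is unobservable under orthographic projection
    A = np.zeros((2 * n, 5 + n))
    b = np.zeros(2 * n)
    for i in range(n):
        # residual of Eq. (5): u_i - (-wz*y_i + wy*Z_i + Tx)
        A[2*i, 1] = Z[i]; A[2*i, 2] = -y[i]; A[2*i, 3] = 1.0; A[2*i, 5+i] = wy
        b[2*i] = u[i] - (-wz * y[i] + wy * Z[i] + Tx)
        # residual of Eq. (6): v_i - (wz*x_i - wx*Z_i + Ty)
        A[2*i+1, 0] = -Z[i]; A[2*i+1, 2] = x[i]; A[2*i+1, 4] = 1.0; A[2*i+1, 5+i] = -wx
        b[2*i+1] = v[i] - (wz * x[i] - wx * Z[i] + Ty)
    # smoothness-of-motion cost (11) acts as Tikhonov regularization
    W = np.diag([alpha] * 3 + [beta] * 2 + [gamma] * n)
    p = np.linalg.solve(A.T @ A + W, A.T @ b)
    return p[:3], p[3:5], p[5:]   # dw, dT, dZ
```

When the current estimates already explain the observed displacements, the residual b vanishes and all update terms are zero, as the smoothness-of-motion constraint demands.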

We have described above the recovery process of computing displacement vectors from Eq. (2) and then using them in Eq. (12) to obtain structure and motion information. Instead of this two-stage process, we can also consider a direct method: by plugging the motion field equation (10) into the gradient constraint equation (2), we obtain one equation that links image brightness gradients to structure and motion parameters.



2.3. Surface Reconstruction

The displacement estimation algorithm described in Section 2.1 utilizes the zero-crossings, and the subsequent structure from motion computation (Section 2.2) derives depth information Z_i(t) only at the locations of zero-crossings (x_i, y_i). From the viewpoint of image coding, it may be useful to produce dense or complete surface representations. Some type of filling-in or interpolation of depth information is needed. To this end we reconstruct a smooth surface by interpolating an initial representation consisting of depth information at the zero-crossings. The motivations behind this are twofold. First, the absence of a zero-crossing tells us that no physical property is changing and that the surface is smooth to some extent. Second, the smooth surface makes the interframe prediction efficient in the coding process [17].

Reconstructing smooth surfaces is a problem that has been dealt with extensively [18-21]. Among several algorithms, we employ a form of surface splines [19, 20] which is computationally simple. Given the depth information Z_i(x, y, t) along zero-crossings, we seek to determine a desired surface function Z(x, y, t) by minimizing

E_m(Z(x, y, t)) = E_smooth + E_close = ∬ (Z_xx² + 2Z_xy² + Z_yy²) dx dy + μ Σ (Z(x, y, t) − Z_i(x, y, t))²,  (14)

where the subscripts to Z denote partial derivatives, the summation takes place along the zero-crossings where there is a depth measurement Z_i(x, y, t), and μ is a nonnegative scale parameter. The first term enforces "smoothness" of the surface, and the second "closeness" to the input data.
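A discrete approximation of this interpolation can be computed by simple relaxation sweeps. The sketch below uses a first-order "membrane" smoothness term rather than the spline of Eq. (14), for brevity, and periodic boundaries via np.roll; each sweep moves every pixel toward the average of its four neighbors, pulled toward the data wherever a measurement exists:

```python
import numpy as np

def interpolate_surface(Z_meas, mask, mu=1.0, iters=500):
    """Fill in a dense depth map from sparse zero-crossing measurements
    (mask == True where a measurement Z_meas exists)."""
    Z = np.where(mask, Z_meas, Z_meas[mask].mean())  # start from the mean depth
    for _ in range(iters):
        nb = (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
              np.roll(Z, 1, 1) + np.roll(Z, -1, 1)) / 4.0
        # weighted average of the smoothness term and the closeness term
        Z = np.where(mask, (nb + mu * Z_meas) / (1.0 + mu), nb)
    return Z
```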

The current estimate of the surface Z(x, y, t) can then be used to predict the estimates of the depth value Z(x, y, t + 1) in the next iteration. The predicted surface Z(x, y, t + 1) is computed by the geometrical transformation of the surface Z(x, y, t) according to the estimates of the motion

FIG. 4. The original images. (a) "Miss America" frame 97.

FIG. 5. Perspective views of the recovered surface. (a) "Rotating Head" frame 43. (b) "Miss America" frame 97.

parameters ω(t + 1) and T(t + 1). The predicted surface Z(x, y, t + 1) then serves as an initial depth estimate of the zero-crossings in the next iteration to repeat the process described in Section 2.2. A full 3D surface is thus built up over an extended time.

We have applied the above process to the two head and shoulder video sequences "Rotating Head" and "Miss America" (Fig. 4). Figure 5a is an example of applying the algorithm to the estimated depth values shown in Fig. 3 (parameter μ = 1). The estimated structure tends to be a smooth one, due mainly to the smoothing process of the surface reconstruction for reducing the noise effect. The Miss America example is illustrated in Fig. 5b. The Miss America sequence contains little 3D motion; consequently, the estimated shape tends to be flat.

3. 3D STRUCTURAL PROPERTIES FROM STEREO

In the previous section we described the 3D recovery process using the motion cue. Another important passive cue for determining 3D structural properties is stereo. In this section we present the “shape from stereo” process developed for obtaining reliable 3D structural properties.

The most difficult problem in shape from stereo is to establish a correspondence of features from the two images. In a number of existing approaches attempts are made to obtain reliable correspondence from a set of stereo images at a given instant, i.e., from a snapshot of stereo images [22]. Because the measurement of image brightness introduces error, such a single snapshot cannot provide very accurate information. To overcome this difficulty, we approach this problem by integrating information from an extended viewing period of stereo images [23]. The scheme consists of three components: matching, prediction with motion estimation, and update of the surface information, as shown in Fig. 6. The following subsections contain detailed descriptions of the individual modules.

In what follows we choose the coordinate system as shown in Fig. 7, with the origin at the left camera's center of projection (principal distance f) and the optical axis aligned with the Z axis. A point P at location X = (X, Y, Z)^T in the scene is imaged on the left and right



FIG. 6. Block diagram of the "3D shape from stereo" process.

image planes at pixel locations x_l = (x_l, y_l)^T and x_r = (x_r, y_r)^T, respectively.

3.1. Stereo Matching

As shown in the structural block diagram in Fig. 6, as the next stereo image pair t + 1 becomes available, the stereo matching process is performed to obtain disparities, and consequently the depth information Z(x, y, t + 1).

In the matching process, we first compute the zero-crossings of the ∇²G-filtered left and right images for the symbolic features to be matched. (The precise form of the ∇²G operator is given in Eq. (3).) In addition to the locations of the zero-crossings, we also extract the contrast sign of the zero-crossings (whether the filtered values change from positive to negative, or negative to positive, as we scan along horizontal lines) [19].

Given a set of zero-crossing representations for each of the images, the matching process takes place between the zero-crossings that are of the same contrast sign, and the 3D depth Z(x, y, t + 1) of the scene along zero-crossings is extracted using triangulation. For each zero-crossing in one image (say the left) at position (x_l, y_l), a set of candidate zero-crossing points is selected from the region of the right image

{(x_r, y_r) | x_l − w_c ≤ x_r ≤ x_l, y_r = y_l},  (15)

where w_c is a given estimate of the maximum disparity (which we may initially assume to be some arbitrary value). If more than one match is found within the above region, then we compute the correlation value of an n_c × n_c window around a zero-crossing in the left and right filtered images, and we accept the match for which the correlation value is largest as well as greater than a given threshold Th. Otherwise, the match at that point is left ambiguous, and the computation of the depth is not performed.
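The candidate search of Eq. (15) combined with the correlation test can be sketched as follows (the function name and input layout are ours; the assumed inputs are the two ∇²G-filtered images, zero-crossing masks, and contrast-sign maps):

```python
import numpy as np

def match_zero_crossings(fl, fr, zc_l, zc_r, sign_l, sign_r, wc=50, n=11, Th=0.995):
    """Match each left zero-crossing against same-sign candidates on the same
    scanline within the disparity range of Eq. (15); keep the match whose
    windowed correlation of the filtered images is largest and above Th."""
    h = n // 2
    H, W = fl.shape
    disparity = {}
    for yl, xl in zip(*np.nonzero(zc_l)):
        if yl < h or yl >= H - h or xl < h or xl >= W - h:
            continue                              # window would leave the image
        pl = fl[yl-h:yl+h+1, xl-h:xl+h+1].ravel()
        best, best_c = None, Th
        for xr in range(max(h, xl - wc), xl + 1):  # candidate region, Eq. (15)
            if xr >= W - h or not zc_r[yl, xr] or sign_r[yl, xr] != sign_l[yl, xl]:
                continue
            pr = fr[yl-h:yl+h+1, xr-h:xr+h+1].ravel()
            c = np.dot(pl, pr) / (np.linalg.norm(pl) * np.linalg.norm(pr) + 1e-12)
            if c > best_c:
                best, best_c = xr, c
        if best is not None:
            disparity[(yl, xl)] = xl - best        # depth follows by triangulation
    return disparity
```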

Since an incorrect match point might satisfy the matching constraint described above, we cannot expect all of the estimated depth information Z(x, y, t + 1) to be accurate. Conceptually, information from multiple frames may be useful to reduce such errors and produce a useful estimate over time. The following subsections describe the way in which multiple frame measurements can be integrated.

3.2. Prediction with Motion Estimation

As shown in the block diagram in Fig. 6, as the next stereo image pair t + 1 becomes available, the motion estimation and prediction process is also performed in parallel with the stereo matching process. The task of the motion estimation and prediction stage is to estimate the 3D motion and to determine what our current depth estimate Z(x, y, t) will look like at time t + 1.

In the motion estimation stage, we can exploit the incremental nature of the recovery process. At every instant in time, the process produces an estimate of the depth Z(x, y, t), and we use this estimate of the depth map as input to the motion estimation stage, as shown in the block diagram of Fig. 6. The task of the motion estimation stage is then to compute the motion parameters ω and T between time t and t + 1, given a depth map Z(x, y, t), the gradient constraint equation (2) of the left image, and the equations of perspective projection

x = f X/Z  (16)

y = f Y/Z.  (17)

FIG. 7. Relationship between the coordinate system and the cameras.



The motion field equations in the case of perspective projection can be obtained by substituting (16), (17) in (4),

u = (f T_x − x T_z)/Z + (1/f)(−ω_x xy + ω_y(x² + f²) − ω_z y f)  (18)

v = (f T_y − y T_z)/Z + (1/f)(−ω_x(y² + f²) + ω_y xy + ω_z x f).  (19)

By plugging these motion field equations into the gradient constraint equation (2), we can obtain one equation that links the image brightness gradients I_x, I_y, I_t to the motion parameters ω and T linearly. We obtain one such linear equation for each pixel in the region of interest. The motion parameters are then computed using a least-squares method.
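This direct method reduces to one linear system. A sketch under the stated assumptions (pixels collected into flat arrays; image coordinates measured from the principal point; helper name ours):

```python
import numpy as np

def direct_motion(Ix, Iy, It, x, y, Z, f):
    """Direct motion estimation: substituting the motion field Eqs. (18), (19)
    into the gradient constraint Eq. (2) gives one linear equation per pixel
    in (wx, wy, wz, Tx, Ty, Tz); solve them all by least squares."""
    A = np.stack([
        -Ix * x * y / f - Iy * (y**2 + f**2) / f,   # coefficient of wx
        Ix * (x**2 + f**2) / f + Iy * x * y / f,    # coefficient of wy
        -Ix * y + Iy * x,                           # coefficient of wz
        f * Ix / Z,                                 # coefficient of Tx
        f * Iy / Z,                                 # coefficient of Ty
        -(Ix * x + Iy * y) / Z,                     # coefficient of Tz
    ], axis=1)
    sol, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return sol[:3], sol[3:]   # (wx, wy, wz), (Tx, Ty, Tz)
```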

Given the motion parameters ω and T, we now apply the prediction process to determine what our current surface estimate Z(x, y, t) will look like at time t + 1. This requires a geometrical transformation of the surface Z(x, y, t) according to the equation of rigid body motion as shown in (4). The surface Z(x, y, t) is, however, known only through samples on a discrete grid. Consequently, we resample the transformed surface by interpolating the given transformed samples to obtain the predicted surface Z⁻(x, y, t + 1).

In summary, the motion estimation and prediction process converts the current estimate Z(x, y, t) into the new estimate Z⁻(x, y, t + 1). This new estimate can then be combined with the depth information Z(x, y, t + 1) obtained from the stereo matching process to update the estimate in the manner described below.

3.3. Update of Surface Information

As we see from the block diagram of Fig. 6, the task of the update stage is to take as input the depth measurements Z(x, y, t + 1) along zero-crossings and combine them with the new estimate Z⁻(x, y, t + 1) to update the estimate. The update process then finds a surface Ẑ(x, y, t + 1) that is as close as possible to both Z(x, y, t + 1) and Z⁻(x, y, t + 1). The values of both Z(x, y, t + 1) and Z⁻(x, y, t + 1), however, are subject to errors. To reduce the sensitivity to such errors, we assume that the surface being observed exhibits some amount of smoothness. In essence, the update process can be considered to be the surface reconstruction procedure described in Section 2.3.

Formally, similar to Eq. (14), we compute the desired surface function Ẑ(x, y, t + 1) by minimizing the energy function

E(Ẑ) = E_smooth(Ẑ) + μ ∫∫_R (Ẑ(x, y, t + 1) − Ẑ⁻(x, y, t + 1))² dx dy + λ Σ_{(x,y)∈D} (Ẑ(x, y, t + 1) − Z(x, y, t + 1))²,    (20)

where the summation takes place over the region D for which depth measurements Z(x, y, t + 1) are obtained, and μ and λ are nonnegative scale parameters. This new surface Ẑ(x, y, t + 1) can then be used as the initial depth estimate in the next iteration of the surface reconstruction procedure, and the process repeats itself.

The energy function E contains three terms, E_smooth, E_close1, and E_close2, and the scale parameters μ and λ determine their relative strengths. Recently, Heel [24] showed that the surface reconstruction process is essentially identical to the update procedure of the Kalman filter when the scale parameters are chosen as the inverse variances of the estimates. Similarly, determining the optimal values for μ and λ requires an analysis of the errors in Z(x, y, t + 1) and Ẑ⁻(x, y, t + 1), so that the scale parameters may be chosen to smooth out the effect of these errors.
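Dropping the smoothness term for clarity, the inverse-variance choice reduces the update to a per-pixel Kalman-style fusion of prediction and measurement. The following is a sketch under that simplification; the function name and the boolean mask for the measured region D are ours:

```python
import numpy as np

def kalman_depth_update(Z_pred, var_pred, Z_meas, var_meas, measured):
    """Per-pixel Kalman-style fusion of the motion-predicted depth
    Z_pred and the stereo measurement Z_meas, each weighted by its
    inverse variance (after Heel [24]). `measured` marks pixels in D
    where a zero-crossing measurement exists; elsewhere the
    prediction is carried forward unchanged."""
    gain = var_pred / (var_pred + var_meas)          # Kalman gain
    Z_new = np.where(measured, Z_pred + gain * (Z_meas - Z_pred), Z_pred)
    var_new = np.where(measured, (1.0 - gain) * var_pred, var_pred)
    return Z_new, var_new
```

The fused variance is always smaller than either input variance at measured pixels, which is how the estimate improves over the sequence.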

3.4. Experimental Results

The multiple-frame approach increases the robustness and accuracy of the solution by providing additional redundancy to the algorithm. To confirm this, we use a head-and-shoulders stereo image sequence (256 × 240 pixels, 8 bits). The scene is shown in Fig. 8. This scene poses a number of difficulties because it contains large regions of uniform brightness, which makes stereo matching difficult. Since true depth maps are rarely available for real scenes, we assess the depth estimates in a qualitative manner here.

The left and right images are convolved with the ∇²G filter with central width w = 8. The zero-crossings obtained from these convolutions are shown in Fig. 9. The contrast signs of zero-crossings are displayed in black or white. The disparity map obtained by matching the zero-crossings of the first stereo image pair is shown in Fig. 10 (parameters w_c = 50, n = 11, and Th = 0.995). The disparity map is displayed using intensity to encode depth, so that brighter disparity points are closer to the camera. In this experiment, the total number of zero-crossing pixels is 2280, the number of matched zero-crossings is 1837, and the maximum disparity is 43 pixels. Figures 11a, 11b, and 11c show the



FIG. 8. The original image sequence. (a) Frame 1. (b) Frame 11. (c) Frame 21.

perspective views of the recovered surface of the face obtained after 1, 11, and 21 iterations of the algorithm. The parameters μ = 1 and λ = 0.1 are used in the surface reconstruction equation (20). These perspective views show how the estimate improves over time; in particular, the shape of the nose becomes more distinct. This can be explained by the fact that the zero-crossings around the nose become more explicit as the face rotates, and the algorithm integrates this information over time.

FIG. 10. Disparity map (frame 1).

4. USING 3D STRUCTURAL PROPERTIES FOR CODING

In the previous sections we have described how the 3D structural properties of rigid and nonrigid objects are robustly estimated from a sequence of images. Given such 3D structural properties of scenes, several manipulations of real images become possible. From the coding point of view, we are interested in how the 3D structural properties are incorporated into coding. We refer to this coding technique as "3D structure extraction coding."

In this section we present the concept of 3D structure extraction coding and then show the usefulness of 3D structural descriptions in 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation stereo image coding.

4.1. 3D Structure Extraction Coding

The image is highly structured and organized: image components can be grouped on the basis of regularities, e.g., closeness, similar form, continuity, and similarity. It is this internal structuring that allows us to proceed to spatial or semantic understanding, and also to obtain a compact and efficient description of images. Image coding techniques can then be considered to have two main processes, as shown in Fig. 12: recovery of these regularities (messages) and coding.

Well-developed waveform coding schemes utilize these regularities as statistical models and extract messages through prediction or orthogonal transforms.

FIG. 11. Perspective views of the recovered surface using incremental stereo. (a) Frame 1. (b) Frame 11. (c) Frame 21.

FIG. 9. Zero-crossings with contrast sign (frame 1).

FIG. 12. Image coding.

FIG. 13. 3D structure extraction coding.

Despite the prevalence of waveform coding, the use of simple statistical models generally means that the efficiency, flexibility, and interactivity of the coding process are limited.

One approach to overcoming this difficulty is to utilize structure and motion information, which is among the most important attributes to be determined from image sequences. In developing this approach, the models describing the structure and motion information become important. With more sophisticated models, more coding gain can be achieved while still avoiding visual distortions. To date, however, only relatively simple models have been investigated, e.g., a 2D rectangular object undergoing 2D translational movement, as used in the block-matching method. This is due mainly to computational complexity and real-time processing requirements.

3D structure extraction coding, in contrast, employs a physically based source model of a 3D moving object that is not restricted to any particular shape. Table 1 gives a brief description of image coding techniques and their associated image source models. 3D structure extraction coding thus consists of a 3D information recovery stage and a coding stage, as shown in Fig. 13. Compared with existing coding schemes, much heavier emphasis is placed on the first block of Fig. 13, which extracts the 3D structural properties, such as three-dimensional shape, motion parameters, and location of objects.

Given such 3D structural properties, we are able to use them for synthesizing new images, for example, by rotating, translating, zooming, or deforming the objects and remapping them onto the image plane. These characteristics allow one to utilize such 3D structural properties for

TABLE 1
Image Coding Techniques and Image Source Models

Coding techniques                  Image source models
Waveform coding                    Stochastic model
Motion compensation coding         2D planar and motion model
Second-generation coding           2D structure model (contours, texture)
3D structure extraction coding     3D structure and motion model
Model-based coding                 3D model (a priori knowledge)

coding image sequences flexibly and efficiently, as we discuss below.

4.2. Examples of 3D Structure Extraction Coding

In this section we present three image coding procedures that rely on the estimated 3D structural properties: 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation stereo image coding.

4.2.1. 3D motion interpolative coding. Frame interpolation is an attractive scheme both for further reducing the bit rate and for obtaining high-quality images. Simple interpolation techniques such as frame repetition and linear interpolation, however, show visible degradations like jerkiness and blurring.

If 3D structure and motion information between transmitted frames is available, it is straightforward to synthesize the skipped images while maintaining the interpolated image's naturalness. The frame to be interpolated is computed as follows. Let the transmitted frames have the associated temporal positions t = 0 for frame k and t = 1 for frame k + 1. Suppose we denote the motion of objects between frames k and k + 1 by a rotation matrix R(ω) and a translation T. For the small interframe motion considered here, the rotation matrix R can be written in terms of a rotation vector ω = (ω_x, ω_y, ω_z)^T as

R(ω) = [  1    −ω_z   ω_y
         ω_z    1    −ω_x
        −ω_y   ω_x    1  ].    (21)

The frame to be interpolated at t = τ (0 < τ < 1) is then calculated as a function of ω, T, and τ as

S_τ(R(τω)X + τT) = (1 − τ)S_k(X) + τS_{k+1}(RX + T),    (22)

where S_k(X) represents the image intensities of frame k corresponding to the location X = (X, Y, Z)^T.
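For a single surface point, the interpolation rule can be sketched as follows. This is a hypothetical illustration: the small-angle form of R(ω) and the intensity-lookup helpers S_k, S_k1 are our assumptions, not the authors' code.

```python
import numpy as np

def rotation_matrix(w):
    """Small-angle rotation matrix R(w) for w = (wx, wy, wz)."""
    wx, wy, wz = w
    return np.array([[1., -wz,  wy],
                     [wz,  1., -wx],
                     [-wy, wx,  1.]])

def interpolate_intensity(X, S_k, S_k1, w, T, tau, f):
    """Evaluate Eq. (22) for one 3D point X on the object surface:
    move X through the partial motion to t = tau, and blend the
    intensities this point shows in frames k and k+1. S_k and S_k1
    are lookup functions mapping an image point (x, y) to brightness.
    Returns the image position at t = tau and the blended intensity."""
    R_full = rotation_matrix(w)                    # motion over the whole interval
    R_tau = rotation_matrix(tau * np.asarray(w))   # partial motion to t = tau
    X_tau = R_tau @ X + tau * np.asarray(T)
    X_k1 = R_full @ X + np.asarray(T)
    # Perspective projection of the intermediate position.
    pos_tau = (f * X_tau[0] / X_tau[2], f * X_tau[1] / X_tau[2])
    s = (1 - tau) * S_k(f * X[0] / X[2], f * X[1] / X[2]) \
        + tau * S_k1(f * X_k1[0] / X_k1[2], f * X_k1[1] / X_k1[2])
    return pos_tau, s
```

A full interpolator would run this over every surface sample and rasterize the results onto the image grid.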

The above 3D motion interpolative coding has been investigated by means of computer simulation on two head-and-shoulders video sequences, Rotating Head and Miss America (Fig. 4). The frame rate of the original sequences is reduced to 7.5 frames/s by omitting three frames out of four. From these transmitted frames the 3D structure and motion information is estimated using the



process described in Section 2, and the skipped frames are obtained by 3D motion interpolation.

The 3D motion interpolated images have been compared with results obtained by frame repetition and linear interpolation. Figure 14 shows the images obtained by linear interpolation (left) and 3D motion interpolation (right). From these experiments we find that the 3D information recovery process described in Section 2 performs sufficiently well for 3D motion interpolation and generates a better-quality image sequence, because the 3D motion of objects is taken into account to preserve the natural impression of motion.

4.2.2. 3D motion compensation coding. The 3D structural properties can also be utilized by incorporating them into the predictive loop of the coding process. We refer to this as "3D motion compensation coding." The key feature of 3D motion compensation coding is that the transmitted 3D information represents one or more global attributes of the image sequence. Owing to this property, we can avoid visible distortions such as blocking and mosquito effects by transmitting the global 3D parameters of objects, rather than local motion parameters such as those obtained from the block-matching method. In addition, we can improve the coding efficiency both by coping with 3D effects, such as rotation, occlusion, zoom, and pan, in the prediction stage and by transmitting only these global attributes.
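The predictive loop described above can be sketched schematically. The helper functions below are hypothetical stand-ins for the analysis, prediction, and waveform-coding stages in the text; only the loop structure is the point.

```python
def encode_sequence(frames, estimate_3d, predict_frame, waveform_code):
    """Skeleton of a 3D motion compensation encoder (a sketch, not the
    authors' system). Per frame, only the global parameters (w, T) and
    the waveform-coded prediction residual are transmitted. The encoder
    tracks the decoder-side reconstruction so both stay in step."""
    recon = frames[0]                      # intra-coded first frame
    stream = [recon]
    for frame in frames[1:]:
        w, T, depth = estimate_3d(recon, frame)        # scene analysis
        pred = predict_frame(recon, depth, w, T)       # 3D prediction
        residual = frame - pred
        coded = waveform_code(residual)                # e.g., transform coding
        stream.append((w, T, coded))
        recon = pred + coded               # decoder-side reconstruction
    return stream
```

Because the residual is coded by a conventional waveform coder, analysis errors degrade efficiency but never prevent reconstruction, which is exactly the combination of analysis-synthesis and waveform coding discussed below.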

This 3D motion compensation coding method can be viewed from another perspective as a combination of analysis-synthesis coding and conventional waveform coding. Errors occurring at the scene analysis stage are compensated using well-developed waveform coding techniques. This approach thus simplifies the scene analysis task, which is a central topic in computer vision, and appears to be a promising route to analysis-synthesis coding.

FIG. 14. 3D motion interpolation: linear interpolated image (left) and 3D motion interpolated image (right). (a) "Rotating Head." (b) "Miss America."

FIG. 15. 3D motion compensation: frame difference image (left) and 3D motion compensated prediction error image (right). (a) "Rotating Head." (b) "Miss America."

To give an impression of the prediction performance of 3D motion compensation coding, we present two examples. As with 3D motion interpolation above, the 3D structural properties are estimated incrementally over time as described in Section 2. The right side of Fig. 15 shows the 3D motion compensated prediction error images for the two image sequences Rotating Head and Miss America. For display, the error is amplified by a factor of 10 and truncated to 255, and an inverse bit assignment is used (255 corresponds to black). We see that some errors remain around edges where the prediction is insufficient. To overcome this, it would be necessary to incorporate a local prediction process in addition to the global one. The left side of Fig. 15 shows the simple frame difference images, which demonstrate the amount of motion in the sequences.

In these experiments, we predict the next frame assuming that intensity remains constant as the object moves. In general, however, image intensity changes along the motion trajectory. This is easily understood from the illumination model commonly used in computer graphics,

I = I_ambient + I_incident(l, n, r, v),    (23)

where I is the image intensity and l, n, r, and v are the unit vector to the light source, the unit surface normal, the unit vector in the direction of reflection, and the unit vector toward the viewpoint, respectively [25]. Thus, if the illumination can be modeled as ambient light (light incident from the environment, not from specific light sources), the simple prediction process in our experiments works well. Otherwise, we must consider the second term I_incident and compensate for the image intensity changes due to motion to improve the prediction efficiency. Empirical studies show, however, that the simple prediction process used in our experiments is sufficient if the motion between frames is small and the images contain no highlights.
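A Phong-style instance of Eq. (23) makes the two terms concrete. The coefficients and the specular exponent below are illustrative choices, not values from the text; the reflection vector r is derived from n and l in the usual way.

```python
import numpy as np

def intensity(n, l, v, I_ambient=0.2, kd=0.6, ks=0.2, shininess=8):
    """Phong-style instance of Eq. (23): an ambient term plus an
    incident term depending on the unit surface normal n, the unit
    light direction l, and the unit view direction v. The incident
    term splits into diffuse and specular parts."""
    r = 2 * np.dot(n, l) * n - l                 # mirror reflection of l about n
    diffuse = kd * max(np.dot(n, l), 0.0)
    specular = ks * max(np.dot(r, v), 0.0) ** shininess
    return I_ambient + diffuse + specular
```

Under pure ambient light (kd = ks = 0) the intensity of a surface point is independent of its orientation, which is why the constant-intensity prediction in the experiments holds in that case.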

4.2.3. "Viewpoint" compensation coding. Developing techniques for 3D video systems and holographic television is a long-held dream of both optics and electronics researchers. One of the technical problems in developing such 3D systems is coding the extremely large amount of 3D information. Here we can exploit the 3D nature of the information to be transmitted and utilize the estimated 3D structural properties in coding.

The idea of the coding is the following. Given the 3D structural properties, we can synthesize new images from different viewpoints by geometrical transformations. This transformation performs the necessary rotations, translations, and perspective transformations according to the configuration of the coordinate systems in which the stereo (or multiviewpoint) images are taken. The transformed version of one image is then a good predictor of the other images, and in this way we remove the spatial correlation of stereo images. The 3D structural properties are thus utilized in a spatial prediction process, instead of the temporal prediction used in the 3D motion compensation coding described in Section 4.2.2. We refer to this coding as "viewpoint compensation coding," and we consider it an extension of disparity compensation coding [26], which measures the disparity using a local method, e.g., the block-matching method.
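For the special case of a rectified stereo pair, viewpoint compensation reduces to a depth-dependent horizontal shift of each pixel. The sketch below assumes that simplified geometry (focal length f, baseline b, disparity d = f·b/Z); the general case in the text uses the full rotation and perspective transformation between the cameras, and the hole handling here is deliberately crude.

```python
import numpy as np

def predict_left_from_right(right, Z, f, b):
    """Predict the left image of a rectified stereo pair from the
    right image and the estimated depth Z, by shifting each right
    pixel by its disparity d = f*b/Z. Returns the prediction and a
    mask of filled pixels; unfilled pixels correspond to areas that
    appear or disappear and would be interpolated in practice."""
    h, w = right.shape
    pred = np.zeros_like(right)
    filled = np.zeros((h, w), dtype=bool)
    d = np.rint(f * b / Z).astype(int)           # per-pixel integer disparity
    for y in range(h):
        for x in range(w):
            xl = x + d[y, x]                     # column in the left image
            if 0 <= xl < w:
                pred[y, xl] = right[y, x]
                filled[y, xl] = True
    return pred, filled
```

The residual between this prediction and the true left image is what the stereo coder would transmit, in place of the left image itself.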

An example of this coding is shown in Fig. 16. We use the stereo image sequence shown in Fig. 8. The 3D structural properties are estimated from the stereo image sequence incrementally over time as described in Section 3. The configuration of the two cameras is known. Using this knowledge, we synthesize the predicted left image from the right image alone, as shown in Fig. 16a. In this example the areas that appear or disappear in the prediction process are linearly interpolated to obtain the predicted image. The right side of Fig. 16b shows the prediction error image between the original left image and the predicted image shown in the right side of Fig. 16a. The left side of Fig. 16b shows the prediction error


FIG. 16. Viewpoint compensation stereo image coding. (a) The original right image (left) and the predicted left image from the original right image using viewpoint compensation (right). (b) The error image between the original left image and the translated right image (left) and the error image between the original left image and the viewpoint compensated image (right).

image between the original left image and the predicted image obtained by simply translating the right image so that the overall error energy is minimized. From this example we see that viewpoint compensation coding copes well with the 3D nature of stereo images.

5. CONCLUSION

We have presented two multiframe algorithms for the dense estimation of 3D structure and motion information, one from motion and one from stereo, and also their application to image coding, i.e., 3D structure extraction coding.

The key idea of these 3D recovery algorithms is to successively estimate 3D structural properties by combining information from multiple frames. The use of a longer sequence makes the 3D recovery process robust to noise. The preliminary results show that these schemes work well with noisy natural images.

One assumption made in this recovery process is that the scene has already been segmented. To relax this, it would be necessary to segment the image into regions corresponding to independently moving objects and then run the recovery process on each region independently. Toward this end, we are currently studying an incremental segmentation algorithm that includes dynamic occlusion analysis to improve the segmentation results over time [11].



As for coding, we have presented the concept of 3D structure extraction coding and given as examples three image coding procedures that rely on the 3D structural properties of the scene: 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation coding. Instead of the statistical correlations or 2D properties used in conventional coding techniques, 3D structure extraction coding computes the 3D structural properties to be transmitted, such as three-dimensional shape, motion, and location of objects. Using such 3D structural properties inherent in image sequences, we can achieve efficient and flexible coding while still avoiding visual distortions.

Such a 3D structure extraction coding system, however, is still at a very early stage and is not complete. It seems clear that the efficiency of the coding schemes must be evaluated quantitatively. An interesting but difficult problem is developing a method for coding the estimated 3D structural properties, along with an appropriate coding strategy. The most straightforward way of coding the estimated surface is to transmit the depth information only along the zero-crossings, from which the 3D information is obtainable. Toward the goal of achieving much higher efficiency and flexibility, we intend to develop a compact parametric representation of a 3D scene. At this point, we face an object representation problem, which has been a main topic in computer vision for developing recognition systems. We will also pay attention to coder control criteria other than the mean-square-error criterion.

ACKNOWLEDGMENTS

The authors had helpful conversations with P. R. Hsu and thank S. Aoki for her help in implementing the stereo algorithm.

REFERENCES

1. H. G. Musmann, P. Pirsch, and H. J. Grallert, Advances in picture coding, Proc. IEEE 73, 4, Apr. 1985, 523-548.
2. J. K. Yan and D. J. Sakrison, Encoding of images based on a two-component source model, IEEE Trans. Comm. 25, 11, Nov. 1977, 1315-1322.
3. M. Kunt, A. Ikonomopoulos, and M. Kocher, Second-generation image-coding techniques, Proc. IEEE 73, 4, Apr. 1985, 549-574.
4. H. Harashima, K. Aizawa, and T. Saito, Model-based analysis synthesis coding of videotelephone images, IEICE Trans. E72, 5, May 1989, 452-459.
5. K. Aizawa, H. Harashima, and T. Saito, Model-based analysis synthesis image coding system for a person's face, Signal Process. Image Comm. 1, 2, Oct. 1989, 139-152.
6. H. G. Musmann, M. Hötter, and J. Ostermann, Object-oriented analysis-synthesis coding of moving images, Signal Process. Image Comm. 1, 2, Oct. 1989, 117-138.
7. D. E. Pearson, Model-based image coding, in Proceedings, IEEE GLOBECOM-89, Dallas, TX, Nov. 1989, pp. 554-558.
8. R. Forchheimer and T. Kronander, Image coding - From waveform to animation, IEEE Trans. Acoust. Speech Signal Process. 37, 12, Dec. 1989, 2008-2023.
9. H. Morikawa and H. Harashima, 3-D structure extraction coding of image sequences, in Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, Apr. 1990, pp. 1969-1972.
10. J. Aloimonos, Visual shape computation, Proc. IEEE 76, 8, Aug. 1988, 899-916.
11. H. Morikawa, E. Kondo, and H. Harashima, Structural description of moving pictures for coding, in Picture Coding Symposium (PCS'91), Tokyo, Japan, Sept. 1991.
12. J. Aggarwal and N. Nandhakumar, On the computation of motion from sequences of images, Proc. IEEE 76, 8, Aug. 1988, 917-935.
13. J. K. Kearney, W. B. Thompson, and D. L. Boley, Optical flow estimation: An error analysis of gradient-based methods with local optimization, IEEE Trans. Pattern Anal. Mach. Intell. 9, 2, Mar. 1987, 229-244.
14. D. Marr and E. C. Hildreth, Theory of edge detection, Proc. Roy. Soc. London Ser. B 207, 1980, 187-217.
15. H. Morikawa and H. Harashima, Rigid and nonrigid motion analysis: Robust recovery of 3-D structure and motion, in IAPR International Workshop on Machine Vision Applications (MVA '90), Tokyo, Japan, Nov. 1990, pp. 283-286.
16. H. Morikawa and H. Harashima, Structure and motion of deformable objects from image sequences, in Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May 1991, pp. 2433-2436.
17. D. Pearson, Texture mapping in model-based image coding, Signal Process. Image Comm. 2, 4, Dec. 1990, 377-395.
18. R. Franke, Scattered data interpolation: Tests of some methods, Math. Comp. 38, Jan. 1982, 181-199.
19. W. E. L. Grimson, From Images to Surfaces: A Computational Study of the Human Early Visual System, MIT Press, Cambridge, MA, 1981.
20. D. Terzopoulos, Multilevel computational processes for visual surface reconstruction, Comput. Vision Graphics Image Process. 24, 1983, 52-96.
21. A. Blake and A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, MA, 1987.
22. U. R. Dhond and J. K. Aggarwal, Structure from stereo - A review, IEEE Trans. Systems Man Cybernet. 19, 6, Dec. 1989, 1489-1510.
23. H. Morikawa, S. Aoki, and H. Harashima, Determining 3-D Structure and Motion from a Sequence of Stereo Images, Tech. Rep. PRU89-57, IEICE, Japan, Sept. 1989. [In Japanese]
24. J. Heel, Temporally integrated surface reconstruction, in Proceedings, 3rd International Conference on Computer Vision, Osaka, Japan, Dec. 1990, pp. 292-295.
25. J. D. Foley and A. van Dam, Fundamentals of Interactive Computer Graphics, Addison-Wesley, Reading, MA, 1984.
26. M. E. Lukacs, Predictive coding of multi-viewpoint image sets, in Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, Apr. 1986, pp. 521-524.



HIROYUKI MORIKAWA received the B.E. and M.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1987 and 1989, respectively. He is currently working toward the Dr.E. degree in electrical engineering at the University of Tokyo. His research interests are in the areas of image coding, image communication, and computer vision.

HIROSHI HARASHIMA received the B.E., M.E., and Dr.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1968, 1970, and 1973, respectively. From 1973 to 1975 he was a full-time lecturer, from 1975 to 1990 he was an associate professor, and he is now a professor of electrical engineering at the University of Tokyo. In 1984 he was a visiting associate at Stanford University, Stanford, California. His research interests include communication theory, coding theory, digital modulation, image coding and processing, and digital signal processing. He received the 1973 Yonezawa Memorial Award, the 1979 Achievement Award, and the 1989 Best Paper Award from the IEICE Japan.