

JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION

Vol. 2, No. 4, December, pp. 332-344, 1991

3D Structure Extraction Coding of Image Sequences H. MORIKAWA AND H. HARASHIMA

Department of Electrical Engineering, Faculty of Engineering, The University of Tokyo, 7-3-l Hongo Bunkyo-ku, Tokyo 113, Japan

Received March 15, 1991; accepted July 3, 1991

This paper presents a 3D structure extraction coding scheme that first computes the 3D structural properties such as 3D shape, motion, and location of objects and then codes image sequences by utilizing such 3D information. The goal is to achieve efficient and flexible coding while still avoiding visual distortions through the use of 3D scene characteristics inherent in image sequences. To accomplish this, we present two multiframe algorithms for the robust estimation of such 3D structural properties, one from motion and one from stereo. The approach taken in these algorithms is to successively estimate 3D information from a longer sequence for a significant reduction in error. Three variations of 3D structure extraction coding are then presented, namely 3D motion interpolative coding, 3D motion compensation coding, and "viewpoint" compensation stereo image coding, to suggest that the approach can be viable for high-quality visual communications. © 1991 Academic Press, Inc.

1. INTRODUCTION

Significant progress in image coding, image processing, and networking has created a demand for a wide range of new video services. In particular, the compression of image data has attracted considerable interest because of its applications in the transmission and storage of two-dimensional data. To date, most image coding schemes fall into the category of waveform coding. This approach takes advantage of the statistical correlations between neighboring pixels in order to eliminate redundancy across the images.

Although mathematically tractable, the statistical models do not incorporate the structural properties of the 3D scene. More recently, we have seen a new coding approach that uses physically based models to describe the image objects. The variations in the new approach include motion compensation [1], second-generation coding [2, 3], and model-based coding (or analysis-synthesis coding) [4-8]. Motion compensation and second-generation coding techniques are novel in that physical image source models (object-based models) were first introduced in coding, but their assumption of 2D structural models in simple motion has limited the flexibility and efficiency of the coding process. On the other hand, model-based image coding promises a dramatic reduction in bit rates. The requirement of a detailed and specific model, however, has limited its applicability.

In this paper we present a new physically based model to address these limitations. This is called 3D structure extraction coding because of its extensive use of 3D structural properties of objects that are not restricted to any one shape [9]. The idea is to provide compact scene representations by understanding the image formation process, such as the three-dimensional shape, motion parameters, and location of objects, rather than image representations. Under this scheme the encoder contains a computer vision system to estimate 3D structure and motion information. The decoder is equipped with an "image synthesizer" system that reconstructs images using the structural information transmitted from the encoder.

The motivations for considering the physical 3D structural properties are threefold. First, the use of 3D structural properties enables more efficient and higher-quality coding of image sequences. We can remove the temporal redundancy in surface texture information while still avoiding such visible distortions as blocking and mosquito effects. We can also cope with the 3D effects of moving pictures such as occlusion, thereby refining the efficiency of the prediction process around the object boundaries. Second, a flexible coding control that adaptively enhances the picture quality of objects of different saliency, e.g., by allocating more bits to the objects foremost to the viewer, becomes available. Third, describing such 3D structural properties would open up the possibility of providing a creative and communicative video environment, including CG-style video handling (editing), video indexing, and video databases.

To realize such a system, we need to develop algorithms for solving the following two problems: the recovery of 3D structural information from the image sequence and the coding of the image sequences using such 3D structural information.

With regard to the recovery process, there are several visual cues that help us determine 3D structural information. Such cues include motion, stereo, shading, texture, contour, and others [10]. In studying the computation of the "shape from X" process, however, one immediately

1047-3203/91 $3.00 Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved


3D STRUCTURE EXTRACTION CODING 333

faces the problem that most existing approaches are sensitive to noise. For a general coding scheme applicable to natural images, the recovery process should be robust to noise. Thus, the approach taken in our work is to successively estimate 3D structural information by combining the information from multiple frames [11]. The two algorithms described in this paper have been designed along this approach.

With regard to the coding process, 3D structural information can be used to achieve compression of image sequences by applying 3D motion compensated interframe prediction and by encoding only the temporal update information. 3D structural information can also be used to compress stereo image sequences effectively by applying "viewpoint" compensated prediction and by removing the spatial correlation between image pairs. This consideration leads to new coding schemes such as 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation stereo picture coding.

The algorithms for obtaining reliable 3D information are described in Section 2 (shape from motion) and Section 3 (shape from stereo). Each section contains detailed descriptions of the individual stages used in the recovery process as well as experimental results on real image sequences. The application of the extracted 3D information to image sequence coding is then discussed in Section 4.

2. 3D STRUCTURAL PROPERTIES FROM MOTION

In this section we describe how 3D structural properties of rigid and nonrigid objects are robustly estimated from motion information in moving pictures. The basic organization of the recovery process is shown in Fig. 1. The process consists of three components: displacement estimation, 3D structure and motion estimation, and surface reconstruction. The following subsections contain detailed descriptions of the individual modules used in the recovery process.

2.1. Displacement Estimation

It is generally assumed that the brightness of corresponding points in two subsequent images is the same. This is formulated as

dI/dt = 0,  (1)

where I denotes the brightness of a point in the image and t is the time. This can be expanded into the gradient constraint equation

I_x u + I_y v + I_t = 0,  (2)

FIG. 1. Block diagram of the "3D shape from motion" process.

where I_x, I_y, and I_t are the brightness derivatives in the spatial and temporal directions, and (u, v) is the displacement vector [12].

The brightness derivatives in Eq. (2) are estimated from discrete images and will be inaccurate due to noise in the imaging process and to sampling measurement error. To reduce these systematic errors, we first apply a Gaussian filter and then compute derivative values from the smoothed images. Note that the temporal derivatives are also well estimated by applying this smoothing process [13].

We can compute the displacement vectors more reliably by considering only the zero-crossings of the Laplacian of the Gaussian-filtered (smoothed) image, because the derivation of Eq. (2) is based on the assumption that the gradient estimates depend only on the first derivatives of the brightness function, and the second and higher spatial derivatives of the brightness function are nearly zero along zero-crossings.

Displacement vectors are thus computed along contours formed by detecting zero-crossings of the Laplacian of a Gaussian (∇²G)-filtered image. The analytic form of the ∇²G operator [14] is given by

∇²G = ∇²[(1/(2πσ²)) exp(−(x² + y²)/(2σ²))],  (3)

where σ is the space constant of the Gaussian, and we define w (= 2√2σ) as the width of the central excitatory region of the operator.
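As an illustration, the ∇²G operator of Eq. (3) and the detection of its zero-crossings can be sketched in a few lines of numpy. The function names are ours, and the convolution step itself is omitted (any 2D convolution routine can be used):

```python
import numpy as np

def log_kernel(sigma, size=None):
    """Sample the Laplacian-of-Gaussian operator of Eq. (3) on a grid."""
    if size is None:
        size = int(8 * sigma) | 1          # odd support wide enough for the tails
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    s2 = sigma ** 2
    g = np.exp(-(x ** 2 + y ** 2) / (2 * s2)) / (2 * np.pi * s2)
    # analytic Laplacian of the Gaussian
    k = g * ((x ** 2 + y ** 2) - 2 * s2) / s2 ** 2
    return k - k.mean()                    # zero-mean so flat regions give exactly 0

def zero_crossings(f):
    """Boolean mask of pixels where the filtered image changes sign."""
    zc = np.zeros(f.shape, dtype=bool)
    zc[:, :-1] |= (f[:, :-1] * f[:, 1:]) < 0   # horizontal sign changes
    zc[:-1, :] |= (f[:-1, :] * f[1:, :]) < 0   # vertical sign changes
    return zc
```

With w = 2√2σ, the paper's w = 8 corresponds to σ = w/(2√2) ≈ 2.8.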

The gradient constraint equation (2), however, does not by itself provide the means for uniquely calculating the displacement vectors. Additional assumptions about the displacement vectors are necessary to make the solution unique. Here we use the simple assumption that displacement vectors are constant over the zero-crossings within the small neighborhood of an n_c × n_c window.
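Under this constancy assumption, every zero-crossing pixel inside the window contributes one instance of Eq. (2), so (u, v) follows from linear least squares. A minimal numpy sketch (the helper name is ours; a full implementation would visit every zero-crossing contour point):

```python
import numpy as np

def displacement_at(Ix, Iy, It, zc, cy, cx, n=8):
    """Solve Eq. (2) for one (u, v), assuming it is constant over the
    zero-crossings inside an n x n window centered at (cy, cx)."""
    h = n // 2
    ys, xs = np.nonzero(zc[cy - h:cy + h, cx - h:cx + h])
    ys += cy - h; xs += cx - h
    # one gradient-constraint equation Ix*u + Iy*v = -It per zero-crossing pixel
    A = np.stack([Ix[ys, xs], Iy[ys, xs]], axis=1)
    b = -It[ys, xs]
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```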

Figure 2 illustrates the displacement estimation process outlined above. Figure 2a shows a frame of the head and shoulder image sequence "Rotating Head."



FIG. 2. Displacement estimation. (a) The original image, containing 256 x 240 pixels. (b) The resulting zero-crossing contours. (c) The computed displacement fields on the zero-crossing contours.

The contours of zero-crossings corresponding to the sharp, high-contrast edges are illustrated in Fig. 2b. Figure 2c shows the estimated displacement vectors along the zero-crossings. The parameters w = 8, n_c = 8 are used in the example.

2.2. 3D Structure and Motion Estimation

The task of the 3D structure and motion estimation stage is to determine 3D structure and motion from the displacement vectors. In this section we first discuss the smoothness-of-motion constraint and then show how 3D information is incrementally recovered through the use of the smoothness-of-motion constraint.

2.2.1. Smoothness-of-motion constraint. To date, almost all research in structure from motion has adopted the rigidity assumption and attempted to recover the structure and motion from a limited number of views of the scene, typically two or three views [12]. However, the rigidity assumption and the two- or three-view approach can be sensitive to noise and are clearly inappropriate when dealing with deformable objects.

We instead utilize a smoothness-of-motion constraint and attempt to estimate the structure and motion from a longer sequence [15, 16]. The smoothness-of-motion constraint is based on the nature of motion: moving objects exhibit a smooth motion due to inertia and elasticity. The smoothness-of-motion constraint is valid for most physical objects if a frame sequence is acquired at a rate such that no dramatic changes take place between frames.

The advantage of the smoothness-of-motion approach is that it provides a framework for temporal integration of information over a longer sequence and enables the successive refinement of 3D structure and motion. The motivation for using a longer sequence is twofold. First, the multiframe approach reduces the sensitivity to noise by using a larger amount of correlated data. Second, the multiframe approach may have the ability to cope with nonrigid objects, since a longer sequence gives more constraints than a traditional two- or three-view approach. We believe that smoothness of motion is more general than the traditional rigidity assumption.

2.2.2. Formulation. Let X = (X, Y, Z)^T represent a spatial point coordinate, and let x = (x, y)^T represent the corresponding image plane coordinate. We assume orthographic projection such that X = x and Y = y, and the Z-axis is aligned with the optical axis. (The formulation is easily extended to perspective projection but is computationally complex.) Let the relative motion of a rigid object in the scene between time t and t + 1 consist of a rotation ω(t + 1) = (ω_x(t + 1), ω_y(t + 1), ω_z(t + 1))^T and a translation T(t + 1) = (T_x(t + 1), T_y(t + 1), T_z(t + 1))^T.

Given this situation, the point P at location X = (X, Y, Z)^T will move according to

Ẋ = ω(t + 1) × X(t) + T(t + 1),  (4)

where the dot denotes temporal differentiation. The displacement vector (ẋ, ẏ) = (u, v) on the image plane, which is the projection of the 3D motion Ẋ, can then be written as

u = −ω_z(t + 1)y(t) + ω_y(t + 1)Z(t) + T_x(t + 1)  (5)

v = ω_z(t + 1)x(t) − ω_x(t + 1)Z(t) + T_y(t + 1).  (6)

These motion field equations can be established at pixels where the displacement vectors (u, v) are obtained. The task is then to estimate the 3D information: the motion parameters ω(t + 1) and T(t + 1) as well as the depth value Z(t). Note that the motion parameters ω and T are global quantities; i.e., there is one rigid body motion for one object. The depth Z, however, may vary spatially. We compute the 3D information incrementally from a longer sequence based on the smoothness-of-motion constraint.

Suppose that at a given instant we have the estimates ω̂(t) and T̂(t) of the motion parameters and the estimate Ẑ(t) of the depth value. These estimates will in general differ from the true values of motion and depth. We wish to update the estimates to make them closer to the true solution as the new frame t + 1 becomes available. To accomplish this we introduce the update terms Δω(t) = (Δω_x(t), Δω_y(t), Δω_z(t)), ΔT(t) = (ΔT_x(t), ΔT_y(t), ΔT_z(t)), and ΔZ(t), and update the estimates by

ω(t + 1) = ω̂(t) + Δω(t)  (7)

T(t + 1) = T̂(t) + ΔT(t)  (8)

Z(t) = Ẑ(t) + ΔZ(t).  (9)

Thus, given the estimates at time t, we compute the update terms Δω(t), ΔT(t), and ΔZ(t) based on the smoothness-of-motion constraint, instead of estimating ω(t + 1), T(t + 1), and Z(t) directly. To compute the update terms we substitute Eqs. (7), (8), (9) into Eqs. (5), (6) to obtain the motion field equations

u = −(ω̂_z(t) + Δω_z(t))y(t) + (ω̂_y(t) + Δω_y(t))(Ẑ(t) + ΔZ(t)) + T̂_x(t) + ΔT_x(t)

v = (ω̂_z(t) + Δω_z(t))x(t) − (ω̂_x(t) + Δω_x(t))(Ẑ(t) + ΔZ(t)) + T̂_y(t) + ΔT_y(t).  (10)

The smoothness-of-motion constraint means that the update terms Δω(t), ΔT(t), and ΔZ(t) should be small. Consequently, we introduce the functional ||P||² as a measure of the smoothness, and formulate the smoothness-of-motion constraint as a functional ||P||² to be minimized,

||P(t)||² = α||Δω(t)||² + β||ΔT(t)||² + γ Σ_{i=1}^n (ΔZ_i(t))²,  (11)

where α, β, and γ are scale parameters, || · || is the L2 norm, n is the number of points considered in the computation, and Z_i is the depth value of the ith point.

This smoothness-of-motion constraint, i.e., the requirement that the functional ||P||² be as small as possible, can be incorporated into the motion field equation (10). That is, to determine the update terms Δω(t), ΔT(t), and ΔZ(t), we choose the function E to be minimized as a sum of two terms: the first is the error function based on Eq. (10) in the least-squares sense, and the second is the cost function (11),

E(Δω(t), ΔT(t), ΔZ(t)) = Σ_{i=1}^n ((u_i − û_i)² + (v_i − v̂_i)²) + ||P(t)||²,  (12)

where

û_i = −(ω̂_z(t) + Δω_z(t))y_i(t) + (ω̂_y(t) + Δω_y(t))(Ẑ_i(t) + ΔZ_i(t)) + T̂_x(t) + ΔT_x(t)

v̂_i = (ω̂_z(t) + Δω_z(t))x_i(t) − (ω̂_x(t) + Δω_x(t))(Ẑ_i(t) + ΔZ_i(t)) + T̂_y(t) + ΔT_y(t).  (13)

FIG. 3. Estimation of 3D structure information on the zero-crossing contours (frame 43).

After the update terms Δω(t), ΔT(t), and ΔZ_i(t) have been determined by minimizing the functional E, the new 3D information, ω(t + 1), T(t + 1), and Z_i(t), is derived from Eqs. (7), (8), (9).

The operation of the incremental recovery process on a rotating head video sequence is illustrated in Fig. 3. This figure shows the estimates of the depth value along the zero-crossings at frame 43. The estimates are obtained incrementally from the previous estimates ω̂, T̂, and Ẑ at frame 42. The eyes and nose can be seen in the estimated depth information. The parameters used in Eq. (11) are α = 100, β = 1, γ = 3. The value of the parameter α is designed to be almost 10² times larger than the values of β and γ, since Δω is in radians while ΔT and ΔZ are in pixel units. Furthermore, these values are chosen from the trade-off among the speed of convergence, robustness to noise, and the ability to cope with nonrigid objects. Empirical studies show that the scheme works well with noise and nonrigid motion, and that the estimation results are not very sensitive to the parameter selection [15, 16].
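One way to carry out the minimization of Eq. (12) can be sketched as follows. The update terms enter Eq. (10) through products such as Δω_y · ΔZ_i, so a simple option (our simplification, not necessarily the authors' exact procedure) is to drop these second-order products, which turns Eq. (12) into a regularized linear least-squares problem in which the cost (11) acts as a Tikhonov term. T_z is omitted here because it does not appear in the orthographic equations (5), (6):

```python
import numpy as np

def update_terms(x, y, Z, u, v, w_hat, T_hat, alpha=100.0, beta=1.0, gamma=3.0):
    """One incremental update of Eqs. (7)-(9): linearize Eq. (10) by dropping
    second-order products of update terms and minimize Eq. (12) as a
    regularized linear least-squares problem in (dw, dT, dZ_1..dZ_n)."""
    n = len(x)
    wx, wy, wz = w_hat
    Tx, Ty = T_hat          # T_z is unobservable under orthographic projection
    A = np.zeros((2 * n, 5 + n))
    b = np.zeros(2 * n)
    for i in range(n):
        # residual of Eq. (5): u_i - (-wz*y_i + wy*Z_i + Tx)
        A[2*i, 1] = Z[i]; A[2*i, 2] = -y[i]; A[2*i, 3] = 1.0; A[2*i, 5+i] = wy
        b[2*i] = u[i] - (-wz * y[i] + wy * Z[i] + Tx)
        # residual of Eq. (6): v_i - (wz*x_i - wx*Z_i + Ty)
        A[2*i+1, 0] = -Z[i]; A[2*i+1, 2] = x[i]; A[2*i+1, 4] = 1.0; A[2*i+1, 5+i] = -wx
        b[2*i+1] = v[i] - (wz * x[i] - wx * Z[i] + Ty)
    # smoothness-of-motion cost (11) acts as Tikhonov regularization
    W = np.diag([alpha] * 3 + [beta] * 2 + [gamma] * n)
    p = np.linalg.solve(A.T @ A + W, A.T @ b)
    return p[:3], p[3:5], p[5:]   # dw, dT, dZ
```

When the current estimates already explain the observed displacements, the residual b vanishes and all update terms are zero, as the smoothness-of-motion constraint demands.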

We have described above the recovery process of computing displacement vectors from Eq. (2) and then using them in Eq. (12) to obtain structure and motion information. Instead of this two-stage process, we can also consider a direct method: by plugging the motion field equation (10) into the gradient constraint equation (2), we obtain one equation that links image brightness gradients to structure and motion parameters.



2.3. Surface Reconstruction

The displacement estimation algorithm described in Section 2.1 utilizes the zero-crossings, and the subsequent structure from motion computation (Section 2.2) derives depth information Z_i(t) only at the locations of zero-crossings (x_i, y_i). From the viewpoint of image coding, it may be useful to produce dense or complete surface representations. Some type of filling-in or interpolation of depth information is needed. To this end we reconstruct a smooth surface by interpolating an initial representation consisting of depth information at the zero-crossings. The motivations behind this are twofold. First, the absence of a zero-crossing tells us that no physical property is changing and that the surface is smooth to some extent. Second, the smooth surface makes the interframe prediction efficient in the coding process [17].

Reconstructing smooth surfaces is a problem that has been dealt with extensively [18-21]. Among several algorithms, we employ a form of surface splines [19, 20] which is computationally simple. Given the depth information Z_i(x, y, t) along zero-crossings, we seek to determine a desired surface function Z(x, y, t) by minimizing

E_m(Z(x, y, t)) = E_smooth + E_close = ∬ (Z_xx² + 2Z_xy² + Z_yy²) dx dy + μ Σ (Z(x, y, t) − Z_i(x, y, t))²,  (14)

where the subscripts to Z denote partial derivatives, the summation takes place along the zero-crossings where there is a depth measurement Z_i(x, y, t), and μ is a nonnegative scale parameter. The first term enforces "smoothness" of the surface, and the second "closeness" to the input data.
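A discrete approximation of this interpolation can be computed by simple relaxation sweeps. The sketch below uses a first-order "membrane" smoothness term rather than the spline of Eq. (14), for brevity, and periodic boundaries via np.roll; each sweep moves every pixel toward the average of its four neighbors, pulled toward the data wherever a measurement exists:

```python
import numpy as np

def interpolate_surface(Z_meas, mask, mu=1.0, iters=500):
    """Fill in a dense depth map from sparse zero-crossing measurements
    (mask == True where a measurement Z_meas exists)."""
    Z = np.where(mask, Z_meas, Z_meas[mask].mean())  # start from the mean depth
    for _ in range(iters):
        nb = (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
              np.roll(Z, 1, 1) + np.roll(Z, -1, 1)) / 4.0
        # weighted average of the smoothness term and the closeness term
        Z = np.where(mask, (nb + mu * Z_meas) / (1.0 + mu), nb)
    return Z
```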

The current estimate of the surface Z(x, y, t) can then be used to predict the estimates of the depth value Z(x, y, t + 1) in the next iteration. The predicted surface Z(x, y, t + 1) is computed by the geometrical transformation of the surface Z(x, y, t) according to the estimates of the motion

FIG. 4. The original images. (a) "Miss America" frame 97.

FIG. 5. Perspective views of the recovered surface. (a) "Rotating Head" frame 43. (b) "Miss America" frame 97.

parameters ω(t + 1) and T(t + 1). The predicted surface Z(x, y, t + 1) then serves as an initial depth estimate of the zero-crossings in the next iteration to repeat the process described in Section 2.2. A full 3D surface is thus built up over an extended time.

We have applied the above process to the two head and shoulder video sequences "Rotating Head" and "Miss America" (Fig. 4). Figure 5a is an example of applying the algorithm to the estimated depth values shown in Fig. 3 (parameter μ = 1). The estimated structure tends to be a smooth one, due mainly to the smoothing process of the surface reconstruction for reducing the noise effect. The Miss America example is illustrated in Fig. 5b. The Miss America sequence contains little 3D motion; consequently, the estimated shape tends to be flat.

3. 3D STRUCTURAL PROPERTIES FROM STEREO

In the previous section we described the 3D recovery process using the motion cue. Another important passive cue for determining 3D structural properties is stereo. In this section we present the “shape from stereo” process developed for obtaining reliable 3D structural properties.

The most difficult problem in shape from stereo is to establish a correspondence of features from the two images. In a number of existing approaches attempts are made to obtain reliable correspondence from a set of stereo images at a given instant, i.e., from a snapshot of stereo images [22]. Because the measurement of image brightness introduces error, such a single snapshot cannot provide very accurate information. To overcome this difficulty, we approach this problem by integrating information from an extended viewing period of stereo images [23]. The scheme consists of three components: matching, prediction with motion estimation, and update of the surface information, as shown in Fig. 6. The following subsections contain detailed descriptions of the individual modules.

In what follows we choose the coordinate system as shown in Fig. 7, with the origin at the left camera's center of projection (principal distance f) and the optical axis aligned with the Z axis. A point P at location X = (X, Y, Z)^T in the scene is imaged on the left and right



FIG. 6. Block diagram of the "3D shape from stereo" process.

image planes at pixel locations x_l = (x_l, y_l)^T and x_r = (x_r, y_r)^T, respectively.

3.1. Stereo Matching

As shown in the structural block diagram in Fig. 6, as the next stereo image pair t + 1 becomes available, the stereo matching process is performed to obtain disparities, and consequently the depth information Z(x, y, t + 1).

In the matching process, we first compute the zero-crossings of the ∇²G-filtered left and right images for the symbolic features to be matched. (The precise form of the ∇²G operator is given in Eq. (3).) In addition to the locations of the zero-crossings, we also extract the contrast sign of the zero-crossings (whether the filtered values change from positive to negative, or negative to positive, as we scan along horizontal lines) [19].

Given a set of zero-crossing representations for each of the images, the matching process takes place between the zero-crossings that are of the same contrast sign, and the 3D depth Z(x, y, t + 1) of the scene along zero-crossings is extracted using triangulation. For each zero-crossing in one image (say the left) at position (x_l, y_l), a set of candidate zero-crossing points is selected from the region of the right image

{(x_r, y_r) | x_l − w_c ≤ x_r ≤ x_l, y_r = y_l},  (15)

where w_c is a given estimate of the maximum disparity (which we may initially assume to be some arbitrary value). If more than one match is found within the above region, then we compute the correlation value of an n_c × n_c window around a zero-crossing in the left and right filtered images, and we accept the match for which the correlation value is largest as well as greater than a given threshold Th. Otherwise, the match at that point is left ambiguous, and the computation of the depth is not performed.
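The candidate search of Eq. (15) combined with the correlation test can be sketched as follows (the function name and input layout are ours; the assumed inputs are the two ∇²G-filtered images, zero-crossing masks, and contrast-sign maps):

```python
import numpy as np

def match_zero_crossings(fl, fr, zc_l, zc_r, sign_l, sign_r, wc=50, n=11, Th=0.995):
    """Match each left zero-crossing against same-sign candidates on the same
    scanline within the disparity range of Eq. (15); keep the match whose
    windowed correlation of the filtered images is largest and above Th."""
    h = n // 2
    H, W = fl.shape
    disparity = {}
    for yl, xl in zip(*np.nonzero(zc_l)):
        if yl < h or yl >= H - h or xl < h or xl >= W - h:
            continue                              # window would leave the image
        pl = fl[yl-h:yl+h+1, xl-h:xl+h+1].ravel()
        best, best_c = None, Th
        for xr in range(max(h, xl - wc), xl + 1):  # candidate region, Eq. (15)
            if xr >= W - h or not zc_r[yl, xr] or sign_r[yl, xr] != sign_l[yl, xl]:
                continue
            pr = fr[yl-h:yl+h+1, xr-h:xr+h+1].ravel()
            c = np.dot(pl, pr) / (np.linalg.norm(pl) * np.linalg.norm(pr) + 1e-12)
            if c > best_c:
                best, best_c = xr, c
        if best is not None:
            disparity[(yl, xl)] = xl - best        # depth follows by triangulation
    return disparity
```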

Since an incorrect match point might satisfy the matching constraint described above, we cannot expect all of the estimated depth information Z(x, y, t + 1) to be accurate. Conceptually, information from multiple frames may be useful to reduce such errors and produce a useful estimate over time. The following subsections describe the way in which multiple frame measurements can be integrated.

3.2. Prediction with Motion Estimation

As shown in the block diagram in Fig. 6, as the next stereo image pair t + 1 becomes available, the motion estimation and prediction process is also performed in parallel with the stereo matching process. The task of the motion estimation and prediction stage is to estimate the 3D motion and to determine what our current depth estimate Z(x, y, t) will look like at time t + 1.

In the motion estimation stage, we can exploit the incremental nature of the recovery process. At every instant in time, the process produces an estimate of the depth Z(x, y, t), and we use this estimate of the depth map as input to the motion estimation stage, as shown in the block diagram of Fig. 6. The task of the motion estimation stage is then to compute the motion parameters ω and T between time t and t + 1, given a depth map Z(x, y, t), the gradient constraint equation (2) of the left image, and the equations of perspective projection

x = f X/Z  (16)

y = f Y/Z.  (17)

FIG. 7. Relationship between the coordinate system and the cameras.



The motion field equations in the case of perspective projection can be obtained by substituting (16), (17) in (4),

u = (f T_x − x T_z)/Z + (1/f)(−ω_x xy + ω_y(x² + f²) − ω_z y f)  (18)

v = (f T_y − y T_z)/Z + (1/f)(−ω_x(y² + f²) + ω_y xy + ω_z x f).  (19)

By plugging these motion field equations into the gradient constraint equation (2), we can obtain one equation that links the image brightness gradients I_x, I_y, I_t to the motion parameters ω and T linearly. We obtain one such linear equation for each pixel in the region of interest. The motion parameters are then computed using a least-squares method.
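This direct method reduces to one linear system. A sketch under the stated assumptions (pixels collected into flat arrays; image coordinates measured from the principal point; helper name ours):

```python
import numpy as np

def direct_motion(Ix, Iy, It, x, y, Z, f):
    """Direct motion estimation: substituting the motion field Eqs. (18), (19)
    into the gradient constraint Eq. (2) gives one linear equation per pixel
    in (wx, wy, wz, Tx, Ty, Tz); solve them all by least squares."""
    A = np.stack([
        -Ix * x * y / f - Iy * (y**2 + f**2) / f,   # coefficient of wx
        Ix * (x**2 + f**2) / f + Iy * x * y / f,    # coefficient of wy
        -Ix * y + Iy * x,                           # coefficient of wz
        f * Ix / Z,                                 # coefficient of Tx
        f * Iy / Z,                                 # coefficient of Ty
        -(Ix * x + Iy * y) / Z,                     # coefficient of Tz
    ], axis=1)
    sol, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return sol[:3], sol[3:]   # (wx, wy, wz), (Tx, Ty, Tz)
```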

Given the motion parameters ω and T, we now apply the prediction process to determine what our current surface estimate Z(x, y, t) will look like at time t + 1. This requires a geometrical transformation of the surface Z(x, y, t) according to the equation of rigid body motion as shown in (4). The surface Z(x, y, t) is, however, known only through samples on a discrete grid. Consequently, we resample the transformed surface by interpolating the given transformed samples to obtain the predicted surface Z⁻(x, y, t + 1).

In summary, the motion estimation and prediction process converts the current estimate Z(x, y, t) into the new estimate Z⁻(x, y, t + 1). This new estimate can then be combined with the depth information Z(x, y, t + 1) obtained from the stereo matching process to update the estimate in the manner described below.

3.3. Update of Surface Information

As we see from the block diagram of Fig. 6, the task of the update stage is to take as input the depth measurements Z(x, y, t + 1) along zero-crossings and combine them with the new estimate Z⁻(x, y, t + 1) to update the estimate. The update process then finds a surface Ẑ(x, y, t + 1) that is as close as possible to both Z(x, y, t + 1) and Z⁻(x, y, t + 1). The values of both Z(x, y, t + 1) and Z⁻(x, y, t + 1), however, are subject to errors. To reduce the sensitivity to such errors, we assume that the surface being observed exhibits some amount of smoothness. In essence, the update process can be considered to be the surface reconstruction procedure described in Section 2.3.

Formally, similar to Eq. (14), we compute the desired surface function Ẑ(x, y, t + 1) by minimizing the energy function

E(Ẑ) = E_smooth(Ẑ) + μ ∫∫_R (Ẑ(x, y, t + 1) − Ẑ⁻(x, y, t + 1))² dx dy + λ Σ_{(x,y)∈D} (Ẑ(x, y, t + 1) − Z(x, y, t + 1))²,    (20)

where the summation takes place over the region D for which depth measurements Z(x, y, t + 1) are obtained, and μ and λ are nonnegative scale parameters. This new surface Ẑ(x, y, t + 1) can then be used as the initial depth estimate in the next iteration of the surface reconstruction procedure, and the process repeats itself.

The energy function E contains three terms, E_smooth, E_close1, and E_close2, and the scale parameters μ and λ determine their relative strengths. Recently, Heel [24] showed that the surface reconstruction process is essentially identical to the update procedure of the Kalman filter when the scale parameters are chosen as the inverse variances of the estimates. Similarly, determining the optimal values for μ and λ requires an analysis of the errors in Z(x, y, t + 1) and Ẑ⁻(x, y, t + 1), so that the scale parameters may be chosen to smooth out the effect of these errors.
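Dropping the smoothness term for clarity, the inverse-variance choice reduces the update to a per-pixel Kalman-style fusion of prediction and measurement. The following is a sketch under that simplification; the function name and the boolean mask for the measured region D are ours:

```python
import numpy as np

def kalman_depth_update(Z_pred, var_pred, Z_meas, var_meas, measured):
    """Per-pixel Kalman-style fusion of the motion-predicted depth
    Z_pred and the stereo measurement Z_meas, each weighted by its
    inverse variance (after Heel [24]). `measured` marks pixels in D
    where a zero-crossing measurement exists; elsewhere the
    prediction is carried forward unchanged."""
    gain = var_pred / (var_pred + var_meas)          # Kalman gain
    Z_new = np.where(measured, Z_pred + gain * (Z_meas - Z_pred), Z_pred)
    var_new = np.where(measured, (1.0 - gain) * var_pred, var_pred)
    return Z_new, var_new
```

The fused variance is always smaller than either input variance at measured pixels, which is how the estimate improves over the sequence.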

3.4. Experimental Results

The multiple-frame approach increases the robustness and accuracy of the solution by providing additional redundancy to the algorithm. To confirm this, we use a head-and-shoulders stereo image sequence (256 × 240 pixels, 8 bits). The scene is shown in Fig. 8. This scene poses a number of difficulties because it contains large regions of uniform brightness, which makes stereo matching difficult. Since true depth maps are rarely available for real scenes, we assess the depth estimates in a qualitative manner here.

The left and right images are convolved with the ∇²G filter with central width w = 8. The zero-crossings obtained from these convolutions are shown in Fig. 9. The contrast signs of zero-crossings are displayed in black or white. The disparity map obtained by matching the zero-crossings of the first stereo image pair is shown in Fig. 10 (parameters w_c = 50, n = 11, and Th = 0.995). The disparity map is displayed using intensity to encode depth, so that brighter disparity points are closer to the camera. In this experiment, the total number of zero-crossing pixels is 2280, the number of matched zero-crossings is 1837, and the maximum disparity is 43 pixels. Figures 11a, 11b, and 11c show the



FIG. 8. The original image sequence. (a) Frame 1. (b) Frame 11. (c) Frame 21.

perspective views of the recovered surface of the face obtained after 1, 11, and 21 iterations of the algorithm. The parameters μ = 1 and λ = 0.1 are used in the surface reconstruction equation (20). These perspective views show how the estimate improves over time; in particular, the shape of the nose becomes more distinct. This can be explained by the fact that the zero-crossings around the nose become more explicit as the face rotates, and the algorithm integrates this information over time.

FIG. 10. Disparity map (frame 1).

4. USING 3D STRUCTURAL PROPERTIES FOR CODING

In the previous sections we have described how the 3D structural properties of rigid and nonrigid objects are robustly estimated from a sequence of images. Given such 3D structural properties of scenes, several manipulations of real images become possible. From the coding point of view, we are interested in how the 3D structural properties are incorporated into coding. We refer to this coding technique as "3D structure extraction coding."

In this section we present the concept of 3D structure extraction coding and then show the usefulness of 3D structural descriptions in 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation stereo image coding.

4.1. 3D Structure Extraction Coding

The image is highly structured and organized: image components can be grouped on the basis of regularities, e.g., closeness, similar form, continuity, and similarity. It is this internal structuring that allows us to proceed to spatial or semantic understanding, and also to obtain a compact and efficient description of images. Image coding techniques can then be considered to have two main processes, as shown in Fig. 12: recovery of these regularities (messages) and coding.

Well-developed waveform coding schemes utilize these regularities as statistical models and extract messages through prediction or orthogonal transforms.

FIG. 11. Perspective views of the recovered surface using incremental stereo. (a) Frame 1. (b) Frame 11. (c) Frame 21.

FIG. 9. Zero-crossings with contrast sign (frame 1).

FIG. 12. Image coding.

FIG. 13. 3D structure extraction coding.

Despite the prevalence of waveform coding, the use of simple statistical models generally means that the efficiency, flexibility, and interactivity of the coding process are limited.

One approach to overcoming this difficulty is to utilize structure and motion information, which is among the most important attributes to be determined from image sequences. In developing this approach, the models describing the structure and motion information become important. With more sophisticated models, more coding gain can be achieved while still avoiding visual distortions. To date, however, only relatively simple models have been investigated, e.g., a 2D rectangular object undergoing 2D translational movement, as used in the block-matching method. This is due mainly to computational complexity and real-time processing requirements.

3D structure extraction coding, in contrast, employs a physically based source model of a 3D moving object that is not restricted to any particular shape. Table 1 gives a brief description of image coding techniques and their associated image source models. 3D structure extraction coding thus consists of a 3D information recovery stage and a coding stage, as shown in Fig. 13. Compared with existing coding schemes, much heavier emphasis is placed on the first block of Fig. 13, which extracts the 3D structural properties, such as three-dimensional shape, motion parameters, and location of objects.

Given such 3D structural properties, we are able to use them for synthesizing new images, for example, by rotating, translating, zooming, or deforming the objects and remapping them onto the image plane. These characteristics allow one to utilize such 3D structural properties for

TABLE 1
Image Coding Techniques and Image Source Models

Coding techniques                  Image source models
Waveform coding                    Stochastic model
Motion compensation coding         2D planar and motion model
Second-generation coding           2D structure model (contours, texture)
3D structure extraction coding     3D structure and motion model
Model-based coding                 3D model (a priori knowledge)

coding image sequences flexibly and efficiently, as we discuss below.

4.2. Examples of 3D Structure Extraction Coding

In this section we present three image coding procedures that rely on the estimated 3D structural properties: 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation stereo image coding.

4.2.1. 3D motion interpolative coding. Frame interpolation is an attractive scheme both for further reducing the bit rate and for obtaining high-quality images. Simple interpolation techniques such as frame repetition and linear interpolation, however, show visible degradations like jerkiness and blurring.

If 3D structure and motion information between transmitted frames is available, it is straightforward to synthesize the skipped images while maintaining the interpolated image's naturalness. The frame to be interpolated is computed as follows. Let the transmitted frames have the associated temporal positions t = 0 for frame k and t = 1 for frame k + 1. Suppose we denote the motion of objects between frames k and k + 1 by a rotation matrix R(ω) and a translation T. For the small interframe motion considered here, the rotation matrix R can be written in terms of a rotation vector ω = (ω_x, ω_y, ω_z)^T as

R(ω) = [  1    −ω_z   ω_y
         ω_z    1    −ω_x
        −ω_y   ω_x    1  ].    (21)

The frame to be interpolated at t = τ (0 < τ < 1) is then calculated as a function of ω, T, and τ as

S_τ(R(τω)X + τT) = (1 − τ)S_k(X) + τS_{k+1}(RX + T),    (22)

where S_k(X) represents the image intensities of frame k corresponding to the location X = (X, Y, Z)^T.
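For a single surface point, the interpolation rule can be sketched as follows. This is a hypothetical illustration: the small-angle form of R(ω) and the intensity-lookup helpers S_k, S_k1 are our assumptions, not the authors' code.

```python
import numpy as np

def rotation_matrix(w):
    """Small-angle rotation matrix R(w) for w = (wx, wy, wz)."""
    wx, wy, wz = w
    return np.array([[1., -wz,  wy],
                     [wz,  1., -wx],
                     [-wy, wx,  1.]])

def interpolate_intensity(X, S_k, S_k1, w, T, tau, f):
    """Evaluate Eq. (22) for one 3D point X on the object surface:
    move X through the partial motion to t = tau, and blend the
    intensities this point shows in frames k and k+1. S_k and S_k1
    are lookup functions mapping an image point (x, y) to brightness.
    Returns the image position at t = tau and the blended intensity."""
    R_full = rotation_matrix(w)                    # motion over the whole interval
    R_tau = rotation_matrix(tau * np.asarray(w))   # partial motion to t = tau
    X_tau = R_tau @ X + tau * np.asarray(T)
    X_k1 = R_full @ X + np.asarray(T)
    # Perspective projection of the intermediate position.
    pos_tau = (f * X_tau[0] / X_tau[2], f * X_tau[1] / X_tau[2])
    s = (1 - tau) * S_k(f * X[0] / X[2], f * X[1] / X[2]) \
        + tau * S_k1(f * X_k1[0] / X_k1[2], f * X_k1[1] / X_k1[2])
    return pos_tau, s
```

A full interpolator would run this over every surface sample and rasterize the results onto the image grid.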

The above 3D motion interpolative coding has been investigated by means of computer simulation on two head-and-shoulders video sequences, Rotating Head and Miss America (Fig. 4). The frame rate of the original sequences is reduced to 7.5 frames/s by omitting three frames out of four. From these transmitted frames the 3D structure and motion information is estimated using the



process described in Section 2, and the skipped frames are obtained by 3D motion interpolation.

The 3D motion interpolated images have been compared with results obtained by frame repetition and linear interpolation. Figure 14 shows the images obtained by linear interpolation (left) and 3D motion interpolation (right). From these experiments we find that the 3D information recovery process described in Section 2 performs sufficiently well for 3D motion interpolation and generates a better-quality image sequence, because the 3D motion of objects is taken into account to preserve the natural impression of motion.

4.2.2. 3D motion compensation coding. The 3D structural properties can also be utilized by incorporating them into the predictive loop of the coding process. We refer to this as "3D motion compensation coding." The key feature of 3D motion compensation coding is that the transmitted 3D information represents one or more global attributes of the image sequence. Owing to this property, we can avoid visible distortions such as blocking and mosquito effects by transmitting the global 3D parameters of objects, rather than local motion parameters such as those obtained from the block-matching method. In addition, we can improve the coding efficiency both by coping with 3D effects, such as rotation, occlusion, zoom, and pan, in the prediction stage and by transmitting only these global attributes.
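The predictive loop described above can be sketched schematically. The helper functions below are hypothetical stand-ins for the analysis, prediction, and waveform-coding stages in the text; only the loop structure is the point.

```python
def encode_sequence(frames, estimate_3d, predict_frame, waveform_code):
    """Skeleton of a 3D motion compensation encoder (a sketch, not the
    authors' system). Per frame, only the global parameters (w, T) and
    the waveform-coded prediction residual are transmitted. The encoder
    tracks the decoder-side reconstruction so both stay in step."""
    recon = frames[0]                      # intra-coded first frame
    stream = [recon]
    for frame in frames[1:]:
        w, T, depth = estimate_3d(recon, frame)        # scene analysis
        pred = predict_frame(recon, depth, w, T)       # 3D prediction
        residual = frame - pred
        coded = waveform_code(residual)                # e.g., transform coding
        stream.append((w, T, coded))
        recon = pred + coded               # decoder-side reconstruction
    return stream
```

Because the residual is coded by a conventional waveform coder, analysis errors degrade efficiency but never prevent reconstruction, which is exactly the combination of analysis-synthesis and waveform coding discussed below.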

This 3D motion compensation coding method can be viewed from another perspective as a combination of analysis-synthesis coding and conventional waveform coding. Errors occurring at the scene analysis stage are compensated using well-developed waveform coding techniques. This approach thus simplifies the scene analysis task, which is a central topic in computer vision, and appears to be a promising route to analysis-synthesis coding.

FIG. 14. 3D motion interpolation: linear interpolated image (left) and 3D motion interpolated image (right). (a) "Rotating Head." (b) "Miss America."

FIG. 15. 3D motion compensation: frame difference image (left) and 3D motion compensated prediction error image (right). (a) "Rotating Head." (b) "Miss America."

To give an impression of the prediction performance of 3D motion compensation coding, we present two examples. As with 3D motion interpolation above, the 3D structural properties are estimated incrementally over time as described in Section 2. The right side of Fig. 15 shows the 3D motion compensated prediction error images for the two image sequences Rotating Head and Miss America. For display, the error is amplified by a factor of 10 and truncated to 255, and an inverse bit assignment is used (255 corresponds to black). We see that some errors remain around edges where the prediction is insufficient. To overcome this, it would be necessary to incorporate a local prediction process in addition to the global one. The left side of Fig. 15 shows the simple frame difference images, which demonstrate the amount of motion in the sequences.

In these experiments, we predict the next frame assuming that intensity remains constant as the object moves. In general, however, image intensity changes along the motion trajectory. This is easily understood from the illumination model commonly used in computer graphics,

I = I_ambient + I_incident(l, n, r, v),    (23)

where I is the image intensity and l, n, r, and v are the unit vector to the light source, the unit surface normal, the unit vector in the direction of reflection, and the unit vector toward the viewpoint, respectively [25]. Thus, if the illumination can be modeled as ambient light (light incident from the environment, not from specific light sources), the simple prediction process in our experiments works well. Otherwise, we must consider the second term I_incident and compensate for the image intensity changes due to motion to improve the prediction efficiency. Empirical studies show, however, that the simple prediction process used in our experiments is sufficient if the motion between frames is small and the images contain no highlights.
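A Phong-style instance of Eq. (23) makes the two terms concrete. The coefficients and the specular exponent below are illustrative choices, not values from the text; the reflection vector r is derived from n and l in the usual way.

```python
import numpy as np

def intensity(n, l, v, I_ambient=0.2, kd=0.6, ks=0.2, shininess=8):
    """Phong-style instance of Eq. (23): an ambient term plus an
    incident term depending on the unit surface normal n, the unit
    light direction l, and the unit view direction v. The incident
    term splits into diffuse and specular parts."""
    r = 2 * np.dot(n, l) * n - l                 # mirror reflection of l about n
    diffuse = kd * max(np.dot(n, l), 0.0)
    specular = ks * max(np.dot(r, v), 0.0) ** shininess
    return I_ambient + diffuse + specular
```

Under pure ambient light (kd = ks = 0) the intensity of a surface point is independent of its orientation, which is why the constant-intensity prediction in the experiments holds in that case.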

4.2.3. "Viewpoint" compensation coding. Developing techniques for 3D video systems and holographic television is a long-held dream of both optics and electronics researchers. One of the technical problems in developing such 3D systems is coding the extremely large amount of 3D information. Here we can exploit the 3D nature of the information to be transmitted and utilize the estimated 3D structural properties in coding.

The idea of the coding is the following. Given the 3D structural properties, we can synthesize new images from different viewpoints by geometrical transformations. This transformation performs the necessary rotations, translations, and perspective transformations according to the configuration of the coordinate systems in which the stereo (or multiviewpoint) images are taken. The transformed version of one image is then a good predictor of the other images, and in this way we remove the spatial correlation of stereo images. The 3D structural properties are thus utilized in a spatial prediction process, instead of the temporal prediction used in the 3D motion compensation coding described in Section 4.2.2. We refer to this coding as "viewpoint compensation coding," and we consider it an extension of disparity compensation coding [26], which measures the disparity using a local method, e.g., the block-matching method.
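For the special case of a rectified stereo pair, viewpoint compensation reduces to a depth-dependent horizontal shift of each pixel. The sketch below assumes that simplified geometry (focal length f, baseline b, disparity d = f·b/Z); the general case in the text uses the full rotation and perspective transformation between the cameras, and the hole handling here is deliberately crude.

```python
import numpy as np

def predict_left_from_right(right, Z, f, b):
    """Predict the left image of a rectified stereo pair from the
    right image and the estimated depth Z, by shifting each right
    pixel by its disparity d = f*b/Z. Returns the prediction and a
    mask of filled pixels; unfilled pixels correspond to areas that
    appear or disappear and would be interpolated in practice."""
    h, w = right.shape
    pred = np.zeros_like(right)
    filled = np.zeros((h, w), dtype=bool)
    d = np.rint(f * b / Z).astype(int)           # per-pixel integer disparity
    for y in range(h):
        for x in range(w):
            xl = x + d[y, x]                     # column in the left image
            if 0 <= xl < w:
                pred[y, xl] = right[y, x]
                filled[y, xl] = True
    return pred, filled
```

The residual between this prediction and the true left image is what the stereo coder would transmit, in place of the left image itself.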

An example of this coding is shown in Fig. 16. We use the stereo image sequence shown in Fig. 8. The 3D structural properties are estimated from the stereo image sequence incrementally over time as described in Section 3. The configuration of the two cameras is known. Using this knowledge, we synthesize the predicted left image from the right image alone, as shown in Fig. 16a. In this example the areas that appear or disappear in the prediction process are linearly interpolated to obtain the predicted image. The right side of Fig. 16b shows the prediction error image between the original left image and the predicted image shown in the right side of Fig. 16a. The left side of Fig. 16b shows the prediction error


FIG. 16. Viewpoint compensation stereo image coding. (a) The original right image (left) and the predicted left image from the original right image using viewpoint compensation (right). (b) The error image between the original left image and the translated right image (left) and the error image between the original left image and the viewpoint compensated image (right).

image between the original left image and the predicted image obtained by simply translating the right image so that the overall error energy is minimized. From this example we see that viewpoint compensation coding copes well with the 3D nature of stereo images.

5. CONCLUSION

We have presented two multiframe algorithms for the dense estimation of 3D structure and motion information, one from motion and one from stereo, and also their application to image coding, i.e., 3D structure extraction coding.

The key idea of these 3D recovery algorithms is to successively estimate 3D structural properties by combining information from multiple frames. The use of a longer sequence makes the 3D recovery process robust to noise. The preliminary results show that these schemes work well with noisy natural images.

One assumption made in this recovery process is that the scene has already been segmented. To relax this, it would be necessary to segment the image into regions corresponding to independently moving objects and then run the recovery process on each region independently. Toward this end, we are currently studying an incremental segmentation algorithm that includes dynamic occlusion analysis to improve the segmentation results over time [11].



As for coding, we have presented the concept of 3D structure extraction coding and given as examples three image coding procedures that rely on the 3D structural properties of the scene: 3D motion interpolative coding, 3D motion compensation coding, and viewpoint compensation coding. Instead of the statistical correlations or 2D properties used in conventional coding techniques, 3D structure extraction coding computes the 3D structural properties to be transmitted, such as three-dimensional shape, motion, and location of objects. Using such 3D structural properties inherent in image sequences, we can achieve efficient and flexible coding while still avoiding visual distortions.

Such a 3D structure extraction coding system, however, is still at a very early stage and is not complete. It seems clear that the efficiency of the coding schemes must be evaluated quantitatively. An interesting but difficult problem is developing a method for coding the estimated 3D structural properties, along with an appropriate coding strategy. The most straightforward way of coding the estimated surface is to transmit the depth information only along the zero-crossings, from which the 3D information is obtainable. Toward the goal of achieving much higher efficiency and flexibility, we intend to develop a compact parametric representation of a 3D scene. At this point, we face an object representation problem, which has been a main topic in computer vision for developing recognition systems. We will also pay attention to coder control criteria other than the mean-square-error criterion.

ACKNOWLEDGMENTS

The authors had helpful conversations with P. R. Hsu and thank S. Aoki for her help in implementing the stereo algorithm.

REFERENCES

1. H. G. Musmann, P. Pirsch, and H. J. Grallert, Advances in picture coding, Proc. IEEE 73, 4, Apr. 1985, 523-548.
2. J. K. Yan and D. J. Sakrison, Encoding of images based on a two-component source model, IEEE Trans. Comm. 25, 11, Nov. 1977, 1315-1322.
3. M. Kunt, A. Ikonomopoulos, and M. Kocher, Second-generation image-coding techniques, Proc. IEEE 73, 4, Apr. 1985, 549-574.
4. H. Harashima, K. Aizawa, and T. Saito, Model-based analysis synthesis coding of videotelephone images, IEICE Trans. E72, 5, May 1989, 452-459.
5. K. Aizawa, H. Harashima, and T. Saito, Model-based analysis synthesis image coding system for a person's face, Signal Process. Image Comm. 1, 2, Oct. 1989, 139-152.
6. H. G. Musmann, M. Hötter, and J. Ostermann, Object-oriented analysis-synthesis coding of moving images, Signal Process. Image Comm. 1, 2, Oct. 1989, 117-138.
7. D. E. Pearson, Model-based image coding, in Proceedings, IEEE GLOBECOM-89, Dallas, TX, Nov. 1989, pp. 554-558.
8. R. Forchheimer and T. Kronander, Image coding - From waveform to animation, IEEE Trans. Acoust. Speech Signal Process. 37, 12, Dec. 1989, 2008-2023.
9. H. Morikawa and H. Harashima, 3-D structure extraction coding of image sequences, in Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, Apr. 1990, pp. 1969-1972.
10. J. Aloimonos, Visual shape computation, Proc. IEEE 76, 8, Aug. 1988, 899-916.
11. H. Morikawa, E. Kondo, and H. Harashima, Structural description of moving pictures for coding, in Picture Coding Symposium (PCS'91), Tokyo, Japan, Sept. 1991.
12. J. Aggarwal and N. Nandhakumar, On the computation of motion from sequences of images, Proc. IEEE 76, 8, Aug. 1988, 917-935.
13. J. K. Kearney, W. B. Thompson, and D. L. Boley, Optical flow estimation: An error analysis of gradient-based methods with local optimization, IEEE Trans. Pattern Anal. Mach. Intell. 9, 2, Mar. 1987, 229-244.
14. D. Marr and E. C. Hildreth, Theory of edge detection, Proc. Roy. Soc. London Ser. B 207, 1980, 187-217.
15. H. Morikawa and H. Harashima, Rigid and nonrigid motion analysis: Robust recovery of 3-D structure and motion, in IAPR International Workshop on Machine Vision Applications (MVA '90), Tokyo, Japan, Nov. 1990, pp. 283-286.
16. H. Morikawa and H. Harashima, Structure and motion of deformable objects from image sequences, in Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May 1991, pp. 2433-2436.
17. D. Pearson, Texture mapping in model-based image coding, Signal Process. Image Comm. 2, 4, Dec. 1990, 377-395.
18. R. Franke, Scattered data interpolation: Tests of some methods, Math. Comp. 38, Jan. 1982, 181-199.
19. W. E. L. Grimson, From Images to Surfaces: A Computational Study of the Human Early Visual System, MIT Press, Cambridge, MA, 1981.
20. D. Terzopoulos, Multilevel computational processes for visual surface reconstruction, Comput. Vision Graphics Image Process. 24, 1983, 52-96.
21. A. Blake and A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, MA, 1987.
22. U. R. Dhond and J. K. Aggarwal, Structure from stereo - A review, IEEE Trans. Systems Man Cybernet. 19, 6, Dec. 1989, 1489-1510.
23. H. Morikawa, S. Aoki, and H. Harashima, Determining 3-D Structure and Motion from a Sequence of Stereo Images, Tech. Rep. PRU89-57, IEICE, Japan, Sept. 1989. [In Japanese]
24. J. Heel, Temporally integrated surface reconstruction, in Proceedings, 3rd International Conference on Computer Vision, Osaka, Japan, Dec. 1990, pp. 292-295.
25. J. D. Foley and A. van Dam, Fundamentals of Interactive Computer Graphics, Addison-Wesley, Reading, MA, 1984.
26. M. E. Lukacs, Predictive coding of multi-viewpoint image sets, in Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, Apr. 1986, pp. 521-524.



HIROYUKI MORIKAWA received the B.E. and M.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1987 and 1989, respectively. He is currently working toward the Dr.E. degree in electrical engineering at the University of Tokyo. His research interests are in the areas of image coding, image communication, and computer vision.

HIROSHI HARASHIMA received the B.E., M.E., and Dr.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan, in 1968, 1970, and 1973, respectively. From 1973 to 1975 he was a full-time lecturer, from 1975 to 1990 he was an associate professor, and he is now a professor of electrical engineering at the University of Tokyo. In 1984 he was a visiting associate at Stanford University, Stanford, California. His research interests include communication theory, coding theory, digital modulation, image coding and processing, and digital signal processing. He received the 1973 Yonezawa Memorial Award, the 1979 Achievement Award, and the 1989 Best Paper Award from the IEICE Japan.