
Research Collection

Report

Dynamic point cloud compression for free viewpoint video

Author(s): Lamboray, Edouard; Waschbüsch, Michael; Würmlin, Stephan; Pfister, Hanspeter; Gross, Markus H.

Publication Date: 2003

Permanent Link: https://doi.org/10.3929/ethz-a-006731956

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


CS Technical Report #430, December 10, 2003

Institute of Computational Science

Eidgenössische Technische Hochschule Zürich
Swiss Federal Institute of Technology Zurich

Dynamic Point Cloud Compression for Free Viewpoint Video

Edouard Lamboray, Michael Waschbüsch, Stephan Würmlin, Hanspeter Pfister*, Markus Gross

Computer Science Department
ETH Zurich, Switzerland

e-mail: {lamboray, waschbuesch, wuermlin, grossm}@inf.ethz.ch
http://graphics.ethz.ch/

*MERL - Mitsubishi Electric Research Laboratories
Cambridge, MA, USA

e-mail: [email protected]
http://www.merl.com/


Dynamic Point Cloud Compression for Free Viewpoint Video

Edouard Lamboray  Michael Waschbüsch  Stephan Würmlin  Hanspeter Pfister*  Markus Gross
Computer Graphics Laboratory, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland

*MERL - Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
{lamboray, waschbuesch, wuermlin, grossm}@inf.ethz.ch  [email protected]

Abstract

In this paper, we present a coding framework addressing the compression of dynamic 3D point clouds which represent real-world objects and which result from a video acquisition using multiple cameras. The encoding is performed as an off-line process and is not time-critical. The decoding, however, must allow for real-time rendering of the dynamic 3D point cloud. We introduce a compression framework which encodes multiple attributes like depth and color of 3D video fragments into progressive streams. The reference data structure is aligned on the original camera input images and thus allows for easy view-dependent decoding. The separate encoding of the object's silhouette allows the use of shape-adaptive compression algorithms. A novel differential coding approach permits random access in constant time throughout the complete data set and thus enables true free viewpoint video.

1. Introduction

In the past, many 3D reconstruction methods have been investigated, but no method was presented which is tailored to streaming and compression of free viewpoint video. In this paper, we present a compression framework for 3D video fragments, a dynamic point-based representation tailored for real-time streaming and display of free viewpoint videos [14]. Our representation generalizes 2D video pixels towards 3D irregular point samples, and thus we combine the simplicity of conventional 2D video processing with the power of more complex polygonal representations for free viewpoint video. 3D video fragments are generic in the sense that they work with any real-time 3D reconstruction method which extracts depth from images. The representation is thus complementary to model-based scene reconstruction methods using volumetric (e.g. space carving, voxel coloring), polygonal (e.g. polygonal visual hulls) or image-based (e.g. image-based visual hulls) scene reconstruction. Therefore, the framework can be used as a clean abstraction separating the 3D video representation, its streaming and compression from 3D reconstruction.

2. Problem analysis

2.1. Prerequisites

The input data of free viewpoint video systems acquiring real-world objects typically consists of multiple concentric video sequences of the same scene. Before recording, the cameras have been calibrated, and for each camera intrinsic and extrinsic calibration parameters are available to both encoder and decoder. The multiple 2D video sequences providing the input data are recorded with synchronized cameras. Additionally, we have at our disposal for every input frame an image mask telling which pixels belong to the object of interest, i.e. are foreground pixels, and which pixels belong to the background.

After the input images have been processed by a 3D reconstruction algorithm, which is not specified in this document, each input frame provides for each foreground pixel a depth value describing, in combination with the camera calibration parameters, the geometry of the object, and a color value. For each foreground pixel, a surface normal vector and a splat size can be encoded as optional attributes. In general, it is possible to encode any attributes describing the visual appearance of an object. Given a specific rendering scheme or target application, any subset of the above pixel attributes might be sufficient.
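To make the attribute set concrete, the following sketch shows one possible per-pixel record. The report does not prescribe a concrete layout, so the type and field names (FragmentPixel, splat_size, ...) are illustrative assumptions, not part of the framework.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class FragmentPixel:
        """Hypothetical record for one foreground pixel of one camera frame.
        Depth plus the camera calibration yields the 3D position; surface
        normal and splat size are optional attributes."""
        depth: float                                          # along the camera ray
        color: Tuple[int, int, int]                           # e.g. a YUV triplet
        normal: Optional[Tuple[float, float, float]] = None   # optional attribute
        splat_size: Optional[float] = None                    # optional attribute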

In summary, our compression framework can be applied to every static or dynamic 3D data set which can be completely described by a set of concentric 2D views. In practice, this applies to all acquisition setups for 3D reconstruction of real-world objects; e.g., a 3D scanner provides only one single image, but our coding framework still applies. Furthermore, the coding framework can smoothly be extended to an arbitrary camera setup where only a few cameras see the object at a time.

2.2. Application domains

Our streaming format for dynamic 3D point clouds should address the application domains depicted in Figure 2. A high compression ratio is beneficial if the 3D point cloud is transmitted via a low-bandwidth network.

Figure 1: Input data: a) Concentric camera setup. b) Example image from one camera. c) Segmentation mask of image b).


But a high compression ratio is difficult to achieve at a low decoding complexity. In our opinion, a 3D compression framework should allow for decoding at a relatively low complexity, and thus support a wide range of target hardware devices. The problem of low-bandwidth transmissions should be addressed by building upon a progressive representation of the data. In fact, bandwidth and CPU performance are often correlated; e.g., high-end computing nodes have, in general, access to a broadband network connection, and nodes with low-bandwidth network access also have limited processing power.

2.3. Goals

The goal of our research is compression of free viewpoint video. In this paper, we distinguish between constrained and unconstrained free viewpoint video.

Definition 1: Constrained free viewpoint video
After decoding, the point cloud can be rendered from any possible direction, but either only small viewpoint changes are allowed during rendering, or discontinuities in rendering are tolerated in the presence of large viewpoint changes.

Definition 2: Unconstrained free viewpoint video
After decoding, the point cloud can be rendered from any possible direction, the viewpoint being a function of the rendering time, and discontinuities during rendering are minimized.

Examples of spatio-temporal viewpoint trajectories are depicted in Figure 3. In the constrained case, the trajectory is confined to a narrow band. In the unconstrained case, the trajectory can be arbitrary.

Taking into account that the number of cameras is potentially large and that real-time decoding is required, it is not realistic on current hardware to decode all the cameras first and then render the scene for a given viewpoint. Hence, it is necessary to reduce the set of processed cameras already during decoding. Thus, view-dependent decoding must be supported by the proposed coding framework. In the remainder of this section, we define view-dependent decoding.

Consider the following variables:

• t: the recording time, which is a discrete time corresponding to the frame rate of the recording cameras.

• θ: the rendering time, which is completely asynchronous to the recording time and has a potentially much higher frame rate.

• v(θ): the viewpoint as a function of the rendering time.

• D(v(θ), t): the set of data decoded for a given viewpoint v(θ) and a timestamp t.

• R(v(θ), t): the set of data decoded for a given viewpoint v(θ) and a timestamp t which becomes visible after rendering.

Note that D(v(θ), t) is the result after decoding and R(v(θ), t) is the result after rendering. In any case, we have R ⊆ D.

Definition 3: Optimal view dependent decoding
Optimal view dependent decoding is achieved if R(v(θ), t) = D(v(θ), t).

This implies that the decoder, for a given rendering frame, only decodes information of the corresponding recording time frame which becomes visible in the final rendering. This strong condition can be relaxed using a weaker formulation:

Definition 4: View dependent decoding
View dependent decoding minimizes the cardinality of the complement D(v(θ), t) \ R(v(θ), t).

This implies that the decoder, for a given rendering frame, maximizes the ratio of decoded information of the corresponding recording time frame which is visible in the final rendering versus the total amount of decoded information for the given rendering instant.
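As a toy illustration of Definitions 3 and 4, the decoded and visible data can be modeled as plain sets; the function name and the set elements below are hypothetical, not part of the framework.

    def wasted_decode_ratio(D: set, R: set) -> float:
        """|D minus R| / |D|: fraction of decoded data that never becomes
        visible. View-dependent decoding (Definition 4) minimizes this;
        optimal view-dependent decoding (Definition 3) drives it to zero."""
        assert R <= D, "rendered data is always a subset of decoded data"
        return len(D - R) / len(D) if D else 0.0

    # Decoding three cameras while only two contribute visible points:
    D = {("cam0", 0), ("cam1", 0), ("cam2", 0)}
    R = {("cam0", 0), ("cam1", 0)}
    print(wasted_decode_ratio(D, R))  # 0.33...: one third decoded in vain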

2.4. Playback features

The 3D streaming data format should address the following features:

Figure 2: Application domains for streaming free viewpoint video. [Bandwidth versus CPU performance, with regions for low decoding complexity/low compression, high decoding complexity/high compression, and low decoding complexity/progressive representation.]

Figure 3: Possible viewpoint trajectories for constrained and unconstrained free viewpoint video. [Space (xyz) versus time (t) over cameras 1 ... N and frames 0 ... M; the constrained trajectory stays within a narrow band, the unconstrained one is arbitrary.]


• Multi-resolution: scalability and progressivity with respect to resolution. This feature can be achieved using either a progressive encoding of the data, as proposed by the embedded zero-tree wavelet coding algorithm (EZW) [12] or progressive JPEG, or a progressive sampling of the data, as proposed in [14]. The progressive encoding approach is more promising than the progressive sampling; a toy illustration follows this list.

• Multi-rate: scalability with respect to time, i.e. playback of the sequence is possible at a different frame rate than the recording frame rate. Backward playback should also be possible.

• View-dependent decoding: In this paper, we address the problem of encoding data for view-dependent decoding. The algorithm for deciding which cameras are required for the view-dependent decoding of a given rendering frame is similar to the technique described in [14], i.e., given a viewpoint and the camera calibration data, we compute the contributing cameras and read the data accordingly before decoding.
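To illustrate why progressive encoding supports multi-resolution decoding, the sketch below transmits integer coefficients most-significant bit plane first, so any stream prefix yields a coarser approximation. It is a simplified stand-in, not the EZW coder of [12]: zero-trees and arithmetic coding are omitted.

    import numpy as np

    def bitplane_encode(coeffs, n_planes=8):
        """Split integer coefficient magnitudes into bit planes, most
        significant first; signs are sent up front for simplicity."""
        q = np.abs(coeffs).astype(np.int64)
        signs = np.signbit(coeffs)
        planes = [(q >> p) & 1 for p in range(n_planes - 1, -1, -1)]
        return signs, planes

    def bitplane_decode(signs, planes, n_planes=8):
        """Rebuild from however many leading planes were received; fewer
        planes simply mean a coarser quantization of every coefficient."""
        q = np.zeros(len(signs), dtype=np.int64)
        for i, plane in enumerate(planes):
            q |= plane.astype(np.int64) << (n_planes - 1 - i)
        return np.where(signs, -q, q)

    coeffs = np.array([93, -41, 7, -2])
    signs, planes = bitplane_encode(coeffs)
    print(bitplane_decode(signs, planes[:4]))  # coarse prefix: [80 -32 0 0]
    print(bitplane_decode(signs, planes))      # full stream:   [93 -41 7 -2]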

3. Compression

Compression is achieved by exploiting coherence in the 3D point cloud data. In our compression framework, which uses the recorded camera images as input data structure, coherence can be exploited in image space, in camera space and in the time dimension.

3.1. Coherence in image space

Similar to standard image compression algorithms, 2D transforms can be applied to our input data, and thus the 2D coherence in the data is exploited.

However, we are only interested in the part of the image which lies within the silhouette. We propose a shape-adaptive wavelet encoder, which puts the colors of the relevant pixels into a linear order by traversing the silhouette scan line by scan line in alternating directions, see Figure 4. We then apply a one-dimensional wavelet transform to this linear list of coefficients using the lifting scheme. The resulting wavelet coefficients are lossily encoded up to a desired bit rate by a one-dimensional zero-tree algorithm and finally compressed with an arithmetic coder. To allow for a correct decompression, the silhouette must be stored losslessly along with the compressed data. The compression performance of the 1D and 2D approaches is analyzed in Section 6.1.
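A minimal sketch of the two ingredients just described, under assumed inputs (a binary mask and a single-channel attribute image); the helper names are hypothetical, and a real coder would iterate the lifting step over several levels before zero-tree coding.

    import numpy as np

    def silhouette_scan(mask, image):
        """Linearize foreground pixels by traversing the silhouette scan
        line by scan line in alternating directions (cf. Figure 4)."""
        values = []
        for y in range(mask.shape[0]):
            xs = np.nonzero(mask[y])[0]
            if y % 2 == 1:                  # reverse every other scan line
                xs = xs[::-1]
            values.extend(float(image[y, x]) for x in xs)
        return np.asarray(values)

    def haar_lift(signal):
        """One level of a 1D Haar wavelet transform via lifting:
        a predict step (details) followed by an update step (means)."""
        s = signal[0::2].astype(np.float64).copy()   # even samples
        d = signal[1::2].astype(np.float64).copy()   # odd samples
        n = len(d)
        d -= s[:n]            # predict: detail = odd - even neighbor
        s[:n] += d / 2        # update: approximation keeps the local mean
        return s, d           # s: low-pass band, d: high-pass band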

3.2. Coherence in time

In the case of constrained free viewpoint video, we can reasonably assume that in most cases the sequence is played back with increasing t and at normal playback speed. Hence, we can use the information from previous frames for building the current frame.

For each camera i, c_i(t) is the decoding function which returns the contribution of the respective camera to the 3D point cloud at time t. If temporal coherence is exploited by using the information from previous frames, the decoding function thus has the form

    c_i(t) = c_i(t') + Δc_i(t)    (1)

with t' < t and where Δc_i(t) describes the specific contribution of frame t.

Note that all popular 2D video compression algorithms belong to this class of codecs. This approach is also feasible for constrained free viewpoint video according to Definition 1.

In the case of unconstrained free viewpoint video, however, it is more difficult to exploit temporal coherence. The decoder is supposed to implement a function f which returns a 3D point cloud at any time instant θ during rendering. This implies a viewpoint v(θ) and a mapping function t = m(θ) which maps the rendering time to the recording time. A weight function w(v) tells us, given the viewpoint v, which cameras contribute to the visible part of the 3D point cloud. In a first approximation, we can assume that w(v) returns 1 if a camera has a visible contribution and 0 if not.

We get

    f(v(θ), t) = w(v(θ)) · f(t) = (w_0(θ), ..., w_{N-1}(θ)) · (c_0(t), ..., c_{N-1}(t))^T

Assume θ' = m^{-1}(t'). If w_i(θ') = 0, the decoding of c_i(t) requires the decoding of c_i(t') with t ≠ t', and the condition of view-dependent decoding is violated. Hence, optimal view-dependent decoding can only be implemented using decoders which can be defined as

    c_i(t) = C_i + Δc_i(t)    (2)

with C_i independent of t. Thus, a decoder supporting unconstrained free viewpoint video needs to implement decoding in constant time for frames addressed in a random order.
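The operational difference between Equations (1) and (2) can be sketched as two decoder loops; the class names and the frame representation (arrays added per delta) are illustrative assumptions.

    class SequentialDecoder:
        """Eq. (1): c_i(t) = c_i(t') + delta_i(t) with t' < t. Reaching an
        arbitrary frame t forces decoding the whole chain of predecessors."""
        def decode(self, first_frame, deltas, t):
            frame = first_frame
            for k in range(t + 1):       # O(t) work: no random access
                frame = frame + deltas[k]
            return frame

    class RandomAccessDecoder:
        """Eq. (2): c_i(t) = C_i + delta_i(t) with C_i independent of t.
        Any frame costs one key frame plus one delta: O(1) per frame."""
        def __init__(self, key_frame):
            self.key = key_frame         # C_i, e.g. the window average
        def decode(self, deltas, t):
            return self.key + deltas[t]  # constant time for any t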

3.3. Coherence in camera space

First we discuss the coherence in camera space for the constrained free viewpoint case where v(θ) = v. If camera coherence is exploited, the decoding function has the form

    c_i(t) = c_j(t) + Δc_i(t)  with  i ≠ j.    (3)

In general, we need the data from two or three cameras for rendering a 3D point cloud from any arbitrary viewpoint. However, unlike in the time dimension, we do not have a preferential direction in space, and it is not possible to make a reasonable prediction during encoding from which viewpoints the point cloud will preferentially be rendered.

Figure 4: Scan line traversal of pixels inside a silhouette.



Hence, the combination of cameras for exploiting camera space coherence can only be done in an arbitrary fashion or, at best, by combining the cameras such that an optimal compression ratio is achieved. However, this combination of cameras can again lead to unfortunate configurations in which several cameras which are not directly visible for v(θ) need to be decoded.

As in the time dimension, the problem becomes even more difficult for unconstrained free viewpoint decoding, if we need to calculate (3) and we additionally have w_j(θ') = 0.

From the preceding analysis, we conclude that traditional 2D video coding algorithms do not fulfill the requirements of unconstrained free viewpoint video. In Section 5, we propose a coding scheme for constrained free viewpoint video which uses conventional video coding algorithms, and introduce a new coding scheme for unconstrained free viewpoint video which follows the guidelines of Equation (2), i.e. allows spatio-temporal random access in constant time.

4. Compression framework

The underlying data representation of our free viewpoint video format is a dynamic point cloud, in which each point has a set of attributes. Since the point attributes are separately stored and compressed, a referencing scheme allowing for the unique association between points and their attributes is mandatory. Using the camera images as building elements of the data structure, each point is uniquely identified by its position in image space and its camera identifier. Furthermore, looking separately at each camera image, we are only interested in foreground pixels, which contribute to the point cloud describing the 3D object.

Thus, we use the segmentation masks from the camera images as reference for all subsequent coding schemes. In order to avoid shifts and wrong associations of attributes and points, a lossless encoding of the segmentation mask is required. This lossless segmentation mask must be at the disposal of all encoders and decoders. However, all pixel attributes can be encoded by a lossy scheme. Nevertheless, a lossless or almost lossless decoding should be possible if all data is available. The stream finally consists of key frames and delta frames, which rely upon a prediction based on the closest key frame. The key frame coding and the prediction differ depending on whether we encode constrained or unconstrained free viewpoint video.
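One way to realize this referencing scheme is to index attributes by a pixel's rank among the foreground pixels of its camera image, which also shows why the mask must be lossless: a single flipped mask bit shifts every subsequent association. The function names, field widths and scan order below are assumptions of this sketch.

    import numpy as np

    def attribute_index(mask, y, x):
        """Rank of foreground pixel (y, x) in row-major scan order; the
        separately stored attribute streams share this index."""
        flat = mask.astype(bool).ravel()
        pos = y * mask.shape[1] + x
        assert flat[pos], "only foreground pixels carry attributes"
        return int(np.count_nonzero(flat[:pos]))

    def point_id(cam, y, x):
        """Globally unique point identifier from camera id and image
        position, assuming < 256 cameras and images up to 4096x4096."""
        return (cam << 24) | (y << 12) | x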

The overall compression framework is depicted in Figure 5. From the segmentation masks and the camera calibration data, a geometric reconstruction of the object of interest is computed. The outputs of the geometric reconstruction are 3D positions, surface normal vectors, and splat sizes. Note that any geometric reconstruction scheme which is able to deliver these streams from the provided input data can be used. The data streams are compressed and, along with the texture information and the segmentation masks, multiplexed into an embedded, progressive free viewpoint video stream. The camera calibration data is encoded as side information. In the following sections, the specific coders are described in detail.

Note that all color coding is performed in the 4:1:1 YUV format. The compression performance results, as suggested by the MPEG standardization committee, should be generated in YUV color space [5]. However, our framework is also capable of handling texture data in other formats. Furthermore, instead of computing and encoding the surface normal vectors and the splat sizes during the reconstruction and encoding process, these attributes can also be evaluated in real-time during rendering. In that case, no explicit encoding for surface normals and splat sizes is necessary.
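For reference, a sketch of a 4:1:1 conversion: full-resolution luma with chroma averaged over groups of four horizontal pixels. The BT.601-style weights are a common choice and an assumption here; the report only states that coding operates on YUV 4:1:1 data.

    import numpy as np

    def rgb_to_yuv411(rgb):
        """rgb: H x W x 3 array with values in [0, 1]. Returns full-resolution
        Y and horizontally 4x-subsampled U and V (one chroma sample per four
        luma samples, i.e. 4:1:1)."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b     # BT.601 luma (assumed)
        u = 0.492 * (b - y)
        v = 0.877 * (r - y)
        h, w = y.shape
        w4 = w - (w % 4)                          # ignore ragged columns here
        u4 = u[:, :w4].reshape(h, -1, 4).mean(axis=2)
        v4 = v[:, :w4].reshape(h, -1, 4).mean(axis=2)
        return y, u4, v4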

5. Free viewpoint video coding

5.1. Constrained free viewpoint video

Our compression framework can use existing MPEG coders in the case of constrained free viewpoint video coding. The silhouettes are encoded using lossless binary shape coding [3]. The depth values are quantized as luminance values, and a conventional MPEG-4 video object codec can be utilized. Surface normal vectors can be progressively encoded using an octahedron subdivision of the unit sphere [1]. The splat sizes can simply be quantized to one byte, and the codewords are represented in a gray scale MPEG video object.

The complete decoding of one constrained free viewpoint video frame requires, for each reconstruction view, three gray scale MPEG video objects (depth, surface normal, splat size) and one color video object, i.e. if two reconstruction views are used, a total of six gray scale video objects and two color video objects.
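For the normal attribute, the sketch below quantizes a unit normal via a flat octahedral parameterization. This is in the spirit of, but not identical to, the progressive octahedron subdivision of [1]; the 8-bit resolution is an arbitrary choice.

    import numpy as np

    def encode_normal_oct(n, bits=8):
        """Project a unit normal onto the octahedron |x|+|y|+|z| = 1, unfold
        it to a square, and quantize the two coordinates to 'bits' bits."""
        x, y, z = np.asarray(n, dtype=np.float64) / np.abs(n).sum()
        if z < 0:                       # fold the lower hemisphere outward
            x, y = (1 - abs(y)) * np.sign(x), (1 - abs(x)) * np.sign(y)
        scale = (1 << bits) - 1
        qu = int(round((x * 0.5 + 0.5) * scale))
        qv = int(round((y * 0.5 + 0.5) * scale))
        return qu, qv                   # two bytes per normal at bits=8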

5.2. Unconstrained free viewpoint video

For unconstrained free viewpoint video we propose a coding scheme which uses averaged information as key frames. The principle of average coding is depicted in Figure 6. For each foreground pixel of a time window, an average value for each attribute is computed. This information becomes the key frame, which is, within the respective time window, independent of the recording time.


Figure 5: Compression framework. [Segmentation masks (binary images) from cameras 1 ... N feed shape coding and, with the camera calibration data, the geometry reconstruction; position, surface normal and splat size coding, together with color coding of the texture data (YUV 4:1:1 format) and the camera calibration side information, are multiplexed into an embedded, progressive free viewpoint video stream.]


The delta frame is simply the difference information of the original frame and the respective key frame. It is true that for every window of N frames which use the same key frame, we need to encode N+1 image frames. However, we can use fairly high values for N, and thus the additional cost of coding the key frame is distributed over a large number of delta frames.

The streams can thus be decomposed into a base layer, which contains the averaged key frames, and an enhancement layer, composed of the delta frames, which allows decoding the complete information for every frame, see Figure 7. The key frames are only independent of the recording time inside each window. But the windows can cover a large number of frames without limiting the navigability of the stream, as coders following the scheme of Equation (1) do.
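A minimal sketch of average coding for one window and one attribute image per frame; masking and the separate per-attribute streams are omitted, and the function names are illustrative.

    import numpy as np

    def average_code_window(frames):
        """Key frame = per-pixel mean of the N frames in the window (base
        layer); each delta frame is the difference to it (enhancement layer)."""
        stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
        key = stack.mean(axis=0)
        deltas = stack - key
        return key, deltas

    def decode_frame(key, deltas, t):
        """Random access inside the window: any frame is key plus delta."""
        return key + deltas[t]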

Figure 10 illustrates the principle of average coding using a window of texture data from one single camera.

The silhouettes are again encoded using lossless MPEG-4 binary shape coding [3]. All point attributes are encoded using the average coding scheme: the key frame averages the attributes of the window; the delta frame contains the difference image with respect to the key frame.

The depth values are encoded using the shape-adaptive 1D wavelet approach described in Section 3.1, and the colors are encoded using the embedded zero-tree wavelet algorithm for image compression [12].

If we assume large windows, the decoding of the key frames can be neglected and the complexity is about the same as in the constrained free viewpoint case. If surface normals and splat sizes are determined during rendering, we need for each reconstruction camera one binary shape image, one gray scale image and one color image.

5.3. Multiplexing

Since all attributes are encoded independently and progressively, a stream satisfying a given target bit rate can be obtained by multiplexing the individual attribute bit streams into one bit stream for transfer. Appropriate contributions of the single attribute bit streams are determined according to the desired rate-distortion characteristics.

For example, a bit stream of 300 kbps may contain 30 kbps of shape information, 60 kbps of position information, 120 kbps of color information, 45 kbps of surface normal information and 45 kbps of splat size information.
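The corresponding rate allocation can be sketched as follows; the share values mirror the 300 kbps example above, but the chunking policy and names are assumptions rather than a specification of the report's multiplexer.

    TARGET_KBPS = 300
    SHARES = {"shape": 0.10, "position": 0.20, "color": 0.40,
              "normal": 0.15, "splat": 0.15}   # sums to 1.0

    def truncation_points(fps=25):
        """Bytes to take per frame from each progressive attribute stream so
        the multiplexed stream meets the target rate. In the framework the
        shape stream is lossless, so its real size would be charged first."""
        budget = TARGET_KBPS * 1000 / 8 / fps   # total bytes per frame
        return {name: int(share * budget) for name, share in SHARES.items()}

    print(truncation_points())
    # {'shape': 150, 'position': 300, 'color': 600, 'normal': 225, 'splat': 225}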

5.4. Extension to full dynamic 3D scenes

The encoders, described so far for video objects, can also be used to encode entire dynamic scenes. Distinct objects in the scene can be encoded in different layers. Static objects are described by a single key frame. A scene graph, which is stored as side information, describes the spatial relations between the different layers. View-dependent decoding is again enabled by decoding only those layers which are visible from the current arbitrary viewpoint.

6. Preliminary results

6.1. Wavelet filters for depth image compression

We investigated the rate-distortion function of the EZW image compression algorithm for different wavelet filters applied to a depth image. The results are presented in Figure 8 and Figure 9. As input we used a 512x512 depth image generated from a synthetic 3D human body. The bit rate is given in bits per surfel, where we consider a foreground pixel in the depth image as a surfel. Our example depth image contained 40k surfels.

6.2. Bit rates for unconstrained free viewpoint video

We compressed an MPEG test data sequence with our unconstrained free viewpoint video codec [5]. The test sequence consists of input data from 25 different camera positions, each with a resolution of 320x240 pixels. We assume a frame rate of 25 frames per second. The average key frame covers a window of 200 frames, which corresponds to a time of 8 seconds. Only disparity and colors are coded; normals and splat radii are estimated on the fly during rendering. For view-dependent decoding and rendering, the three cameras nearest to the virtual viewpoint are chosen.

Figure 6: Principle of average coding. [Frames k·N ... k·N+N-1 of camera i are averaged into key frame k; the key frame and each delta frame are progressively encoded with EZW.]

Figure 7: Decomposition of a stream into base and enhancement layers. [Base layer: key frame 0; enhancement layer: delta frames 0 ... N-1.]

Figure 8: Rate-distortion diagram for different EZW encoded wavelet transform coefficients of a synthetic depth image. [PSNR in dB (35-65) versus bits per surfel (0-2.0) for Haar, Daub4, Daub6, Villa3-6 and biorthogonal 1-D filters.]


The resulting data stream consists of approximately 8000 surfels per frame. We compressed it with several example bit rates as specified in Table 1. To further improve the compression performance we use downscaled depth images which are re-expanded to full resolution during reconstruction. The silhouette images have to be compressed losslessly, which produces an additional amount of data of 0.3 bits per surfel. Snapshots from the respective sequences are shown in Figure 11.

7. Related work

In [14], we propose a point-based system for real-time 3D reconstruction, streaming and rendering which does not make any assumptions about the shape of the reconstructed object. Our approach uses point samples as a straightforward generalization of 2D video pixels into 3D space. Thus a point sample holds, in addition to its color, a number of geometrical attributes, and a one-to-one relation between 3D points and foreground pixels in the respective 2D video images is guaranteed.

For shape coding, we rely on lossless coding techniques for binary images [3, 7]. The compression is based on a context-sensitive adaptive binary arithmetic coder.

In our free viewpoint video framework, images are compressed using a wavelet decomposition followed by zero-tree coding of the respective coefficients [11, 12]. Finally, an arithmetic entropy coding is applied [8]. Alternatively, MPEG-4 video object coding, based on shape-adaptive discrete cosine transform coding, can be used [6].

Several coding techniques for large but static point representations have been proposed in the literature. Rusinkiewicz and Levoy presented Streaming QSplat [10], a view-dependent progressive transmission technique for a multi-resolution rendering system which is based on a hierarchical bounding sphere data structure and splat rendering [9]. In [1], Botsch et al. use an octree data structure for storing point sampled geometry and show that typical data sets can be encoded with less than 5 bits per point for coding tree connectivity and geometry information. Including surface normal and color attributes, the authors report memory requirements between 8 and 13 bits per point. A similar compression performance is achieved by a progressive encoding scheme for isosurfaces using an adaptive octree and fine level placement of surface samples [4].

Briceno et al. propose to reorganize the data from dynamic 3D objects into 2D images [2]. This representation allows video compression techniques to be deployed for coding animated meshes. This approach, however, fails in the case of unconstrained free viewpoint video.

Vedula et al. developed a free viewpoint video system based on the computation of a 3D scene flow and spatio-temporal view interpolation [13]. However, the coding of the 3D scene flow representation is not addressed.

8. Conclusions

This paper introduces a compression framework for free viewpoint video using a point-based data representation. The deployment of a novel average coding scheme enables unconstrained spatio-temporal navigation throughout the 3D video stream. All data is progressively encoded, and hence a 3D video stream can be generated for different target bit rates.

In the future, additional components and investigations are required. The framework is independent of the 3D reconstruction method; we currently use a variant of the image-based visual hulls method, but other reconstruction algorithms should be investigated. Furthermore, the framework is independent of the actual attribute codecs; the only requirement on the codecs is support for progressive decoding capabilities. The specific codecs also need to be investigated in detail. Finally, an algorithm optimizing the window length for average coding should be developed.

References

[1] M. Botsch, A. Wiratanaya, and L. Kobbelt. Efficient high quality rendering of point sampled geometry. In Proceedings of the 13th Eurographics Workshop on Rendering, pages 53-64, 2002.

[2] H. Briceno, P. Sander, L. McMillan, S. Gortler, and H. Hoppe. Geometry videos. In Proceedings of the ACM Symposium on Computer Animation 2003, July 2003.

[3] A. K. Katsaggelos, L. P. Kondi, F. W. Meier, J. Ostermann, and G. M. Schuster. MPEG-4 and rate-distortion-based shape-coding techniques. Proceedings of the IEEE, 86(6):1126-1154, June 1998.

Figure 9: File length for the different EZW encoded wavelet transform coefficients of the same synthetic depth image. [Complete file length in bytes (0-20000) for Haar, Daub4, Daub6, Villa3-6 and biorthogonal 1-D filters.]

Table 1: Example bit rates for unconstrained free viewpoint video with three reconstruction cameras.

    Image resolution         Depth [bits/surfel]   Color [bits/surfel]   Total bit rate [bits/s]
    107x80                   0.166                 0.7                   350k
    160x120                  0.5                   1.5                   560k
    320x240                  1.5                   1.5                   740k
    320x240                  2.0                   2.0                   920k
    320x240 (uncompressed)   8.0                   8.0                   10200k



[4] H. Lee, M. Desbrun, and P. Schröder. Progressive encoding of complex isosurfaces. In Proceedings of SIGGRAPH 03, pages 471-475. ACM SIGGRAPH, July 2003.

[5] MPEG-3DAV. Description of exploration experiments in 3DAV. ISO/IEC JTC1/SC29/WG11 N5700, July 2003.

[6] J. Ostermann, E. S. Jang, J.-S. Shin, and T. Chen. Coding of arbitrarily shaped video objects in MPEG-4. In Proceedings of the International Conference on Image Processing, pages 496-499, 1997.

[7] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, and R. Arps. An overview of the basic principles of the Q-coder adaptive binary arithmetic coder. IBM Journal of Research and Development, 32(6):717-726, 1988.

[8] J. Rissanen and G. G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23(2):149-162, 1979.

[9] S. Rusinkiewicz and M. Levoy. QSplat: A multiresolution point rendering system for large meshes. In SIGGRAPH 2000 Conference Proceedings, ACM SIGGRAPH Annual Conference Series, pages 343-352, 2000.

[10] S. Rusinkiewicz and M. Levoy. Streaming QSplat: A viewer for networked visualization of large, dense models. In Proceedings of the 2001 Symposium on Interactive 3D Graphics, pages 63-68. ACM, 2001.

[11] A. Said and W. A. Pearlman. A new fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology, 6:243-250, June 1996.

[12] J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41:3445-3462, December 1993.

[13] S. Vedula, S. Baker, and T. Kanade. Spatio-temporal view interpolation. In Proceedings of the 13th ACM Eurographics Workshop on Rendering, June 2002.

[14] S. Wuermlin, E. Lamboray, and M. Gross. 3D video fragments: Dynamic point samples for real-time free-viewpoint video. Computers & Graphics, Special Issue on Coding, Compression and Streaming Techniques for 3D and Multimedia Data, 28(1), 2004.

Figure 10: Average coding illustrated with a texture example: a) Key frame of 12 kB. b) Masked key frame (PSNR = 29.3 dB). c) Key frame plus 100 B of delta frame (PSNR = 31.4 dB). d) Key frame plus 300 B of delta frame (PSNR = 33.7 dB). e) Key frame plus 4.3 kB of delta frame (PSNR = 42.8 dB). f) Original frame.

Figure 11: Example images at various bit rates from an MPEG test sequence. The virtual viewpoint lies in between three reconstruction cameras. [Panels at 350 kbps, 560 kbps, 740 kbps, 920 kbps and 10200 kbps (uncompressed).]