
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. XX, NO. XX, JANUARY/FEBRUARY 2007 (TO APPEAR)

Low-Cost Telepresence for Collaborative Virtual Environments

Seon-Min Rhee, Remo Ziegler, Member, IEEE, Jiyoung Park, Martin Naef, Member, IEEE, Markus Gross, Senior Member, IEEE, and Myoung-Hee Kim, Member, IEEE

Abstract—We present a novel low-cost method for visual communication and telepresence in a CAVE(TM)-like environment, relying on 2D stereo-based video avatars. The system combines a selection of proven, efficient algorithms and approximations in a unique way, resulting in a convincing stereoscopic real-time representation of a remote user acquired in a spatially immersive display. The system was designed to extend existing projection systems with acquisition capabilities requiring minimal hardware modifications and cost.

The system uses infrared-based image segmentation to enable concurrent acquisition and projection in an immersive environment without a static background. It consists of two color cameras and two additional b/w cameras used for segmentation in the near-IR spectrum. There is no need for special optics, as the mask and color image are merged using image warping based on a depth estimation. The resulting stereo image stream is compressed, streamed across a network, and displayed as a frame-sequential stereo texture on a billboard in the remote virtual environment.

Index Terms—Remote Systems, Segmentation, Stereo, Virtual Reality, Teleconferencing.

I. INTRODUCTION

THE introduction of telepresence features into collaborative applications promises improved productivity as well as significantly higher user acceptance of long-distance collaboration. In addition to a voice connection, rendering a high-quality image of the remote user in real time provides a feeling of co-presence [?]. In particular, applications in immersive virtual environments are prime candidates, as integrating the user representation into the shared virtual world provides greater social proximity than a separate video window. Practical implementations of telepresence systems in immersive environments, however, pose severe difficulties: the lighting requirements for video acquisition contradict the needs of high-quality projection. Strong lighting and uniform backgrounds would ease acquisition and segmentation of the user, yet for projection the user has to stand in a dark room surrounded by projection surfaces, such as in a CAVE(TM) [?].

Numerous approaches have been tried to solve the problem, ranging from complex and expensive systems such as the blue-c [?] to simpler systems using stereo cameras and IR lights in combination with beam splitters [?]. However, very high cost, the need for a static background, long processing times, or difficult calibration are limiting factors of previous systems. The work presented in this paper demonstrates a way of extending existing spatially immersive displays with off-the-shelf components. The robustness of the segmentation and the simplicity of the image-warping approach lead to a real-time system with a high-quality user representation while keeping the costs low.

S.-M. Rhee, J. Park, and M.-H. Kim are with the Visual Computing and Virtual Reality Laboratory, Ewha Womans University, Seoul, Korea. Email: {blue, lemie}@ewhain.net, [email protected]
R. Ziegler and M. Gross are with the Computer Graphics Laboratory, ETH Zurich, Switzerland. Email: {zieglerr, grossm}@inf.ethz.ch
M. Naef is with the Digital Design Studio, The Glasgow School of Art, UK. Email: [email protected]

Fig. 1. An example of a running collaboration with a snapshot of both sides.


A. Contributions

We present a novel system for concurrent projection and image acquisition inside a spatially immersive display. It provides a low-cost solution based on commodity hardware by combining segmentation in the near-infrared spectrum, image warping to match the IR and color information from different cameras, and depth extraction through stereo matching to calculate disparities. The robustness of the segmentation and the simplicity of the image-warping approach lead to a high-quality user representation.

The main contributions of this work can be summarized as follows:

• We introduce a practical implementation for image segmentation in the near-infrared spectrum, which works in a standard CAVE(TM)-like environment with active, concurrent projection.

• An image-warping algorithm and a stereo depth detection method enable us to combine the mask from the near-IR segmentation with a color image, relaxing the strict alignment requirements of traditional approaches employing beam splitters.

• Since no interpolation between different views is necessary when using a stereo image stream, we maintain a very high texture quality.

• The acquisition and transmission system is integrated into an existing collaborative virtual reality toolkit.

The remainder of this paper is organized as follows. In Section II we discuss different approaches in related work. Section III gives a system overview of hardware and software components. Section IV describes experiments, including IR lamp positioning and reflection property characterization, that yield good infrared reflective images for a robust user silhouette. Section V explains the silhouette fitting methods used to generate stereo images for a stereo video avatar, followed by a description of the integration of this avatar into the virtual world in Section VI. Finally, we present results including a performance analysis in Section VII and conclude in Section VIII.

II. RELATED WORK

In this section, we briefly review some of the existing methods for acquisition and reconstruction of 3D shapes in general, as well as the acquisition of different representations for supporting telepresence in Collaborative Virtual Environments (CVEs).

A. Fast Acquisition and Reconstruction

The main problem when acquiring and reconstructing 3D objects for collaborative environments is to fulfill real-time constraints. In 1994 Laurentini [?] presented the 'visual hull', which is created from several silhouette images. Based on the idea of visual hulls, Matusik et al. presented two very fast methods: an image-based visual hull (IBVH) taking advantage of epipolar geometry [?], and a polyhedral visual hull constructing a triangular surface representation [?]. Infrared-based structured light [?] has also been applied successfully to reconstruct 3D objects in a short amount of time. Narayanan et al. [?] combined the depth map and intensity image of each camera view at each time instant to form a Visible Surface Model. In 1998, Pollard and Hayes [?] rendered real scenes from new viewpoints using depth map representations by morphing live video streams.

B. Simultaneous Acquisition and Displaying

One of the key issues of these methods is the segmentation of foreground and background, which has been the topic of many papers to date. Telepresence can be provided by a video avatar, which is generated from live video streams of single or multiple cameras. The TELEPORT system presented in [?] uses 2D video avatars generated by a delta-keying technique. Raskar et al. [?] calculated depth information using a structured light technique in the Office of the Future. In the past few years, video avatar generation techniques based on multiple cameras (such as the National Tele-Immersion Initiative [?]) have been developed. K. Tamagawa et al. [?], [?] developed a 2.5D video avatar consisting of a triangulated and textured depth image. Using a Triclops Color Stereo Vision system made by Point Grey Research Inc., they apply a depth segmentation algorithm and use blue screens to improve the silhouettes by chroma keying, which is not feasible in a running CVE system. Ogi et al. [?] take the segmentation by stereo cameras and blue screening a step further and generate video avatars as plane, depth, and voxel models according to their embedding needs.


A very similar approach to segmentation has been chosen by Suh et al. [?], but the constraint of a constant background color is very restrictive.

V. Rajan et al. [?] reconstruct a view-dependent textured head model using an additional tracker device with complementary background subtraction, while approximating the body by a bounding box. In [?] researchers from INRIA apply a real-time reconstruction technique based on the visual hull algorithm developed at MIT, assuming a static background.

Our work has been inspired by the blue-c system [?]. This system exploited and extended the time-slicing concept of active stereo projection to control the lighting and shut down the projection within a short time interval, enabling traditional segmentation methods. In contrast to our approach, the blue-c acquires and transmits a dynamic 3D video object [?]. It requires significantly higher processing power and custom-built hardware to control the lighting and projection system.

Thermal keying and IR segmentation: The systems mentioned above all control the background of the user to be acquired in the visible spectrum. A very different technique presented by Yasuda et al. [?] introduced a thermal vision camera in order to segment the human region based on the natural heat emission of the body. However, the system fails to segment cold objects held by the user. An IR segmentation approach using a beam splitter was presented in the Lightstage system [?] and more recently in [?]. Instead of choosing an IR-reflective material to bounce the light back, infrared-translucent material is lit from behind in [?]. In the system presented in [?], a special retroreflective background augmenting reflections of IR light is chosen.

Instead of requiring the precise mechanical alignment needed when using beam splitters, we align the IR and color images using an image-warping step, which is less error-prone. It also works well in a typical running CAVE(TM) environment without specially treated screen material. Furthermore, we avoid complicated 3D geometry while preserving a sensation of depth by introducing a stereo billboard.

III. SYSTEM OVERVIEW

Stereoscopic displays play an important role in virtual reality by increasing the feeling of immersion.


Fig. 2. System overview: the processing pipeline for the creation of stereo video avatars based on IR segmentation.

Instead of acquiring 3D geometry from the user, which would then be rendered separately for both eyes onto the stereoscopic display, we directly capture images from two texture cameras placed at a distance roughly corresponding to the binocular disparity.

In order to embed the captured collaborating party in the virtual world, a segmentation of the texture images is necessary. We therefore use two camera pairs to generate a segmented, stereoscopic image sequence. Each pair consists of a grayscale camera with an attached IR bandpass filter and a color camera with an IR cutoff filter. We extract the user silhouette mask using the grayscale camera in the near-infrared spectrum. (The silhouette mask will be referred to as 'silhouette' throughout the remainder of the paper.) The silhouette can then be used as a mask on the texture image, which is captured by the color camera, referred to as the 'texture camera'.

In the following Sections III-A and III-B we describe the software and hardware components in more detail.

A. Software Components

Fig. 2 depicts the pipeline used to generate the segmented stereo image stream supporting telepresence.


The sequential nature of the algorithm is reflected in the software components. Silhouettes of the user are found for each eye by comparing the current IR image to a previously learned background image. The silhouettes are then fitted to the texture information in order to mask out the user from the background. Finally, the texture image is warped by simulating a backprojection onto the billboard.

The resulting images are compressed and transmitted to the collaborating site. There, the images are mapped onto a billboard placed into the virtual scene. Each step of the pipeline is explained in more detail in Sections IV to VII.

B. Hardware Components

The four-sided CAVE(TM)-like environment used for our experiments consists of a front, left, right, and floor part, all measuring 2.4 m x 2.4 m, and four CRT projectors (BARCO-GRAPHICS 808s). Four cameras, four IR lights, and two ambient lights are arranged as depicted in Fig. 3.

The four Dragonfly IEEE-1394 cameras from Point Grey Research Inc. are placed on top of the front screen. Of these, two are texture cameras mounted in the center, and two are silhouette cameras affixed on either side of the texture cameras. All four cameras are attached to a single computer to simplify synchronization. Mounting the cameras on top of the front screen avoids occlusion of the projection. Four 35 W IR lamps with a light spectrum peaking at 730 nm are carefully positioned to illuminate the user. Two of them are mounted on top, next to the cameras, pointing roughly in the same direction as the cameras. Additional IR lights are placed in the left and right front corners on the floor. They point towards the middle of the front screen to diffuse the light and illuminate the user indirectly from below. The advantages of this setup are discussed in detail in Section IV-A.

The projection environment has to be rather dark since we use CRT projectors. We add two lights in the visible spectrum to illuminate the user and obtain a better texture quality. These lights must have a fairly narrow light cone to avoid illuminating the projection screen. The spotlights are mounted above the cameras on either side of the IR lights.

Fig. 3. Sketched and photographic view of the hardware setup, showing 4 IR lamps, 4 cameras, and 2 ambient lights. (A) Scaled view of one of the four IR lights. (B) Scaled view of the specific arrangement of the components.

IV. EXTRACT USER SILHOUETTE

To deal with the dynamic lighting environment caused by concurrent projection, we use infrared reflective images as the input to a traditional background subtraction algorithm. The near-infrared illumination in our environment is static, since the projectors only emit light in the visible spectrum and no other dynamic sources are present.

The physical setup of our CAVE(TM)-like environment does not allow uniform illumination of the background without illuminating the person as well. If we restrict the IR illumination to the back wall, we would only get half of the silhouette because the camera points down on the user (see Fig. 4).

Fig. 4. Influence of IR back lighting vs. front lighting. (a) Simulated mask and segmented image when using IR back lighting. (b) Generated mask and segmented image when using IR front lighting.


We therefore illuminate the person from the front and use a background subtraction method optimized for infrared light based on [?].

Near-IR images contain only a single intensity channel. Careful placement of the light sources is therefore critical in order to achieve a robust segmentation, as compared to color image segmentation, which typically relies on the hue and saturation channels as well. Furthermore, choosing the right materials for clothes and the background is important. A short analysis of typical, relevant materials is included in the following sections. Since there is no interference between the IR light and the texture cameras, thanks to their integral IR cut-off filters, we can do the analysis independently of the projected content.
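To make the IR-based segmentation step concrete, the following sketch (Python with OpenCV and NumPy) shows one way a learned IR background image can be subtracted from the live IR frame to obtain a silhouette mask. The function names, the averaging of several background frames, and the threshold value are illustrative assumptions, not details given in the paper.

    import cv2
    import numpy as np

    def learn_background(ir_frames):
        # Average a few near-IR frames of the empty environment to obtain
        # the static background reference used for subtraction.
        return np.mean([f.astype(np.float32) for f in ir_frames], axis=0)

    def extract_silhouette(ir_frame, background, threshold=25.0):
        # The IR-lit user is brighter than the learned background, so a
        # simple thresholded difference yields the silhouette; a small
        # morphological opening removes isolated noise pixels.
        diff = ir_frame.astype(np.float32) - background
        mask = (diff > threshold).astype(np.uint8) * 255
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)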

A. Infrared Lamp Positioning

The lights are positioned as close as possible to the IR cameras to minimize shadows and self-shadowing artifacts.

In our setup we use one IR light above each IR camera and one IR light in each front corner of the CAVE(TM)-like environment (Fig. 3). The two lights on the floor illuminate the legs indirectly by bouncing the light off the front wall (see Fig. 5a), while the two lights above additionally illuminate the upper part of the body (see Fig. 5b). Fig. 5c shows the result when using all the lights simultaneously.

In order to get a good silhouette we have to maximize the intensity difference ∆I between the foreground captured at runtime and the initial background. In addition to careful positioning of the IR lights, we benefit from the intensity fall-off with the square of the distance. Fig. 6 illustrates the rapid growth of the intensity difference when increasing the distance d2 between foreground and background.

Fig. 5. Segmentation depending on different setups of the IR lights. (a) IR lights only on the floor. (b) IR lights only on top of the front wall. (c) Combination of the two previous setups.

Fig. 6. Intensity difference depending on the distance from the IR light source to the object. (a) Illustration of the possible distances d1 and d2. (b) Graph showing the intensity difference ∆I [lux] depending on d1 [m] and d2 [m].

Assuming a Lambertian surface for the background, surface orientation did not cause problems in our experiments, and specular highlights on the foreground improve segmentation further. However, specularity can create problems with hair under certain circumstances, as discussed in Section IV-B.3.

B. Material, Skin and Hair

Most CAVE(TM)-like environments are open towards the back. By hanging a curtain with low IR reflection as a back wall, unwanted reflections caused by common lab material can be removed.

The foreground material mostly consists of cloth, human skin, and hair. The cloth can of course be picked for specific reflection properties, whereas skin and hair impose fixed constraints.

1) Characterize Cloth: IR reflection is not related to visible colors, but to the combination of dye and fabric (see Fig. 7a,b,c). Because the IR reflectance is difficult to characterize analytically, we built a simple setup to measure the reflection properties of different materials (Fig. 7d).

Fig. 7. IR reflection properties of different materials. (a) Part of a sweater consisting of a single material, but colored with different dyes. (b) Texture image of different color patches of the same material (c) and their IR reflection. (d) System used to scan material.


Using this simple scanning device, we can quickly tell whether the clothes worn lead to good silhouettes. We capture an IR image of the screen material and of the background curtain at a distance d_c from the IR light and compare them to the intensity of the material captured at the same distance d_c. The conservative approach is to check whether the difference between the foreground intensity I_F (0 ≤ I_F ≤ 1) and the background intensity I_B (0 ≤ I_B ≤ 1) exceeds a threshold T, see (1).

T < I_F − I_B    (1)

More cloth can be accepted if we take the intensity fall-off into account. We can consider the worst case for the segmentation by calculating the difference between the intensity of the foreground at its maximum distance d_fmax from the light and the intensity of the background at its minimum distance d_bmin from the light, where d_fmax < d_bmin, see (2).

T < I_F · d_c² / d_fmax² − I_B · d_c² / d_bmin²    (2)
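As a worked illustration of (2), a check of whether a given fabric is safe to wear might look as follows. This is a minimal sketch; the variable names mirror the symbols above and the numeric values are invented, not measurements from the paper.

    def cloth_is_segmentable(I_F, I_B, d_c, d_fmax, d_bmin, T):
        # Worst case of Eq. (2): foreground intensity scaled to its maximum
        # distance from the light vs. background intensity scaled to its
        # minimum distance; both were measured at the reference distance d_c.
        worst_foreground = I_F * d_c**2 / d_fmax**2
        brightest_background = I_B * d_c**2 / d_bmin**2
        return worst_foreground - brightest_background > T

    # Example with invented values: intensities measured at d_c = 1.5 m,
    # user at most 2.0 m and screen at least 2.4 m from the light.
    print(cloth_is_segmentable(0.6, 0.2, 1.5, 2.0, 2.4, 0.1))   # True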

After scanning different items of clothing and various other materials, we found the majority to be reflective enough to be segmented correctly by our algorithm. Most people will therefore be able to use our system without changing clothes. We provide a jacket for those rare cases where the user wears non-reflective clothing.

2) Characterize Skin: If the IR reflection of skin is measured at the same distance and under the same illumination as the background material, the segmentation does not perform well, as can be seen in Fig. 8a. During collaboration, however, the user stands at a certain distance from the background material, which leads to a reliable segmentation of skin. In Fig. 8b the hand is at 1 meter from the IR light source, while the background is captured at 1.8 meters. The scaled difference of the skin and the background shows that skin can be segmented reliably with a projection wall as backdrop.

3) Characterize Hair: Hair is the most difficult part to recognize as foreground, as it does not reflect near-IR light well. The reflection properties of different hair colors are very similar, as shown in Fig. 9, with blonde hair reflecting slightly better than black hair. Only a slight enhancement can be achieved by applying hair spray or gel, as these often enhance only the specular component, which is already strong. We therefore darken the background around the head region instead of trying to brighten up the hair. Using a non-IR-reflective curtain as backdrop leads to good hair segmentation results even for black hair (Fig. 9).

Fig. 8. Segmentation of skin, showing hand, background, difference, and scaled difference. (a) Hand and background at the same distance from the IR light source. (b) Hand at 1 meter and background at 1.8 meters from the IR light source.

Fig. 9. Different hair and skin colors captured with a color and an IR camera.

V. STEREO IMAGE GENERATION

Each silhouette is fitted to the texture image using an estimated disparity before being applied as a mask, leading to a segmented texture image. After masking, we generate projective texturing images to reduce foreshortening.

A. Disparity Estimation

We define the centroid of the user in each silhouette image as a pair of corresponding points. Knowing the two centroids, we can estimate the depth of the 3D point corresponding to the centroid using the triangulation algorithm described in [?]. After rectification using [?], all epipolar lines become parallel to the image rows (Fig. 10), which reduces the search space for corresponding points between the silhouette and the texture image to one dimension.

By approximating the person by a plane and assuming an upright position, the varying depth can be approximated linearly.


Fig. 10. Epipolar lines for color and texture images before rectification (a,b) and after rectification (c,d).

Consequently, we can calculate the disparity between silhouette and texture images with (3), where b is the length of the baseline between the two cameras, f is the focal length, and z_h is the estimated depth of the user, increasing from head to feet.

d_h = b · f / z_h    (3)

By calculating the disparity for each image pair independently over time, we observe jittering artifacts due to differences in the silhouette extraction, which lead to varying depth estimates. In order to minimize these jittering artifacts, while still keeping a good texture fit when the captured person is moving, we apply a Kalman filter to the estimated 3D position.
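A minimal sketch of this disparity estimation and temporal smoothing is given below, assuming rectified images; the scalar filter stands in for the Kalman filter on the 3D position mentioned above and is not the authors' implementation, and all parameter values are invented.

    import numpy as np

    def silhouette_centroid(mask):
        # Centroid of the binary silhouette, used as the corresponding point.
        ys, xs = np.nonzero(mask)
        return float(xs.mean()), float(ys.mean())

    def depth_from_centroids(cx_left, cx_right, baseline, focal_px):
        # Triangulate the user depth from the horizontal centroid offset
        # (rectified images assumed, so the offset is a plain disparity).
        disparity = abs(cx_left - cx_right)
        return baseline * focal_px / disparity

    class ScalarKalman:
        # Constant-position Kalman filter to damp frame-to-frame jitter of
        # the estimated depth caused by varying silhouette extraction.
        def __init__(self, process_noise=1e-3, measurement_noise=1e-2):
            self.x = None
            self.p = 1.0
            self.q = process_noise
            self.r = measurement_noise

        def update(self, z):
            if self.x is None:
                self.x = z
                return self.x
            self.p += self.q                 # predict
            k = self.p / (self.p + self.r)   # Kalman gain
            self.x += k * (z - self.x)       # correct
            self.p *= 1.0 - k
            return self.x

    def fitting_disparity(baseline, focal_px, z_h):
        # Eq. (3): disparity applied when fitting the silhouette to the
        # texture image at the estimated depth z_h.
        return baseline * focal_px / z_h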

B. Masking

The silhouette image is rectified by the homography H_b, translated by T depending on the disparity d_h, and transformed by the inverse homography H_c⁻¹, resulting in a mask for the texture image (see Fig. 11). H_c is the homography that would be needed to rectify the texture image.

For this to be a good approximation we make the following assumptions: both cameras lie close to each other; the user is approximated by a plane; and there are no significant distortion differences between the cameras. Applying the disparity correction to the silhouette leaves the implicitly captured depth cues of the texture cameras unchanged.
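Under these assumptions, the silhouette fitting of Fig. 11 amounts to composing three 3x3 transforms and warping the mask once. The sketch below (Python/OpenCV) illustrates this; the sign convention of the disparity shift and the nearest-neighbour interpolation are assumptions rather than details from the paper.

    import numpy as np
    import cv2

    def fit_silhouette(mask, H_b, H_c, d_h):
        # Rectify the silhouette (H_b), shift it along the rectified rows by
        # the estimated disparity d_h (T), and map it back into the
        # un-rectified texture image (inverse of H_c).
        T = np.array([[1.0, 0.0, d_h],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
        M = np.linalg.inv(H_c) @ T @ H_b
        h, w = mask.shape[:2]
        return cv2.warpPerspective(mask, M, (w, h), flags=cv2.INTER_NEAREST)

    def mask_texture(texture, fitted_mask):
        # Keep only the foreground pixels of the colour image.
        return cv2.bitwise_and(texture, texture, mask=fitted_mask)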

C. Projective Texturing

Due to the positioning of the capturing cameras above the front wall, we obtain images at an oblique angle, causing foreshortening. Repositioning the cameras at eye level is impossible due to occlusion of the screen. Applying projective texturing onto a stereo video billboard allows rendering the captured user along a vertical parallax, and therefore from eye level, while minimizing foreshortening effects (see Fig. 12b).
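Because both the billboard and the image plane are planar, the reprojection of the oblique capture onto the billboard reduces to a plane-induced homography. The following CPU sketch illustrates that mapping under this assumption; it is not the authors' rendering code (which applies projective texturing during rendering), and the corner coordinates are placeholders.

    import numpy as np
    import cv2

    def reproject_onto_billboard(masked_texture, corners_in_image, out_size):
        # corners_in_image: pixel positions of the four billboard corners as
        # seen by the oblique capturing camera (top-left, top-right,
        # bottom-right, bottom-left). out_size: (width, height) of the
        # fronto-parallel billboard texture.
        w, h = out_size
        dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
        H = cv2.getPerspectiveTransform(np.float32(corners_in_image), dst)
        return cv2.warpPerspective(masked_texture, H, (w, h))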

Fig. 11. Silhouette fitting process: (a) silhouette mask, (b) rectified silhouette mask, (c) translated rectified silhouette mask, (d) silhouette fitting, (e) masked texture. P_o(x,y,1) is a pixel position in the silhouette mask, P_t(x,y,1) the corresponding pixel position in the texture image, and H_b, H_c are the homographies rectifying the silhouette and texture images, respectively. The fitting chains P_r = H_b · P_o (rectification), P_tr = T · P_r (disparity shift), and P_t = H_c⁻¹ · P_tr, i.e. P_t = H_c⁻¹ · T · H_b · P_o.

Fig. 12. Projective texturing: (a) before, (b) after.

Simplifying the geometry of the user by a plane leads to some distortion during reprojection. However, it is restricted to areas where the distance between the true 3D geometry of the user and the plane varies strongly. We call this distance z_P (see Fig. 13a). The resulting distortion y_P is given as

y_P = y_C · z_P / z_C,    (4)

with z_C being the distance in the z-direction from the camera center to a point P_i on the user, and y_C being the distance between the capturing camera C_Cap and the rendering camera C_Ren. The introduced error is limited to the vertical parallax; it has no influence on the horizontal parallax and no influence on the disparity values, as shown in Section V-D.


Therefore, projective texturing is a valid solution for minimizing foreshortening.
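For illustration, with invented values rather than measurements from the paper: if the capturing camera is mounted y_C = 1 m above the rendering viewpoint and a point on the user lies z_P = 0.2 m in front of the billboard plane at a camera distance of z_C = 2 m, the vertical displacement is y_P = 1 · 0.2 / 2 = 0.1 m, while points on the billboard plane itself (z_P = 0) are reproduced without error.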

D. Disparity Analysis

Omitting detailed 3D geometry and approximating the user by a plane requires a thorough analysis of the final disparity visible to the user in the CAVE(TM)-like setup. Since there are only two cameras and no detailed geometry reconstruction, the representation can only be rendered correctly from a vertical parallax that includes the position of the capturing camera C_Cap. In order to assure the quality of the disparity through the whole pipeline for this viewing range, the steps potentially affecting it are analyzed in more detail. Masking of the textures, projective texturing, and finally rendering the stereo billboard in the virtual scene are the three critical steps.

Since masking the textures using the silhouette images does not modify the disparities but only sets parts of the texture image to transparent, the correctness of the disparities is trivially preserved.

Next, the influence of the projective texturing step on the disparity is analyzed. Points of the user lying on the same plane as the billboard (e.g. P_1 in Fig. 13a,b) are reprojected to their original place in 3D space. This leads to the correct disparity independent of the camera distance to the billboard. For points lying a distance z_P away from the billboard (e.g. P_2 in Fig. 13a,b), the reprojections of the left and right eye images lead to two distinct points. The intersection of the rays from the camera centers of C_Left and C_Right to the reprojected points lies at a distance z_P from the billboard as long as the cameras lie at z = z_P + z_C from the billboard. Note that the height of the camera position does not influence z_P but only the distortion y_P.

We assume the same baseline b and the same focal length f for the rendering cameras C_Ren as for the capturing cameras C_Cap. Furthermore, the distance x_P between corresponding points on the billboard (see Fig. 13c,d) is independent of the movement of the local user during rendering, leading to

x_P / z_P = b / z_C   and   x_P / z'_P = b / z'_C    (5)

Therefore, the perceived depth z_P of every point of the remote user changes proportionally to z, where z = z_P + z_C.
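As an illustration of (5) with invented values: with a baseline b = 6.5 cm and a capturing distance z_C = 1.4 m, a point z_P = 0.2 m in front of the billboard produces corresponding points x_P = b · z_P / z_C ≈ 0.93 cm apart on the billboard; a local viewer at z'_C = 2.8 m then perceives that point at z'_P = x_P · z'_C / b ≈ 0.4 m, i.e. the perceived depth scales with the viewing distance as stated.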

Consequently, the remote user appears correct if the distance z_Cap from the remote user to the capturing cameras and the distance z_Ren from the local user to the billboard are the same. We performed an informal system evaluation with 16 people, of whom 4 were experts with stereoscopic displays and 12 were non-experts. The collaborating person was captured at a distance of 1.4 m. The users were put into a CAVE(TM) system with one mono billboard and one stereo billboard representing the same collaborator. They were asked to choose the best range to see the correct 3D appearance, the distance which was acceptable for a 3D representation, and the distance from which they could not distinguish between the mono billboard and the stereo billboard. Results are shown in Table I. Some of the users mentioned that the perceived depth does not continuously increase when standing further and further away, but rather converges to the same depth perception as for the mono billboard. This is similar to viewing a real 3D object from far away, where parallax information becomes a weaker depth cue than self-shadowing or occlusion. If a user stands closer than 90 cm to the front wall, not only is the depth cue given through disparity almost lost, but the user is also badly captured by the cameras. Therefore this position would make little sense during a collaboration.

This informal evaluation confirms the introduction of an additional depth cue by using the stereo billboard.

Fig. 13. Different drafts of the setup configuration are depicted (a) from the side and (b) from the top, considering reprojection onto the green billboard; the labels comprise the capturing camera C_Cap, the rendering camera C_Ren, the eye cameras C_Left and C_Right, the points P_1, P_2, P_3, and the distances z_C, z_P, y_C, y_P, b, and x_P. In (c) and (d) a rendering setup for two different viewing positions is shown.


It would require detailed user studies to elaborate which aspects of the stereo billboard are most advantageous for co-presence. This will be the subject of future work.

TABLE I
INFORMAL SYSTEM EVALUATION: 3D APPEARANCE [CM]

                Best range   Acceptable distance   No difference
Experts         132-147      92-225                79
Non-experts     141-167      110-205               102

VI. INTEGRATION

The generated stereo images of the video avatar are transmitted to the collaborating party to be integrated into the virtual world. In order to send the images in real time, we apply JPEG compression and transfer them as packets with additional header information using the TCP/IP protocol. The resulting stereo image is rendered onto a billboard representing the user inside the virtual world. The basic features required for rendering a video stream are provided by the blue-c API, namely its 2D video service [?]. The original 2D video service was enhanced with stereo capabilities to render different images for both eyes. If the user moves relative to the billboard, the billboard is rotated so that it stays perpendicular to the line between the camera and the billboard center.

Blending the user seamlessly into the virtual environment requires an alpha channel to stencil out the background. In order to avoid the overhead of sending an additional image mask for transparency, we define a color range that appears transparent. As opposed to defining a single matte color, this approach permits the encoding of different alpha values. It has proven robust enough against JPEG compression artifacts for our purposes.
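A minimal sketch of such a colour-range key on the receiving side is shown below (Python/NumPy); the key colour, the tolerance, and the hard 0/255 alpha are illustrative simplifications of the encoding of different alpha values described above, not the paper's actual parameters.

    import numpy as np

    def apply_color_key(rgb, key=(0, 255, 0), tolerance=8):
        # Pixels whose colour lies within `tolerance` of the key colour are
        # made fully transparent; all other pixels stay opaque. A wider band
        # around the key could be mapped to intermediate alpha values.
        diff = np.abs(rgb.astype(np.int16) - np.array(key, dtype=np.int16))
        distance = diff.max(axis=-1)
        alpha = np.where(distance <= tolerance, 0, 255).astype(np.uint8)
        return np.dstack([rgb, alpha])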

VII. RESULTS & APPLICATIONS

In this section, we first show a performance analysis for the generation of video avatars and their transmission over the network, then discuss the texture quality, and finally present results and possible applications.

A. Performance

For the performance analysis we consider two parts: the processing time of the captured images and the transmission of those images over the network. For a performance analysis of the rendering part we refer to [?]. By the term 'frame' we in fact refer to two images, one per eye.

The processing time of the captured images includes all steps from acquisition to projective texture generation; see Table II. Parallel processing and multi-computer setups have not yet been exploited, even though the system would lend itself naturally to such a performance optimization.

TABLE II
PROCESSING TIME [S] PER STEREO IMAGE FOR DIFFERENT RESOLUTIONS

Resolution                      640x480   320x240
Acquisition                     0.015     0.014
Background subtraction          0.059     0.013
Masking                         0.047     0.012
Projective texture generation   0.047     0.011
JPEG compression                0.055     0.001
Sum                             0.223     0.051

Processed images have to be transmitted over a network to the other collaborating party, where bandwidth limitations can cause a bottleneck. During our experiments, we sent video avatars generated at ETH Zurich in Switzerland over an intercontinental link to a CAVE(TM)-like system at Ewha Womans University in Korea. We measured a total available network bandwidth of up to 161.25 KB/s, leading to a transmission time of 0.094 s for a 640x480 frame. Since the processing time is 2.4 times longer, the bottleneck does not lie in the transmission of the data.
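For reference, 0.094 s at 161.25 KB/s corresponds to roughly 15 KB per compressed 640x480 stereo frame.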

images in parallel lets us update video avatars in thevirtual world in realtime (19fps for 320x240). Whenincreasing the resolution to 640x480 the frameratedrops to 4.5fps.

B. Texture Quality

The visual quality of a fixed-viewpoint stereo video billboard depends on the texture quality and on the correct disparity treatment, due to the lack of a precise geometry.


Two steps in the pipeline can decrease the texture quality, namely projective texturing and JPEG compression. Projective texturing can introduce distortion depending on the variance of the distance from the true 3D geometry of the user to the billboard, as described in Section V-C. This problem limits the use of our representation for collaborations where exact pointing gestures are required. However, the high-frequency details of the texture are mostly preserved since there is no blending between different viewpoints; therefore a high-quality texture can be generated. In the future, further improvements could be achieved by substituting the JPEG compression with MPEG-4.

C. Visual Results and Applications

Our acquisition of a stereo video avatar is very fast since no stereo algorithm, visual hull, or any other geometric reconstruction is necessary. Despite our simple representation, collaborators can be perceived in 3D under the constraints mentioned in Section V-D, thanks to the stereo video images on the one hand and the high texture quality maintaining additional visual depth cues (e.g. self-shadowing) on the other. Additionally, the captured texture quality facilitates non-verbal communication cues. The system can be integrated into an up to six-sided CAVE(TM)-like environment without encountering problems with real-time segmentation. Therefore the representation can be seamlessly blended into a virtual collaboration environment.

In Fig. 14 and Fig. 15 we integrated our system into a chess application from [?] for test purposes and experienced real-time collaboration over an intercontinental link. In the future, applications like tele-conferencing, virtual architecture guidance, collaborative planning, and even collaborative games can be envisaged. This light-weight and cheap extension to virtual environments, delivering high texture quality at fast frame rates, is well suited for many collaborative environments even beyond CAVE(TM)-like systems.

VIII. CONCLUSION AND FUTURE WORK

We created a simple and cheap (US$6000) setup based on infrared and texture cameras, which does not rely on calibration-critical optical components such as semi-transparent mirrors, thanks to a texture-warping algorithm that matches the images from different cameras.

Our approach works very well for face-to-face collaboration settings, and more generally for collaborations allowing a fixed viewing range of the collaborating party, since a per-pixel disparity appearing correct to the user (see Section V-D) is equivalent to per-pixel depth information. As soon as the viewpoint of the user changes along a horizontal parallax, our system lacks the geometry or additional cameras to which the viewpoint could be switched. However, we achieved real-time collaboration with high texture quality, accurate depth along the vertical parallax through the captured position, and limited depth accuracy when moving towards or away from the object. Furthermore, our system performs very well even if the user is surrounded by a running CAVE(TM) environment. These criteria appeared to be more important for achieving an increased feeling of co-presence in a face-to-face collaboration than an accurate geometric representation of the user, which is cumbersome and very expensive to achieve in a running immersive environment [?]. In the future, detailed user studies will be necessary to determine which aspects of our representation have the most effect on the feeling of co-presence.

Currently our hardware installation is limited to a single stereo acquisition pair and therefore to a single viewing direction. Adding additional cameras around the user would allow switching between streams for additional viewing directions. The IR-based segmentation mask would also enable us to perform a full 3D object reconstruction, similar to Matusik et al.'s polyhedral visual hull or the blue-c 3D video fragment system. Future work will also include optimization of the IR lighting to increase the robustness of the segmentation. In particular, the IR diffusion properties of projection screens could be further exploited to obtain a uniform dark background, which would make the characterization of hair and cloth redundant. Our work could be extended to multiple users per collaboration environment, as shown in Fig. 15e and Fig. 15f, which would further enhance group collaborations. In order to provide a complete collaboration environment, we will integrate audio capabilities in future versions.

ACKNOWLEDGMENT

The authors would like to thank Tim Weyrich, Miguel A. Otaduy, Daniel Cotting, and Filip Sadlo for constructive discussions influencing our work, and Hyeryung Kwak, Juhaye Kim, and Ji-Hyun Suh for generating the final images.


This work was supported in part by the Korean Ministry of Information and Communication (MIC) under the Information Technology Research Center (ITRC) Program, and in part by the Korea Science and Engineering Foundation (KOSEF) under the International Research Internship Program.

REFERENCES

[1] J. Allard, E. Boyer, J.-S. Franco, C. Menier, and G. Raffin. Marker-less real time 3D modeling for virtual reality. In Immersive Projection Technology, 2004.

[2] L. Chong, H. Kaczmarski, and C. Goudeseune. Video stereoscopic avatars for the CAVE(TM) virtual environment. In Proc. 2004 IEEE International Symposium on Mixed and Augmented Reality. IEEE Computer Society Press, 2004.

[3] C. Cruz-Neira, D. J. Sandin, and T. A. DeFanti. Surround-screen projection-based virtual reality: the design and implementation of the CAVE. In Proc. Twentieth Annual Conf. Computer Graphics and Interactive Techniques, pages 135-142. ACM Press, 1993.

[4] J. W. Davis and A. F. Bobick. Sideshow: A silhouette-based interactive dual-screen environment. Technical Report 457, MIT Media Lab, 1998.

[5] P. Debevec, A. Wenger, C. Tchou, A. Gardner, J. Waese, and T. Hawkins. A lighting reproduction approach to live-action compositing. ACM Trans. Graphics, 21(3):547-556, 2002.

[6] A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1):16-22, 2000.

[7] S. J. Gibbs, C. Arapis, and C. J. Breitender. Teleport - towards immersive copresence. Multimedia Systems, 7(3):214-221, 1999.

[8] M. Gross, S. Wuermlin, M. Naef, E. Lamboray, C. Spagno, A. Kunz, E. K-Meier, T. Svoboda, L. Gool, S. Lang, K. Strehlke, A. V. Moere, and O. Staadt. blue-c: a spatially immersive display and 3D video portal for telepresence. ACM Trans. Graphics, 22(3):819-827, 2003.

[9] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(2):150-162, 1994.

[10] S-Y. Lee, I-J. Kim, S. C. Ahn, H. Ko, M-T. Lim, and H-G. Kim. Real time 3D avatar for interactive mixed reality. In Proc. ACM SIGGRAPH Int'l Conference on Virtual Reality Continuum and its Applications in Industry (VRCAI'04), pages 75-80. ACM Press, 2004.

[11] W. Matusik. Image-based visual hulls. Master's thesis, Massachusetts Institute of Technology, 2001.

[12] W. Matusik, C. Buehler, and L. McMillan. Polyhedral visual hulls for real-time rendering. In Proc. of Twelfth Eurographics Workshop on Rendering, pages 115-125. Springer, 2001.

[13] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan. Image-based visual hulls. In Proc. SIGGRAPH 2000, pages 369-374. ACM Press, 2000.

[14] M. Naef, O. Staadt, and M. Gross. blue-c API: A multimedia and 3D video enhanced toolkit for collaborative VR and telepresence. In Proc. ACM SIGGRAPH Int'l Conference on Virtual Reality Continuum and its Applications in Industry (VRCAI'04), pages 11-18. ACM Press, 2004.

[15] P. J. Narayanan, P. Rander, and T. Kanade. Constructing virtual worlds using dense stereo. In Proc. Int'l Conference on Computer Vision ICCV 98, pages 3-10. IEEE Computer Society Press, 1998.

[16] T. Ogi, T. Yamada, M. Hirose, M. Fujita, and K. Kuzuu. High presence remote presentation in the shared immersive virtual world. In Proc. IEEE Virtual Reality 2003 Conference (VR'03). IEEE Computer Society Press, 2003.

[17] T. Ogi, T. Yamada, Y. Kurita, Y. Hattori, and M. Hirose. Usage of video avatar technology for immersive communication. In Proc. First Int'l Workshop on Language Understanding and Agents for Real World Interaction 2003, 2003.

[18] T. Ogi, T. Yamada, K. Tamagawa, M. Kano, and M. Hirose. Immersive telecommunication using stereo video avatar. In Proc. Virtual Reality 2001 Conference (VR'01). IEEE Computer Society Press, 2001.

[19] S. Pollard and S. Hayes. View synthesis by edge transfer with application to the generation of immersive video objects. In Proc. of the ACM Symposium on Virtual Reality Software and Technology, pages 91-98. ACM Press, 1998.

[20] R. Raskar, G. Welch, M. Cutts, A. Lake, L. Stesin, and H. Fuchs. The office of the future: A unified approach to image-based modeling and spatially immersive displays. In Proc. SIGGRAPH 98, pages 179-188. Addison Wesley, 1998.

[21] A. Sadagic, H. Towles, J. Lanier, H. Fuchs, A. Van Dam, K. Daniilidis, J. Mulligan, L. Holden, and B. Zeleznik. National tele-immersion initiative: Towards compelling tele-immersive collaborative environments. In Proc. Medicine Meets Virtual Reality 2001 Conference, 2001.

[22] M. Slater, A. Sadagic, M. Usoh, and R. Schroeder. Small group behaviour in a virtual and real environment: A comparative study. Presence: Teleoperators and Virtual Environments, 9(1):37-51, 2000.

[23] Y. Suh, D. Hong, and W. Woo. 2.5D video avatar augmentation for VR photo. In Proc. ICAT 2002, 2002.

[24] S. Wuermlin, E. Lamboray, O. Staadt, and M. Gross. 3D video recorder: A system for recording and playing free-viewpoint video. Computer Graphics Forum, 22(2):181-193, 2003.

[25] G. Taubin. Camera model and triangulation. Note for EE-148: 3D Photography, CalTech, 2001.

[26] V. Rajan, S. Subramanian, D. Keenan, A. Johnson, D. Sandin, and T. DeFanti. A realistic video avatar system for networked virtual environments. In Proc. Seventh Annual Immersive Projection Technology Symposium (IPT 2002). ACM Press, 2002.

[27] K. Yasuda, T. Naemura, and H. Harashima. Thermo-key: Human region segmentation from video. IEEE Computer Graphics and Applications, 24(1):26-30, 2004.


Seon-Min Rhee received the BS and MS degrees in computer science and engineering from Ewha Womans University, Seoul, Korea, in 1999 and 2001, respectively. She is currently a PhD student in the Department of Computer Science and Engineering, Ewha Womans University. Her research interests are 3D video and human computer interaction.

Remo Ziegler received a MSc in computer science in 2000 from the Swiss Federal Institute of Technology (ETHZ) in Zurich, Switzerland, and worked during the four following years as a Research Assistant at Mitsubishi Electric Research Laboratories (MERL) in Boston, USA, and at NICTA, a national research laboratory in Sydney, Australia. He recently started his PhD back at the Computer Graphics Laboratory at ETHZ. His main interests lie in 3D scanning, reconstruction, and projective technology. He is a Member of IEEE and ACM.

Jiyoung Park received the BS and MS degrees in computer science and engineering from Ewha Womans University, Seoul, Korea, in 2002 and 2004, respectively. She is currently a PhD student in the Department of Computer Science and Engineering, Ewha Womans University. Her research interests include computer graphics, virtual reality, and medical image processing.

Martin Naef holds a MSc in computer science from the Swiss Federal Institute of Technology, Zurich (ETHZ), and received his PhD from the Computer Graphics Laboratory at ETHZ for his work on the blue-c Application Programming Interface. He is currently working at the Digital Design Studio of the Glasgow School of Art in Scotland. He is a Member of IEEE and ACM.

Markus Gross is a professor of computer science and director of the Computer Graphics Laboratory of the Swiss Federal Institute of Technology (ETH) in Zurich. He received a Master of Science in electrical and computer engineering and a PhD in computer graphics and image analysis, both from the University of Saarbrucken, Germany. From 1990 to 1994, Dr. Gross worked for the Computer Graphics Center in Darmstadt, where he established and directed the Visual Computing Group. His research interests include point-based graphics, physics-based modelling, multiresolution analysis, and virtual reality. He has been widely publishing and lecturing on computer graphics and scientific visualization, and he authored the book "Visual Computing", Springer, 1994. Dr. Gross has taught courses at major graphics conferences including ACM SIGGRAPH, IEEE Visualization, and Eurographics. He is an associate editor of IEEE Computer Graphics and Applications and has served as a member of international program committees of many graphics conferences. Dr. Gross has been a papers co-chair of the IEEE Visualization '99, Eurographics 2000, and IEEE Visualization 2002 conferences. He chaired the papers committee of ACM SIGGRAPH 2005.

Myoung-Hee Kim received the BA degree from Ewha Womans University, Seoul, Korea, in 1974, the MS degree from Seoul National University in 1979, and the PhD degree from Gottingen University, Germany, in 1986. She is currently a professor of computer science and engineering, director of the Visual Computing and Virtual Reality Laboratory, and director of the Center for Computer Graphics and Virtual Reality at Ewha Womans University. She has served as a conference co-chair of Pacific Graphics (2004) and of the Israel-Korea Binational Conference on Geometrical Modeling and Computer Graphics. She is the President of the Korea Computer Graphics Society (KCGS) and the Vice President of the Korea Society for Simulation (KSS). Her research interests include computer graphics, virtual reality, computer simulation, and medical image visual computing.


Fig. 14. For reasons of quality verification all snapshots have been taken in the mono mode of our environment. (a,b) High-quality stereo video avatar composition in a chess application. (c) Real-time mirroring of a user in a running CAVE(TM)-like system. (d) Collaboration example of two users over an intercontinental link.

Fig. 15. A sequence of snapshots from our running collaboration system showing visions of possible future work. (a-c) Direct collaboration between two parties. (d) User mirrored into the virtual world, captured in stereo mode. (e,f) Future collaboration environment, where more than one person per location can join a meeting in a virtual world.