3-D IMAGE PROCESSING IN THE FUTURE OF IMMERSIVE MEDIA

Francesco Isgrò, Emanuele Trucco, Peter Kauff and Oliver Schreer
Abstract— This survey paper discusses the 3D image processing challenges posed by present and future immersive telecommunications, especially immersive video conferencing and television. We introduce the concepts of presence, immersion and co-presence, and discuss their relation with virtual collaborative environments in the context of communications. Several examples are used to illustrate the current state of the art. We highlight the crucial need for real-time, highly realistic video with adaptive viewpoint for future, immersive communications, and identify calibration, multiple view analysis, tracking, and view synthesis as the fundamental image processing modules addressing such need. For each topic, we sketch the basic problem and representative solutions from the image processing literature.

Index Terms— immersive communications, videoconferencing, 3DTV, 3D image processing, computer vision.

I. INTRODUCTION

THIS survey paper discusses the future of immersive telecommunications, especially video conferencing and television, and the 3D image processing techniques needed to support such systems.

The discussion is organized in two parts. Part 1 introduces immersive telecommunication systems through the concepts of presence, immersion and co-presence, and their relation with virtual collaborative environments and shared environments within communication scenarios. We focus on two major applications, immersive video conferencing and immersive television, which have emerged as challenging research areas in the past few years. In both, immersiveness relies mostly on visual experience within a mixed reality scenario, in which participants interact with each other in a half real, half virtual environment. As the virtual imagery is created electronically from real video material, it is necessary to ask which computer vision and image processing techniques will play a major role in supporting the immersive conferencing and television systems of the future.

Part 2 attempts an answer focused on the two key applications identified in Part 1. The key modules required must, crucially, support the dynamic rendering of 3D objects correctly and consistently; 3D image processing is therefore at the heart of immersive communications. Our discussion, limited for reasons of space, identifies calibration, multiple view analysis, tracking, and view synthesis as the fundamental image processing modules that immersive systems must incorporate to achieve immersiveness within mixed-reality scenarios. For each of these topics, we sketch the basic problem and some known, representative solutions from the image processing literature. Further modules, obviously necessary (e.g., figure-background segmentation) but less foundational or characteristic for our applications, are not dealt with here.

Manuscript received January 13, 2003. The work has been partially supported by EU Framework-V grant VIRTUE (IST-1999-10044).

F. Isgrò is with the Dipartimento di Informatica e Scienze dell’Informazione, Università di Genova, via Dodecaneso 35, 16146 Genova, Italy (phone: +39 010 3536609; fax: +39 010 3536699; e-mail: [email protected]).

E. Trucco is with the School of Engineering and Physical Sciences, Electrical, Electronic and Computer Engineering, Heriot-Watt University, EH14 4AS Edinburgh, Scotland (phone: +44 131 4513437; fax: +44 131 4514155; e-mail: [email protected]).

P. Kauff is with the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Einsteinufer 37, D-10587 Berlin, Germany (phone: +49 30 31002-615; fax: +49 30 3927200; e-mail: [email protected]).

O. Schreer is with the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Einsteinufer 37, D-10587 Berlin, Germany (phone: +49 30 31002-620; fax: +49 30 3927200; e-mail: [email protected]).

II. APPLICATIONS, SCENARIOS AND CHALLENGES

A. From Presence to Immersive Telepresence

The idea of immersive media is grounded in two basic concepts, presence and immersion.

The structure of presence has been studied for a long time in the interdisciplinary field of human factors research. Although several aspects are still unclear, it is commonly agreed that the basic meaning of "presence" can be stated as "being virtually there" [1][2]. The different approaches found in the literature can be divided roughly into two main categories, social and physical presence (Fig.1).

Fig. 1. Classification of presence (adapted from [1], [2]). [Figure: diagram grouping media by presence type: social presence (e-mail, online chat, phone, document sharing), physical presence (radio, TV, cinema, flight simulator, virtual reality), and, at their intersection, co-presence (videophone, video conference, shared virtual environments, SVE).]

Social presence refers simply to the feeling of being together with other parties engaged in a communication activity. It does not necessarily aim at a good reproduction of spatial proximity or at high levels of realism. A visualization of the communication situation is not required and sometimes even undesirable. In fact, Internet chats and e-mail, phone calls or conventional paper letters may give us a strong impression of social presence. In contrast, physical presence concerns the sensation of being physically co-located in a mediated space with communication partners; radio, TV and cinema are classical examples. Note that most of these are entertainment services that, although sometimes enjoyed in groups, do not necessarily improve the social aspects of communication.

At the intersection of these two categories, we can identify the quite new area of co-presence systems (see Fig.1). Video conferencing and shared virtual environments are good examples of this intermediate category: by providing a sense of togetherness and an impression of co-location in a shared working space, such systems support social and physical presence simultaneously.

In contrast to the general interpretation of presence formulated from the human factors standpoint, the concept of immersion fits into the technical domain and is clearly linked to the category of physical presence. Immersion concerns concrete technical solutions specifically improving the sense of physical presence in a given application scenario. It is interesting to notice that the roots of immersive systems are not found in telecommunications, but in stand-alone applications like cinema, theme park entertainment, flight simulators and other virtual training systems; a popular example is the transition from conventional cinemas to IMAX theatres. The introduction of immersion in telecommunications and broadcast is a new and exacting challenge for image processing, computer graphics and video coding. Recent advances in computer hardware, networks, and 3D video processing technologies are beginning to supply adequate support for algorithms to meet this challenge. A new type of capability, immersive telepresence, is therefore emerging through real systems. The user of an immersive telepresence system feels part of a virtual or mixed-reality scene within an interactive communication situation, e.g., video conferencing. This feeling is mainly determined by visual cues like high-resolution images, realistic rendering of 3D objects, low latency and delay, motion parallax, seamless combination of artificial and live video contents, but also by acoustic cues like correct and realistic 3D sound.

We concentrate here on two major applications of immersive telepresence. The first is immersive 3D videoconferencing, which allows geographically distributed users to hold a videoconference with a strong sense of physical and social presence. Participants are led to believe they are co-present at a round-table discussion thanks to the realistic, real-time reproduction of real-life communication cues like gestures, gaze directions, body language and eye contact. The second application is immersive television, which can be regarded as a next-generation broadcast technology. The ultimate challenge for immersive broadcast systems is the natural reproduction and rendering of large-scale, real-world 3D scenes and their interactive presentation on suitable displays.

Before we discuss next-generation systems in these two applications (sections C and D), we sketch the main concepts behind collaborative systems and shared environments in general.

B. Collaborative Systems and Shared Environments in Immersive Communications

Collaborative systems allow geographically distributed users to work jointly at the same task. A simple example is document sharing, for which NetMeeting is often used in collaborative teamwork applications. A more advanced approach is collaborative virtual environments (CVE) or shared virtual environments (SVE). These are typical co-presence systems, but not necessarily immersive. The main reason is that they are usually PC-based applications with small displays, which much reduces the scope for real-life, realistic interaction.

Fig. 2. Representation of an office scene for CVE applications (from the German KICK project).

Example applications are given by the European IST project TOWER (Theatre of Work) and the German project KICK (Fig.2). They both provide awareness of collaborative activities among team members and their shared working context through symbolic presentation in a virtual 3D environment [3][4]. The aim is to enhance distributed collaborative work with group awareness and spontaneous communication capabilities very similar to face-to-face working conditions.

The step towards immersive telecollaboration (ITC) occurred with the development of CAVEs and workbenches (Fig.3) [5]. These systems achieve presence by allowing the user to interact naturally with a virtual environment, using head tracking or haptic devices to align the virtual and real worlds. CAVEs and workbenches were originally stand-alone systems, mainly developed for presentation purposes, but the increasing number of CAVE sites, combined with increasingly accessible broadband backbone networks, paved the way for the introduction of telecollaboration.

Fig. 3. CAVE, University College London.

A particularly important field of CVE or SVE in the context of immersive video conferencing is shared virtual table environments (SVTE). The basic idea is to place 3D visual representations of participants, usually graphical avatars, at predefined positions around a virtual table. An example is the European ACTS project COVEN [6], which demonstrated the benefits of the SVTE concept with a networked VR business game application in 1997 (Fig.4) [7].

In more complex systems, the motion and 3D shape of the participants are captured at each terminal by a multiple camera set-up like the ones shown in Fig.5 (VIRTUE system, see Section II.C).

Fig. 4. ACTS project COVEN.

The 3D arrangement of participants around the shared table is ideally isotropic (all participants appear at the same size and are equally spaced around the table), as symmetry suggests equal social importance. Hence, in a three-party conference the participants would form an equilateral triangle, in a four-party conference a square, and so on.
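To make this composition rule concrete, the following minimal sketch (Python; an illustrative helper, not taken from any of the systems described here) computes isotropic seat positions around a virtual round table for an arbitrary number of participants:

import math

def svte_seat_positions(n_participants, table_radius=1.0):
    # Place n participants at equal angular spacing around a virtual round
    # table centred at the origin (isotropic SVTE arrangement). Each entry
    # is (x, y, facing_angle); the facing angle points at the table centre.
    seats = []
    for k in range(n_participants):
        angle = 2.0 * math.pi * k / n_participants   # equal spacing
        x = table_radius * math.cos(angle)
        y = table_radius * math.sin(angle)
        seats.append((x, y, angle + math.pi))        # look towards the centre
    return seats

print(svte_seat_positions(3))   # three participants: an equilateral triangle

Because the rule depends only on the number of participants, the same virtual scene can be composed locally at every terminal.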

Fig. 5. Multi-view capture for the VIRTUE three-party conference system.

Following this composition rule and given the number of participants, the same, appropriate SVTE can be displayed at each terminal of a conference. Individual views of the virtual conference scene can then be rendered at each terminal using a virtual camera (Fig.6) which follows the instantaneous position of the participant's eyes, continuously estimated by a head tracker.

Fig. 6. Rendering of the virtual 3D conference. [Figure: the virtual scene is rendered by a virtual camera whose viewpoint is controlled by head tracking to provide motion parallax.]

Assuming that the real and virtual worlds are correctly calibrated and aligned, the conference scene can be displayed to each participant from the correct viewpoint, even if the participant moves his or her head continuously. The geometric alignment of virtual and real worlds crucially supports the visual presence cues of a SVTE application, e.g., gaze awareness, gesture reproduction and eye contact. In addition, the support of head motion parallax allows participants to change their viewpoint purposively, e.g., to watch the scene from a different perspective or to look behind objects.

As already mentioned, most approaches in this area have been limited to strictly graphical environments and avatars for visualizing remote users. Several such systems have been proposed during the last decade; recently, researchers have begun to integrate video streaming into the virtual scene, for instance incorporating video presentations on virtual screens in the scene, or integrating seamlessly 2D video images or even 3D video avatars into CVEs to increase realism [8][9]. Some of these approaches were driven by the advent of the MPEG-4 multimedia standard and its powerful coding and composition tools. An example is Virtual Meeting Point (VMP), developed by the Fraunhofer Institute for Telecommunications/Heinrich-Hertz-Institut in collaboration with Deutsche Telekom [10]. VMP is a low-bit-rate, MPEG-4-based software application, in which the image of each participant becomes a video object to be pasted in a virtual scene, displayed on each participant’s screen. Further examples are the virtual conference application of the European ACTS project MoMuSys or the IST project SonG [11][12].

C. Immersive Videoconferencing

Effective videoconferencing is an important facility for businesses with geographically distributed operations, and high-speed, computer-based videoconferencing is a potential killer application, gathering research efforts from major market players like VTEL, PictureTel, SONY, Teleportec, VCON and others. Corporate reasons for using video conferencing systems include business globalization or decentralization, increased competition, pressure for higher reactivity and shorter decision-making cycles, an increase in the number of partners, and reduced time and travel costs. According to a proprietary Wainhouse Research study from March 2001, voice conferencing and email are still preferred, in general, to videoconferencing. Key barriers seem to be high unit prices, the limited (perceived) business needs, the cost of ownership, and concerns about integration, lack of training, and user friendliness. One reason is that these systems still offer little support for natural, human-centered communication. Most of them are window-based, multi-party applications where single-user images are presented in separate PC windows, or displayed in full-body size on large video walls, often in combination with other window tools. One example used frequently in the field of desktop applications is NetMeeting (Fig.7). Notice that gestures, expressions and appearance are reproduced literally, i.e., camera images are displayed unprocessed.

Fig. 7. Screen shot of a NetMeeting session.

Further examples can be found in the framework of the US Internet2 consortium, for instance the Access Grid (AG) (Fig. 8), the Virtual Rooms Video Conferencing Service (VRVS), the Virtual Auditorium of Stanford University (Fig.9) or the Global Conference System [13][14][15][16].

Fig. 8. Access Grid.

Fig. 9. Design of a virtual auditorium, Stanford University.

Based on high-speed backbone networks, these systems offer high-quality audio and video equipment with presence capabilities for different applications like teleteaching, teleconferencing and telecollaboration. Nevertheless, realism is lacking as eye contact and realistic viewing conditions are not supported. Other systems are dedicated to two-site conferences and use a point-to-point connection between two user groups. The local group sits at one end of a long conference table placed against a large screen; the table is continued virtually in the screen, and the remote group appears at the other end of the table (in the screen). All participants get the impression of sitting at the same table. Furthermore, as the members of each group sit close together and the virtual viewing distance between the two groups is quite large, eye contact, gaze direction and body language can be reproduced at least approximately. However, as mentioned above, such videoconference table systems are usually restricted to two-site, point-to-point scenarios. Two commercial state-of-the-art examples are reported in [17] and [18] (Fig.10 and 11).

Fig. 10. Plasma-Lift videoconference table, D+S Sound Lab.

Fig. 11. Examples of the Teleportec system.

The main restriction in all these systems is the use of conventional, unprocessed 2D video, often coded by MPEG-2 or H.263. This makes it impossible to meet a basic requirement of immersive, human-centered communication in videoconferencing, that every participant gets his or her own view of the conference scene. This feature requires a virtual camera tracking the individual viewer's viewpoint and an adaptation of the viewpoint from which the incoming video images are displayed. Given this, the mission of immersive 3D video conferencing can be seen as combining the SVTE concept with adequately processed video streams, consequently taking advantage of both the high grade of realism provided by real video and the versatile functionalities of SVTE systems. The main objective is to offer rich communication modalities, as similar as possible to those used in face-to-face meetings (e.g., gestures, gaze awareness, realistic images, correct sound direction). This would overcome the limitations of conventional video-conferencing and VR-based CVE approaches, in which face-only images shown in separate windows, unrealistic avatars, and missing eye contact impoverish communication.

The most promising video-based SVTE approach is probably the tele-cubicles [19][20] developed within the US National Tele-Immersion Initiative (NTII)[21]. Here, remote participants appear on separate stereo displays arranged in an SVTE-like spatial set-up. A common feature is the symmetric arrangement of participants around the shared table, with each participant appearing in his own screen (Fig. 12). Note that symmetry guarantees consistent eye contact, gaze awareness and gesture reproduction: everybody in the conference perceives consistently who is talking to whom or who is pointing at what (i.e., everybody perceives the same spatial arrangement) and in the correct perspective (i.e., the view is consistent with each individual viewpoint). For example, if the person at the terminal in Fig.12 talks to the one on the left while making a gesture towards the one on the right, the latter can easily recognize that the two others are talking about him. Viewing stereo images with shutter glasses supports the 3D impression of the represented scene and the remote participants.

Fig. 12. Set-up of the tele-cubicle approach of UNC, Chapel Hill and Univ. of Pennsylvania.

The tele-cubicle concept holds undeniable merit, but it still carries disadvantages and unsolved problems. First of all, the specifically arranged displays appear as 'windows' in the offices of the various participants, resulting in a restricted mediation of social and physical presence. Furthermore, the tele-cubicle concept is well suited to a fixed number of participants (e.g., three in the set-up of Fig.12) and limited to single-user terminals only, but does not scale well: any addition of further terminals requires a physical re-arrangement of displays and cameras, simply to adjust the geometry of the SVTE set-up to the new situation. Finally, it is difficult to reconcile the tele-cubicle concept with the philosophy of shared virtual working spaces.

Although the NTII has already demonstrated an integration of telecollaboration tools into their experimental tele-cubicle set-up, the possibility of joint interactions is limited to two participants only, and shared workspaces with more than two partners are hard to achieve because of the physical separation of tele-cubicle windows.

To overcome these shortcomings, a new SVTE concept has been proposed by the IST project VIRTUE (Virtual Team User Environment) [22][23]. It offers all the benefits of tele-cubicles, but extends them by integrating SVTEs with shared virtual working spaces. The main idea is a twofold combination of the SVTE and mixed-reality metaphors. Firstly, a seamless transition between the real table in front of the display and the virtual conference table in the screen gives the user the impression of being part of a single, extended perceptual and working space. Secondly, the remote participants are rendered seamlessly and in the correct perspective into the virtual conference scene using real video with adapted viewpoint. Fig.13 shows the VIRTUE set-up, which has been demonstrated in full for the first time at the Immersive Communication and Broadcast Systems (ICOB) workshop in Berlin in January 2003.

Fig. 13. The VIRTUE demonstrator.

D. Immersive Broadcast Systems

Being present at a live event is undeniably the most exciting way to experience any kind of entertainment. The mission of immersive broadcast services is to bring this experience to users unable to participate in person. A first technical approach was realized by Kanade at the 2001 Super Bowl within the EyeVision project [24]. The objective was to design a new broadcast medium, combining the realism of video or cinema with natural interactions with scene contents, as in VR applications, and to provide immersive home entertainment systems bridging the gap between the different levels of participation and intensity granted by live events and state-of-the-art consumer electronics (Fig.14). For this purpose, immersive broadcast services must incorporate three different features: panoramic large-screen video viewing, stereo viewing and head motion parallax viewing.

Fig. 14. Objective of Immersive TV. [Figure: media such as CD, TV, VoD, interactive TV, cinema, theatre, rock concerts, football and immersive TV arranged by cost versus intensity of experience.]

Panoramic viewing is well-known from large-screen projection techniques in cinema or IMAX theatres. Electronic panorama projection is mainly attractive for digital cinema, but also for other immersive visualization techniques like video walls, “office of the future”, ambient rooms, CAVEs and workbenches. Such large-screen projections require an extremely high definition, say, at least 4000x3000 pixels. Often, the horizontal resolution required is even higher. In contrast, the best digital cinema projectors available on the market are limited to QXGA resolution (2048x1536 pixels) and can be very expensive. Due to these drawbacks, several researchers have proposed to mosaic multiple projections into one large panoramic image. One example is the CineBox approach of the German joint project D_CINEMA. As shown in Fig.15, it is a modular approach using one CineBox as the basic unit. Each CineBox provides an MPEG-2 HD-decoder with extended functionality.

Fig. 15. Multiple video projection with six cascaded CineBoxes.

It offers electronic blending functions to control a seamless transition from one image to another in overlap areas. In addition, MPEG decoding of various CineBoxes can be synchronized. Hence, cascading CineBoxes for multiple projections is very flexible, and up to six HD-images can be mosaiced into a panoramic view. Further examples and details on multiple projection techniques can be found in [25][26].
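The blending of overlap areas can be pictured with a simple linear cross-fade; the sketch below (Python/NumPy, purely illustrative and not the actual CineBox blending) mosaics two horizontally adjacent tiles that share a known number of overlapping columns:

import numpy as np

def blend_two_tiles(left, right, overlap):
    # Mosaic two equally tall image tiles that share `overlap` columns,
    # cross-fading linearly from the left tile to the right tile.
    h, w_l = left.shape[:2]
    w_r = right.shape[1]
    out = np.zeros((h, w_l + w_r - overlap, 3), dtype=np.float32)
    out[:, :w_l - overlap] = left[:, :w_l - overlap]       # left-only part
    out[:, w_l:] = right[:, overlap:]                      # right-only part
    alpha = np.linspace(1.0, 0.0, overlap)[None, :, None]  # 1 -> 0 ramp
    out[:, w_l - overlap:w_l] = (alpha * left[:, w_l - overlap:] +
                                 (1.0 - alpha) * right[:, :overlap])
    return out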

Lodge [27][28] was the first to propose the use of such techniques for broadcast.

The resulting Immersive TV concept envisages capturing, encoding and broadcasting wide-angle, high-resolution views of live events combined with multi-channel audio, and displaying these events through individual, immersive, head-mounted viewing systems in combination with a head tracker and other multi-sensory devices like motion seats (Fig.16).

Fig. 16. Immersive TV viewed by head-mounted systems.

The most significant feature, however, is that Immersive TV targets a one-way distribution service. This means that, unlike usual VR applications, the same signal can be sold to any number of viewers without the need for the broadcaster to handle costly interactive networks and servers. A possible implementation, similar to the one shown in Fig.15 for digital cinema, is outlined in Fig.17. A large panoramic view is transmitted in the form of multiple HD MPEG-2 streams, synchronized at the receiver and stitched seamlessly into one image.

Fig. 17. Technical implementation of an Immersive TV system. [Figure: HD MPEG-2 recorders/decoders (Hyperbox), DVB transmission, a High-to-Ultra-Definition merger box (H2U-Box / UHDTV merger unit), and HD projectors or an HMD at the display end.]

A special merger unit is used for this purpose, and allows the user to look around the scene while wearing a head-mounted display. Optionally, the stitched HD frames can also be watched jointly by a small group of viewers as a wide-screen panorama projection, for which purpose a given number of HD projectors can be plugged into a single merger unit. An extension of this system towards head motion parallax viewing, called Interactive Virtual Viewpoint Video (IVVV), is reported in [29][30]. Several panoramic viewing systems have been proposed for Internet applications; [31] is a representative example.

Another important cue for Immersive TV is stereo viewing. The exploitation of stereo vision for broadcast services has long been a focus in 3D television (3D-TV). However, most approaches to 3D-TV were restricted to the transmission of two video streams, one for each eye. In contrast, the European IST project ATTEST has recently proposed a new concept for 3D-TV [32][33]. It is based on a flexible, modular and open architecture that provides important system features, such as backwards compatibility to today’s 2D digital TV, scalability in terms of receiver complexity and adaptability to a wide range of different 2D and 3D displays. The data representation and coding syntax of the ATTEST system make use of the layered structure shown in Fig.18. This structure basically consists of one base layer and at least one additional enhancement layer. To achieve backwards compatibility to today’s conventional 2D digital TV, the base layer is encoded by using state-of-the-art MPEG-2 and DVB standards. The enhancement layer delivers the additional information to the 3D-TV receiver.

Fig. 18. Layered coding syntax of the ATTEST 3D-TV concept. [Figure: a DVB/MPEG-2 base-layer decoder and an advanced-layer decoder feed a 3D warp stage; combined with optional head tracking, the output serves single-user or multiple-user 2D and 3D displays.]

The minimum information transmitted here is an associated depth map providing one depth value for each pixel of the base layer, to be able to reconstruct a stereo view from the baseline MPEG-2 stream. Note that the layered structure in Fig.18 is extendable in this sense. For critical video content (e.g., large-scale scenes with a high amount of occlusions) one can add further layers, for example segmentation masks and maps with occluded texture. Hence, the ATTEST concept can be seen as an interesting introduction scenario for immersive TV [34]. It strictly follows an evolutionary approach, being backwards compatible to existing services on the one hand and open for future extensions on the other. In addition, it allows the use of a head tracker to support head motion parallax viewing for both 2D and 3D displays. This is an important feature for immersive broadcast services, because it gives the user the opportunity to interact intuitively with the scene content: for instance, the head position can be changed purposively to watch the same TV scene from different viewpoints. Thus, ATTEST is a first step towards visionary scenarios like free-viewpoint TV, Virtualized Reality, EyeVision or Ray-Space TV, where the user can walk with a virtual camera through moving video scenes [35][36][37][38].
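To illustrate the role of the per-pixel depth map, here is a bare-bones depth-image-based rendering sketch (Python/NumPy, assuming a simplified parallel-camera geometry; an illustration of the general idea, not the ATTEST receiver): each base-layer pixel is shifted horizontally by a disparity derived from its depth to form the second view of a stereo pair.

import numpy as np

def render_virtual_view(color, depth, baseline, focal_length):
    # Forward-warp an image plus per-pixel depth (metres) into a
    # horizontally shifted virtual view, using the parallel-camera relation
    # disparity = focal_length * baseline / depth (in pixels).
    # Disocclusion holes are left black; a complete system would fill them,
    # e.g. from an additional occlusion-texture layer, and would resolve
    # competing writes by depth order.
    h, w = depth.shape
    out = np.zeros_like(color)
    disparity = np.round(focal_length * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            xs = x - disparity[y, x]      # shift towards the new viewpoint
            if 0 <= xs < w:
                out[y, xs] = color[y, x]
    return out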

III. 3D VIDEO PROCESSING FOR IMMERSIVE COMMUNICATIONS

How will 3-D computer vision and video processing play a role in the scenarios described above? Which techniques are most likely to be needed, developed, and integrated in the immersive communication systems of the future? In Fig.19, we attempt a schematic representation of our answer, focusing especially on the systems described in sections II.C and II.D. From an input set of video sequences the system generates, in general, a new video sequence, e.g., seen from a different viewpoint. Some calibration is generally necessary; this is either pre-computed by an off-line procedure or obtained on-line from a set of features tracked and matched across the sequences. Once the system is calibrated, a description of the scene must be extracted from the video data in order to create the output sequence. To this end we look for a spatial relationship between synchronized frames in the various sequences, via the image-matching module. Once the structure of the scene has been determined, the output sequences can be rendered according to the particular application needs.

In the following we identify some foundational techniques for the application scenarios described in this paper. For each of them we give basic definitions and problems, provide a quick, structured tour of image processing solutions, and try to point out which solutions are feasible for immersive systems.

Fig. 19. Schematic representation of a computer vision system for immersive video applications. [Figure: input video sequences feed calibration, tracking and stereo/multiple-view matching modules; these support scene, illumination and virtual camera models and a scene representation, from which the output video sequence is rendered.]

A. Camera calibration

Camera calibration [39][40] is the process of estimating the parameters involved in the projection equations. Such parameters encode the geometry of the projection taking place in each camera and the positions of the cameras in space, and are necessary information if we want to extract the Euclidean geometry of the scene from 2D images. This is frequently useful or even necessary for immersive systems, for instance to guarantee geometric consistency across different terminals, or to select the correct size of synthetic objects to be integrated in a virtual scene. Calibration algorithms abound in the literature, but, given the practical importance of the topic, it seems appropriate to include a concise discussion of recent developments. We identify two main classes of camera calibration techniques: Euclidean calibration and self-calibration.

1) Euclidean calibration

Euclidean calibration methods estimate the values of the projection parameters with no ambiguity apart from their intrinsic accuracy limits. These methods are based on the observation of a calibration object, the 3-D geometry of which is known to high accuracy. Examples of this approach are given in [39][41][42][43][44]. These methods can be cumbersome, as they require particular set-ups (e.g., special calibration objects are necessary, sometimes to be placed in particular positions). This can be a drawback, especially for large-market applications, where users cannot be expected, in general, to go through complicated set-up procedures. Therefore easier and more flexible algorithms are ideally required, for which little or no set-up is necessary. Some steps in this direction have been made recently: in [45] the calibration is performed from at least two arbitrary views of a planar calibration pattern, which can be constructed easily by printing dots on a sheet of paper. More recently the same author presented an algorithm for calibration from one-dimensional objects [46].
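As an illustration of how little set-up such planar methods need, the sketch below runs a calibration in the spirit of [45] with off-the-shelf tools (Python with OpenCV; the chessboard target, square size and file names are assumptions made for the example):

import glob
import cv2
import numpy as np

pattern_size = (9, 6)      # inner corners of the planar (chessboard) target
square = 0.025             # side of one square in metres (assumed)

# 3-D corner coordinates in the pattern's own frame (the pattern plane Z = 0).
obj = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
obj[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for fname in glob.glob('calib_*.png'):          # hypothetical image names
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(obj)
        img_points.append(corners)

# K: 3x3 intrinsic matrix; dist: lens distortion; rvecs/tvecs: the pattern
# pose (i.e. the extrinsic parameters) for every view.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print('RMS reprojection error:', rms)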

Most immersive communications systems include several cameras (for instance, VIRTUE uses 4 cameras), which must be calibrated within the same world reference frame. This can be achieved simply with ad hoc calibration patterns (a multifaceted one, for instance) visible from all cameras, but more specialized algorithms exist, especially for the case of 2-camera stereo [44][47]. For systems with more than two cameras, the reader is referred to [48][49][50].

2) Self-calibration

Where several views of a scene are available (e.g., multiple-camera systems, moving cameras), a full Euclidean calibration may be difficult to achieve, or even unnecessary. In these cases it is still possible to achieve a weak calibration, which still gives the geometric relation among the different cameras, but in a projective space. In practical terms, this involves computing an algebraic structure encoding the constraints imposed by the multi-camera geometry. These structures encode important relations among images (e.g., useful constraints for correspondence search) and are computed directly from image correspondences. From these structures it is possible to recover (up to a projective transformation) the camera parameters.

The algebraic structures mentioned above are: the fundamental matrix [51] for a two-camera stereo system, the trifocal tensor [52] for a three-camera system, and the quadrifocal tensor for a four-camera system. The list stops here, as it has been proved that no such structures combining more than four images exist [53].
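For the two-camera case, the fundamental matrix can be estimated directly from a handful of point correspondences; a minimal sketch (Python with OpenCV; the correspondences here are synthetic placeholders standing in for the output of a sparse matching step):

import cv2
import numpy as np

# Placeholder correspondences mimicking a rectified pair with varying
# disparity; in practice pts1/pts2 come from a feature-matching stage.
rng = np.random.default_rng(0)
pts1 = rng.uniform(0, 640, (50, 2)).astype(np.float32)
disp = (30.0 / rng.uniform(0.5, 2.0, (50, 1))).astype(np.float32)
pts2 = pts1 - np.hstack([disp, np.zeros((50, 1), np.float32)])

# Robust estimation; RANSAC rejects mismatched pairs.
F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)

# Epipolar constraint: for corresponding homogeneous points x (first view)
# and x' (second view), x'^T F x = 0 up to noise.
x, xp = np.append(pts1[0], 1.0), np.append(pts2[0], 1.0)
print('epipolar residual:', float(xp @ F @ x))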

Recent research on calibration has focused on this problem, called self-calibration: estimating the camera parameters from weakly calibrated systems and with little or no a priori Euclidean information about the 3-D world. The main assumption made by this class of algorithms is the rigidity of the scene [54][55][56][57]. A critical review of self-calibration techniques can be found in [58].

The highest potential of self calibration for immersive communications is probably in the entertainment industry, e.g., mixing movies from different sources, or creating augmented reality film effects with no set rigs or additional structures (see for instance the BouJou system by 2D3 [59]). For applications requiring a strong sense of physical presence, such as immersive teleconferencing, it is usually acceptable to fully calibrate the system off-line.

B. Multiview correspondence

Multiview correspondence, or multiple view matching, is the fundamental problem of determining which parts of two or more images (views) are projections of the same scene element. The output is a disparity map for each pair of cameras, giving the relative displacement, or disparity, of corresponding image elements (see Fig.20, Fig.21 and Fig.22). Disparity maps allow us to estimate the 3-D structure of the scene and the geometry of the cameras in space.

Fig.20: Camera arrangement for the VIRTUE setup (left-hand stereo pair).

Fig.21: Resulting original stereo pair acquired from cameras a and b.

Fig.22: Visualization of the associated disparity maps from camera a to b (left) and from b to a (right). (Notice the inversion of grey levels encoding disparity.)

Passive stereo [40] remains one of the fundamental technologies for estimating 3-D geometry. It is desirable in many applications because it requires no modifications to the scene, and because dense information (that is, at each image pixel) can nowadays be achieved at video rate on standard processors for medium-resolution images (e.g., CIF, CCIR) [60][61][62]. For instance, systems in the late '90s already reported a frame rate of 22Hz for images of size 320x240 on a Pentium III at 500 MHz [63].

The availability of real-time disparity maps also enables segmentation by depth, which can be useful for layered scene representation [34][64][65][66]. Large-baseline stereo, generating significantly different images, can be of paramount importance for some SVTE applications, as it is not always possible to position cameras close enough to achieve small baselines, or because doing so would imply using too many cameras given speed or bandwidth constraints. The VIRTUE system [22] is an example: four cameras can only be positioned around a large plasma screen, and using more than four cameras would increase delay and latency beyond acceptable levels for usability (but see recent systems using high numbers of cameras [37][67][68]).

There are two broad classes of correspondence algorithms, seeking to achieve, respectively, a sparse set of corresponding points (yielding a sparse disparity map) or a dense set (yielding a dense disparity map).

1) Sparse disparities and rectification

Determining a sparse set of correspondences among the images is a key problem for multiview analysis. It is usually performed as the first step in order to calibrate (fully or weakly) the system, when nothing about the geometry of the imaging system is known yet, and no geometric constraint can be used to help the search.

We can classify the algorithms presented in the literature so far into two categories, feature matching and template matching. Algorithms in the first category select feature points independently in the two images, then match them using tree searching, relaxation, maximal clique detection or string matching [69][70][71][72]. A different algorithm is given in [73], which presents an interesting and easy-to-implement algebraic approach based on point positions and correlation measures. Algorithms in the second category select templates in one image (usually patches with some texture information), and then look for corresponding points in the other image using a similarity measure [39][74][75]. The algorithms in this class tend to be slower than those in the first class, as the search is less constrained, but it is possible to speed up the search in some particular cases [76].
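A modern, off-the-shelf instance of the feature-matching category looks as follows (Python with OpenCV, using ORB features as a stand-in for the detectors of the era; image file names are hypothetical):

import cv2

img1 = cv2.imread('view_a.png', cv2.IMREAD_GRAYSCALE)   # hypothetical files
img2 = cv2.imread('view_b.png', cv2.IMREAD_GRAYSCALE)

# Detect feature points independently in each image and describe them.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching; cross-checking keeps only mutually best
# matches, a simple consistency constraint.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Sparse correspondences: image coordinates of the matched feature points.
pts1 = [kp1[m.queryIdx].pt for m in matches]
pts2 = [kp2[m.trainIdx].pt for m in matches]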

The search for matches between two images is simplified and sped up if the two images are warped in such a way that corresponding points lie on the same scanline in both images. This process is called rectification [40]. The rectified images can often be regarded as acquired by cameras rotated with respect to the original ones, or images of these cameras projected onto the same plane. Most of the stereo algorithms in the literature assume rectified images.
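Given sparse correspondences and the fundamental matrix, rectifying homographies can be obtained with standard tools; a minimal uncalibrated-rectification sketch (Python with OpenCV; pts1, pts2, F and the two images are assumed to come from the steps above, and the image size is an assumption):

import cv2
import numpy as np

h, w = 576, 720                      # image size, assumed (CCIR-like)
p1 = np.float32(pts1)                # correspondences from the matching step
p2 = np.float32(pts2)

# Homographies H1, H2 that map epipolar lines onto corresponding scanlines.
ok, H1, H2 = cv2.stereoRectifyUncalibrated(p1, p2, F, (w, h))

# After warping, corresponding points share the same row, so the disparity
# search of the next subsection becomes one-dimensional.
rect1 = cv2.warpPerspective(img1, H1, (w, h))
rect2 = cv2.warpPerspective(img2, H2, (w, h))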

2) Dense disparities

Dense stereo matching is a well-studied topic in image analysis [39][40]. An excellent review, including suggestions for comparative evaluation, is given in [78] (a methodology for evaluating the performance of a stereo matching algorithm for the particular task of immersive media has recently been proposed in [77]); we refer the reader to this paper for an exhaustive list of algorithms. Here we give a general discussion of dense matching, and focus on large-baseline matching given its importance for advanced visual communication systems.

The output of a dense matching algorithm is a disparity map. As already mentioned (Section III.A.2, self-calibration), the matching image points must satisfy geometric constraints imposed by algebraic structures such as the fundamental matrix for two views, plus other constraints (physical and photometric). These include order (if two points in two images match, then matches of nearby points should maintain the same order); smoothness (the disparities should change smoothly around each pixel); and uniqueness (each pixel cannot match more than one pixel in any of the other images).

Points are usually matched using correlation-like correspondence methods [78]: given a window in a frame, standard methods in this class explore all possible candidate windows within a given search region in the next frame, and pick the one optimizing an image similarity (or dissimilarity) metric. Typical metrics include SSD (sum of squared differences), SAD (sum of absolute differences), or correlation. Typically the windows are centered around the pixel for which we are computing the disparity. This choice can give poor performance in some cases (e.g., around edges). Results can be improved by adopting multiple-window matching, where different windows centered at different pixels are used [79][80], at the cost of a higher computational time.
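A bare-bones version of such a correlation-based matcher (Python/NumPy; single fixed window, SAD metric, rectified grey-level images assumed; real-time systems add the recursive and multi-window refinements discussed above):

import numpy as np

def sad_block_matching(left, right, max_disp=64, half_win=3):
    # Dense disparity map from a rectified grey-level pair by
    # winner-takes-all SAD block matching, left image as reference.
    h, w = left.shape
    pad = half_win
    L = np.pad(left.astype(np.float32), pad, mode='edge')
    R = np.pad(right.astype(np.float32), pad, mode='edge')
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        for x in range(w):
            win_l = L[y:y + 2 * pad + 1, x:x + 2 * pad + 1]
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x) + 1):   # search the same scanline
                win_r = R[y:y + 2 * pad + 1, x - d:x - d + 2 * pad + 1]
                cost = np.abs(win_l - win_r).sum()  # SAD dissimilarity
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp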

Computation of disparity maps can be expensive, but some tricks can be used to speed up the computation of the similarity measure, such as box filtering techniques [78] and partial distances [81]. Good guidelines for an efficient implementation of stereo matching algorithms on state-of-the-art hardware are given in [82] and [83].
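The box-filtering idea can be sketched as follows (Python/NumPy, illustrative only): for each candidate disparity, the per-pixel absolute differences are aggregated over the matching window with an integral image, so the cost of the window sum no longer depends on the window size.

import numpy as np

def sad_disparity_boxfilter(left, right, max_disp=64, half_win=3):
    # Winner-takes-all SAD matching where, for each candidate disparity,
    # window sums are computed for all pixels at once via an integral image.
    h, w = left.shape
    k = 2 * half_win + 1
    costs = np.full((max_disp + 1, h, w), np.inf, dtype=np.float32)
    for d in range(max_disp + 1):
        diff = np.full((h, w), 255.0, dtype=np.float32)
        diff[:, d:] = np.abs(left[:, d:].astype(np.float32) -
                             right[:, :w - d].astype(np.float32))
        ii = np.zeros((h + 1, w + 1), dtype=np.float32)   # integral image
        ii[1:, 1:] = diff.cumsum(0).cumsum(1)
        sums = ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]
        costs[d, half_win:h - half_win, half_win:w - half_win] = sums
    return costs.argmin(axis=0)   # disparity minimizing the aggregated SAD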

3) Large-baseline matching

This is the difficult problem of determining correspondences between significantly different images, typically because the cameras' relative displacement or rotation is large. This problem is very important for our scenario, see for instance the VIRTUE terminal (Fig.13). As a consequence of the significant difference between the images, direct correlation-based matching fails at many more locations than in small-baseline stereo. From an algorithmic point of view, the images of a large-baseline stereo pair lead to significant disparities, and may present considerable amounts of relative distortion and occlusion.

Large camera translations and rotations induce large disparities in pixels, thus forcing search algorithms to cover large areas and increasing the computational effort. Large displacements between cameras may also introduce geometric and photometric distortions, which complicate image matching. As to occlusions, the farther apart the viewpoints, the larger the image areas that are occluded (i.e., visible to one camera but not to the other). The problem of occlusions can be partially solved, at the cost of extra computation, in multi-camera systems as long as every scene point is imaged by at least two cameras [84][85][86]. However, in practice, increasing the number of cameras may increase the risk of unacceptably high delay and latency.

Solutions to the problem of large-baseline matching include intrinsic curves, coarse-to-fine approaches, maximal regions [88] and other invariant regions [88], [89]. Intrinsic curves [90] are an image representation that transforms the stereo matching problem into a nearest-neighbor problem in a different space. The interest of intrinsic curves here is that they are ideally invariant to disparity, so that they support matching, theoretically, irrespective of disparity values. In coarse-to-fine approaches [91][92][93] matching is performed at increasing image resolutions. The advantages are that exhaustive search is performed only on the coarsest-resolution image, where the computational effort is minimal, and only localized search takes place on high-resolution images. Approaches based on invariant features rely on properties that remain unchanged under (potentially strong) geometric and photometric changes between images. Very good results have been reported, but computational costs are usually high and direct application to real-time telepresence systems unfeasible. Indeed, all the techniques above are still too time-consuming if the target is a full-resolution disparity map for full-size video at frame rate. [60] and [94] are two approaches addressing this point within immersive communications, by exploiting the redundancy of information in video sequences. The former [60] unifies the advantages of block-recursive disparity estimation and pixel-recursive optical flow estimation in one common scheme, leading to a fast matching algorithm. The latter [94] uses motion detection to reduce the number of pixels at which disparity is computed.

C. View synthesis

View synthesis addresses the problem of generating convincing virtual images of a 3-D scene from real images acquired from different viewpoints, without reconstructing explicit 3-D models of the scene. In other words, the target is to generate the image that would be acquired by a specified virtual camera from an arbitrary point of view, directly from video or images. The range of applications for such techniques is very wide, including collaborative environments, computer games, virtual and augmented reality, and it is of paramount importance for immersive communication systems (see Fig. 23 for some examples).

The generation of virtual images has been for years the territory of computer graphics. The classic approach generates synthetic images using a 3-D geometric model of the scene, a photometric model of the illumination, and a model of the camera projection. A CAD model is typically adopted for the scene, and it must be created manually or obtained from 3-D sensors (time of flight, triangulation, passive stereo). Colors are obtained by mapping texture onto the model; the texture can again be artificial or obtained from real images.

Fig.23: Frames from a synthetic fly-over around a moving speaker, generated from the synchronized stereo sequence from which the pair in Fig.20 was extracted.

In recent years a new, alternative trend for the generation of synthetic views has emerged, which is based on real images only. This approach, called Image Based Rendering [95][96] (henceforth IBR), aims to generate photorealistic synthetic images and videos solely from multiple, real images, which capture all the necessary information about a scene under real illumination conditions. There is no need for complex, 3-D geometric models or physics-based radiometric simulations (e.g., light sources, surface reflectance) to achieve realism, as the realism is in the images themselves [97]. Some modeling is however still necessary to guarantee the consistency of the synthetic images.

It is impossible to generate views of a generic scene without any 3-D information. Most view synthesis algorithms do not compute 3-D structure explicitly, but need dense disparity maps between the input images, information intimately related with the 3-D structure of the scene. As a consequence, the quality of the synthetic views depends crucially on the quality of the disparity maps. Where the accuracy of disparities degrades (typically and especially in occluded areas) artifacts are introduced in the novel view.

A class of techniques that, following [97], we call here CAD-like modeling represents a compromise between the classic computer graphics approach (using full geometric and radiometric models) and IBR methods (using only images). A notable example of CAD-like modeling is the system developed at Carnegie Mellon University by Takeo Kanade’s group [67]. These methods concatenate computer vision and computer graphics modules: they first obtain a 3-D reconstruction of the scene from images, and then use the recovered model to render novel views. The advantage is that it is possible to use existing, specialized rendering hardware, as IBR methods are not yet well supported by hardware and software libraries [98] at the same level as polygon-based computer graphics techniques. However, real-time implementations of IBR techniques do exist [99][100], and hardware supporting image warping is appearing [101].

A classical approach to rendering novel views is image interpolation, introduced in [102] and popularized by QuickTimeVR products [103]. Such methods can only produce images that are intermediate views between two original images (i.e., the virtual camera lies on the baseline between the two real cameras). This approach was adopted in the PANORAMA system [104]. Various researchers have adopted image interpolation for creating novel views; Seitz and Dyer [105] showed that straightforward image interpolation generates views that are not physically valid, i.e., they cannot be produced by any real camera, and derived a criterion for creating correct synthetic views.
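The mechanics of such interpolation can be sketched in a few lines (Python/NumPy; rectified views, purely horizontal disparities, no occlusion handling; an illustration of the idea rather than any published system): corresponding pixels from the two views are blended and placed at a position interpolated according to the disparity.

import numpy as np

def interpolate_view(left, right, disp_l, alpha):
    # Synthesize a view at fraction `alpha` of the baseline between two
    # rectified cameras (alpha = 0 -> left view, alpha = 1 -> right view).
    # disp_l gives, for every left pixel at column x, its match at column
    # x - disp_l in the right image. Disocclusion holes are left black.
    h, w = left.shape[:2]
    out = np.zeros_like(left, dtype=np.float32)
    for y in range(h):
        for x in range(w):
            d = int(disp_l[y, x])
            xr = x - d                       # corresponding right-image column
            xv = int(round(x - alpha * d))   # column in the virtual view
            if 0 <= xr < w and 0 <= xv < w:
                out[y, xv] = (1.0 - alpha) * left[y, x] + alpha * right[y, xr]
    return out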

Another way of generating synthetic views is to construct a lookup table approximating the plenoptic function [106] (the plenoptic function is a representation of the flow of light at every 3D position and for every 2D viewing direction) from a series of sample images taken from different viewpoints. An image from an arbitrary viewpoint is then synthesized by interpolating the lookup table. Representative approaches using this technique are [107][108] where, under the assumption of free space, a 4D plenoptic function has been adopted. The results produced are very realistic, but the method needs a very large number of densely sampled images (in the order of 100), and is therefore not easily suitable for on-line applications such as the ones considered in this paper. It is worth pointing out that the method does not need to explicitly compute dense disparities between the images, as the effort is transferred to the image sampling procedure, exploiting the fact that the plenoptic function used is 4D.

Pollard et al. [109] use three sample images (and cameras) and edge transfer to synthesize any view within the triangle defined by the three optical centers; only edge matching is necessary, instead of dense pixel correspondence.

A wider range of views can be created using point transfer based on the principle of the fundamental matrix for re-projection [110], called epipolar transfer. The drawback is a non-natural way of specifying the virtual viewpoint if no calibration is available (the method suggested is to choose the position of four control points in the target image), and the existence of degenerate configurations: the method fails to reproject points lying on the trifocal plane (i.e., the plane containing the three optical centers), and any points at all when the three optical centers are collinear.

More generic and easier to implement is the method presented in [111], which exploits the algebraic relation existing within triplets of images, formalized by the trifocal tensor. The real advantage of this method is that it fails only when re-projecting points on the baseline between the two real cameras. The paper also suggests a simple way to specify the novel viewpoint that can be used for epipolar transfer as well.

To generate a synthetic view, in general, correspondence information between the original images is necessary, and it is not surprising that the quality of the synthetic images strongly depends on the quality of the disparity maps. Moreover, the presence of occluded areas (i.e., 3D parts of the scene visible in only one image, for which no disparity values are available) can create disturbing artifacts.

It is worth mentioning at this point that IBR techniques are not conceptually different from 3D reconstruction plus reprojection [97]: they are indeed a shortcut, as disparity maps supply, in principle, the same information as 3D structure (considering only scene points for which disparity can be computed).

Point re-projection alone, although quite effective, is not enough to cover all the situations occurring in immersive communications, as images are rendered under the same lighting conditions as the original scene. Recent work has sought to incorporate illumination changes into the IBR process. The limitation of standard IBR algorithms is that they assume a static scene with fixed illumination; when the viewpoint is moved, the illumination moves rigidly with it. However, in several applications, e.g., the navigation of environments or augmented reality, illumination variation should be considered. If a linear reflectance model (e.g., Lambertian) applies and light sources are assumed at infinity, then every light direction can be synthesized as a linear combination of three static images taken under different light directions [112]. More recently it has been shown [113] that an image of a Lambertian surface obtained with an arbitrary distant light source can be approximated by a linear combination of a basis of nine images of the same scene under different lighting conditions.
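Under such a linear model, relighting reduces to simple image algebra: a synthetic image is a weighted sum of the basis images, and the weights reproducing a given lighting can be recovered by least squares. The sketch below is a minimal illustration assuming registered, grey-level basis images stacked in a single array; the function names are ours.

```python
import numpy as np

def relight(basis_images, coefficients):
    """Synthesise a new illumination as a linear combination of
    basis images (Lambertian scene, distant light sources).

    basis_images : array of shape (K, H, W), one image per basis light.
    coefficients : length-K weights describing the new light condition.
    """
    basis = basis_images.reshape(len(basis_images), -1)   # K x (H*W)
    out = coefficients @ basis                             # (H*W,)
    return out.reshape(basis_images.shape[1:])

def fit_light_coefficients(basis_images, target_image):
    """Recover the mixing weights that best reproduce a given image
    under the same linear model (ordinary least squares)."""
    A = basis_images.reshape(len(basis_images), -1).T      # (H*W) x K
    b = target_image.ravel()
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs
```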

Unfortunately these assumptions are not realistic in most cases, so different techniques are needed. The techniques reported can be divided into two categories: estimation of the Bidirectional Reflectance Distribution Function (BRDF), and photometric IBR.

Methods in the first class generally require knowledge of the 3-D structure of the scene and information about the lighting conditions. They recover the complete BRDF from several static images acquired under different lighting conditions [114][115], or even from a single image [116]. Typically these methods are very slow and may require up to several hours of computation. Photometric IBR, instead, does not recover any reflectance model, but uses a set of basis images under different lighting conditions as a model of the BRDF. Images under synthetic lighting are obtained by interpolating the basis images [117][118][119]. However, these methods are still unwieldy for real-time applications, as computing reflectance models can be too computationally expensive, and is difficult with dynamic scenes.

All the methods mentioned so far apply to rigid scenes and fail if this condition is not satisfied. This case has been addressed by recent work modeling non-rigid surfaces as stochastically rigid [120].

D. Tracking
Video tracking is the problem of following moving targets through an image sequence. We mention it last because it is a ubiquitous module, and its importance for the scenarios introduced in the first part of this paper cannot be overstated. For instance, videoconferencing systems incorporating 3-D effects need head tracking to estimate the 3-D head position of the viewer in order to generate images of the participants. In VIRTUE [22], head position estimates are obtained by passive video tracking, but a variety of head tracking technologies exist, either passive [121][122] or active [123]. Here, passive tracking means that the target (in this case the head) is tracked using optical sensors placed in the scene, following a set of landmarks on the moving user (which can be natural landmarks such as the nose or eyes), whereas active tracking means that the optical sensors are mounted on the moving user and the landmarks are fixed targets in the scene.

In augmented and mixed reality applications, inserting CAD elements into real video consistently with the current image is a key requirement. "Consistently" means that the size, position and shading of synthetic objects must match those of the surrounding image or real objects. This requires tracking the motion of real objects and the egomotion of the camera (or equivalent information), usually achieved by tracking subsets of image points. In terms of performance, the following properties are sought in a video tracker: robustness to clutter (the tracker should not be distracted by image elements resembling the target being tracked), robustness to occlusion (tracking should not be lost because of temporary target occlusion, but resumed correctly when the target re-appears), few or no false positives and negatives, agility (the tracker should follow targets moving with significant speed and acceleration), and stability (lock and accuracy should be maintained indefinitely over time).

We now review briefly the existing classes of tracking systems, from the image processing point of view, in increasing order of target complexity, culminating with methods learning the shape and dynamics of non-rigid targets.

1) Window tracking
Window tracking is usually performed with the same correlation-like measures used for stereo matching, so the two problems are practically the same. However, in the context of tracking, some patterns of grey levels are better than others, as they guarantee better numerical properties [124].
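As a concrete illustration, the following sketch tracks a window between consecutive frames by maximising normalised cross-correlation over a search region centred on the previous position, using OpenCV's standard template matching; the function name and parameter values are illustrative only.

```python
import cv2

def track_window(prev_frame, next_frame, top_left, size, search_radius=20):
    """Track a rectangular window between two grey-level frames by
    normalised cross-correlation, the same similarity measure used
    for area-based stereo matching.

    top_left : (x, y) of the window in prev_frame.
    size     : (w, h) of the window.
    Returns the (x, y) of the best match in next_frame.
    """
    x, y = top_left
    w, h = size
    template = prev_frame[y:y + h, x:x + w]

    # restrict the search to a region around the previous position
    x0 = max(x - search_radius, 0)
    y0 = max(y - search_radius, 0)
    x1 = min(x + w + search_radius, next_frame.shape[1])
    y1 = min(y + h + search_radius, next_frame.shape[0])
    region = next_frame[y0:y1, x0:x1]

    scores = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(scores)
    return (x0 + max_loc[0], y0 + max_loc[1])
```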

2) Feature tracking
The next, more complex targets are image elements with specific properties, called image features, which we define as detectable parts of an image that can be used to support a vision task. Notice that this definition is purely functional to our discussion and does not capture the many usages of the word "feature". Feature tracking proceeds by first locating features in two subsequent frames, then matching each feature in one frame with a feature in the other (if such a match exists). Notice that the two processes are not necessarily sequential.

Tracking local features. Local features cover limited areas of an image (e.g., edges, lines, corner points). Their key advantage over image windows is that local features are invariant, within reasonable limits, to image changes caused by scene changes, and can be expected to remain detectable over more frames. Moreover, extracting features substantially reduces the volume of data passed on for further processing. The disadvantages of local features are limited suggestiveness (i.e., local features may not correspond to meaningful parts of 3-D objects), the fact that they do not appear only on the target object (e.g., edges and corners are generally detected on both target and background, increasing the risk of false associations), and sensitivity to clutter. Typical image elements used as local features are intensity edges [125][126], lines [127], and the ever-popular corners [128][129]. A good review of local feature tracking is given in [130]. A typical use of local feature tracking in computer vision is structure from motion from video sequences; a major application in media technology is BouJou [59] and, more relevant to the scope of this review, the head tracking already mentioned.
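A minimal corner-tracking sketch in the spirit of [124][129] is given below: Shi-Tomasi corners are detected in one frame and followed into the next with pyramidal Lucas-Kanade optical flow, using standard OpenCV routines; the parameter values are illustrative and not those of any system cited above.

```python
import cv2
import numpy as np

def track_corners(prev_gray, next_gray, max_corners=200):
    """Detect corner features in the first frame and track them into
    the next one with pyramidal Lucas-Kanade optical flow.

    Returns matched point arrays (prev_pts, next_pts) for the
    features that were tracked successfully.
    """
    # Shi-Tomasi corner detector ("good features to track")
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=7)
    if prev_pts is None:
        return np.empty((0, 2)), np.empty((0, 2))

    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(21, 21), maxLevel=3)

    ok = status.ravel() == 1
    return prev_pts[ok].reshape(-1, 2), next_pts[ok].reshape(-1, 2)
```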

Tracking extended features. Extended features cover a larger part of the image than local features; they can be contours of basic shapes (e.g., ellipses, rectangles), free-form contours, or image regions. Typical examples of the last two classes are heads and eyes. The main advantage of extended features is their greater robustness to clutter [131], as they rely on larger image support. In other words, one expects a higher risk of false positives and negatives with local features than with extended ones. Another advantage is that extended features are related more directly to significant 3-D entities (e.g., circles in space always appear as ellipses in images). The price is more complex matching and motion algorithms. Unlike local features, extended features can change substantially over time: consider, for instance, the image of a person walking. Here a contour tracker must incorporate not only a motion model, but also a shape deformation model constraining the possible deformations. Considering that a discrete contour can be formed by several tens of pixels, the search space can grow unwieldy. Extended image features can change because the corresponding 3-D entities are moving rigidly, as for a rotating circle, or changing shape, as for a walking human; deformable objects combine both effects. Clearly, it is easier to predict the appearance of moving, rigid 3-D shapes [132][133][134][135][136] than that of moving and deforming 3-D objects [137][138][139][140]. Devising sufficiently general models for the latter is very difficult, so several authors have turned to visual learning techniques [141][142] or to various templates [143]. In immersive communications, tracking targets are frequently parts of the human body, for instance the head [144], eyes (see Fig.24), hands [145], and legs [146]. A minimal sketch of the stochastic prediction-measurement cycle used by many such trackers is given below, after Fig.24.

Fig.24: Example of eye tracking in a videoconferencing sequence.
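The following is a minimal, generic sketch of the resample-predict-measure cycle underlying Condensation-style trackers [131]; the state parameterization (e.g., the parameters of a contour or of an affine deformation), the image likelihood, and all names are left abstract and are ours, not taken from the cited work. In a contour tracker, the state would encode the allowed deformations of a learned shape model and the likelihood would measure image evidence (e.g., edge strength) along the hypothesised contour.

```python
import numpy as np

def condensation_step(particles, weights, measure, motion_noise=2.0):
    """One resample-predict-measure cycle of a Condensation-style
    particle filter for tracking (cf. [131]).

    particles : (N, d) array of state hypotheses.
    weights   : length-N importance weights from the previous step.
    measure   : callable returning the image likelihood of a state.
    """
    n = len(particles)
    # resample states in proportion to their previous weights
    idx = np.random.choice(n, size=n, p=weights / weights.sum())
    particles = particles[idx]

    # predict: diffuse each hypothesis under a simple stochastic
    # motion model (here, additive Gaussian noise on the state)
    particles = particles + np.random.normal(0.0, motion_noise,
                                             particles.shape)

    # measure: re-weight each hypothesis by its image likelihood
    weights = np.array([measure(p) for p in particles], dtype=float)
    weights += 1e-12                # avoid an all-zero weight vector
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate
```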

IV. CONCLUSIONS
The main objective of this paper was to offer an overview of immersive communication applications, especially videoconferencing and ITV, together with the 3D video processing techniques that they require. We hope that our classification and discussion of topics, from presence and immersion to shared virtual table environments, can help the reader identify and understand the main recent and future streams of research in this area. The examples mentioned in the first part illustrated different approaches taken in the development of recent, advanced prototypes. The main message here is that the combination of immersive telepresence with shared virtual environments, as emerging in very recent videoconferencing and ITV prototypes, has created a powerful new paradigm for immersive communications, possibly the basis of killer applications for the associated markets.

The main message of the second part is that truly immersive systems require real-time, highly realistic video with continuously adaptive viewpoint, and this can nowadays be achieved by image-based rendering techniques (as opposed to traditional, polygon-based graphics). The key image processing techniques we identified as necessary for future, 3D immersive systems are calibration, multiview analysis, view synthesis and tracking. For each technique, we identified the key problem and discussed some representative solutions from the sometimes very large number of existing approaches. The last three are likely to see a substantial push in applied research, to achieve algorithms meeting the demanding needs of immersive communications.

REFERENCES
[1] W.A. Ijsselsteijn, M. Lombard, J. Freeman: "Toward a Core Bibliography of Presence", CyberPsychology & Behavior, Vol. 4, No. 2, 2001.
[2] W.A. Ijsselsteijn, J. Freeman, H. De Ridder: "Presence: Where are we?", CyberPsychology & Behavior, Vol. 4, No. 2, 2001.

[3] L. Schäfer, S. Küppers: „Camera Agents in a Theatre of Work”, Proc. of the 2002 International Conference on Intelligent User Interfaces (IUI02), San Francisco, CA., January 13-16, 2002. New York: ACM Press, 2002, pp. 218-219.

[4] Buß, R.; Mühlbach, L.; Runde, D.: Advantages and disadvantages of virtual environments for supporting informal communication in distributed workgroups. Proc. of Human Computer Interaction International , 1999, München.

[5] http://www.cs.ucl.ac.uk/research/vr/Projects/Cave/
[6] ACTS: "COVEN: COllaborative Virtual Environments", http://www.crg.cs.nott.ac.uk/research/projects/coven
[7] V. Normand, C. Babski, S. Benford, A. Bullock, S. Carion, Y. Chrysanthou, N. Farcet, E. Frécon, J. Harvey, N. Kuijpers, N. Magnenat-Thalmann, S. Raupp-Musse, T. Rodden, M. Slater, G. Smith, A. Steed, D. Thalmann, J. Tromp, M. Usoh, G. Van Liempd, N. Kladias: "The COVEN project: exploring applicative, technical and usage dimensions of collaborative virtual environments", PRESENCE: Teleoperators and Virtual Environments, 8(2), 1997

[8] O. Ståhl: "Meetings for real - Experiences from a series of VR-based project meetings", Symposium on Virtual Reality Software and Technology, UCL, London, December 1999

[9] D. Sandin et al.: "A Realistic Video Avatar System for Networked Virtual Environments", Immersive Projection Technology Symposium, Orlando, Florida, March 2002

[10] S. Rauthenberg, P. Kauff and A. Graffunder: “The Virtual Meeting Room”, Proc, 3rd Int. Workshop on Presence, March 2000, Delft (NL)

[11] home page of ACTS project MoMuSys: http://www.cordis.lu/infowin/acts/rus/projects/ac098.htm

[12] home page of IST project SonG: http://www.octaga.com/SoNG-Web/
[13] AG Alliance: "Access Grid", Home Page, http://www-fp.mcs.anl.gov/fl/accessgrid/ag-spaces.htm
[14] VRVS: "Virtual Room Video-Conferencing System", Home Page at http://www.vrvs.org/About/index.html
[15] M. Chen: "Design of a Virtual Auditorium", Proc. of ACM Multimedia 2001, Ottawa, Canada, Sept. 2001
[16] Fuqua School of Business: "Global Conference Systeme", Press Release, Duke University, May 2002, http://www.fuqua.duke.edu/admin/extaff/news/global_conf_2002.htm

[17] D+S Sound Labs Inc.: “The Plasma-Lift A/V Conference Table”, http://www.dssoundlabs.com/avtable.htm

[18] Home Page of Teleportec Ltd.: www.teleportec.com
[19] W.C. Chen et al.: "Toward a Compelling Sensation of Telepresence: Demonstrating a portal to a distant (static) office", Proc. of IEEE Visualization 2000, Salt Lake City, UT, USA, Oct. 2000

[20] T. Aoki et. al: “MONJUnoCHIE System : Videoconference System with Eye Contact for Decision Making”, Int. Workshop on Advanced Image Technology (IWAIT), 1999

[21] H. Towles, W.-C. Chen, R. Yang, S.-U. Kum, H. Fuchs, N. Kelshikar, J. Mulligan, K. Daniilidis, L. Holden, B. Zeleznik, A. Sadagic, J. Lanier: "3D Tele-Immersion Over Internet2," Int. Workshop on Immersive Telepresence (ITP2002), Juan Les Pins, France, 6th December 2002

[22] British Telecom: „VIRTUE HOME”, European Union's Information Societies Technology Programme, Project IST-1999-10044, http://www.virtue.eu.com

[23] O. Schreer, P. Sheppard: „VIRTUE - The Step Towards Immersive Tele-Presence in Virtual Video Conference Systems“, Proc. eWorks 2000, Madrid, September 2000

[24] web-site of Takeo Kanade’s Superbowl 2001EyeVision project: http://www.ri.cmu.edu/events/sb35/tksuperbowl.html

[25] Y. Ruigang, D. Gotz, J. Hensley, H. Towles, M. Brown: “PixelFlex: A Reconfigurable Multi-Projector Display System”, IEEE Visualization 2001. San Diego, CA, October 2001

[26] G. Welch, H. Fuchs, R. Raskar, M. Brown, H. Towles: “Projected Imagery In Your Office in the Future”, IEEE Computer Graphics and Applications, July/August 2000, pp.62-67

[27] N. Lodge: “Being Part of the Fun – Immersive Television”, Proc. of Conference of Broadcast Engineering Society of India, New Dehli, February 1999

[28] N.Lodge, D. Harrison: “Being Part of the Action - Immersive Television!”, Proc. of Int. Broadcasting Convention (IBC’99), Amsterdam, September 1999

[29] C. Fehn, P. Kauff, O. Schreer, R. Schäfer: „Interactive Virtual View Video for Immersive TV Applications“, Proc. of Int. Broadcasting Convention, (IBC’00), Amsterdam, September 2000

[30] C. Fehn, E. Cooke, O. Schreer, P. Kauff: "3D Analysis and Image-Based Rendering for Immersive TV Applications", Signal Processing: Image Communication Journal, Special Issue on Image Processing Techniques for Virtual Environments and 3D Imaging, Oct. 2002

[31] T. Pintaric, U. Neumann, A. Rizzo: "Immersive Panoramic Video", Proc. of the 8th ACM International Conference on Multimedia, pp. 493-494, October 2000

[32] M. Op de Beeck, P. Wilinski, C. Fehn, P. Kauff: " Towards an Optimized 3D Broadcast Chain", ITCOM 2002, 3D-TV, Video & Display, SPIE Int. Symposium, Boston, Massachusetts, August 2002

[33] C. Fehn, P. Kauff, M. Op de Beeck, F. Ernst, W. Ijsselsteijn, M. Pollefeys, L. Vangool, E. Ofek, I. Sexton: "An Evolutionary and Optimised Approach on 3D-TV", Proc. of IBC 2002, Int. Broadcast Convention, Amsterdam, Netherlands, Sept. 2002

[34] C. Fehn, P. Kauff: " Interactive Virtual View Video (IVVV) – The Bridge Between 3D-TV and Immersive TV", ITCOM 2002, 3D-TV, Video & Display, SPIE Int. Symposium, Boston, Massachusetts, August 2002

[35] T. Fujii, M Tanimoto: “Free-viewpoint TV system based on ray-space representation”, ITCOM 2002, 3D-TV, Video & Display, SPIE Int. Symposium, Boston, Massachusetts, August 2002

[36] T. Kanade, P. Rander, S. Vedula, H. Saito: “Virtualized Reality: Digitizing a 3D Time-Varying Event As Is and in Real Time”, Mixed Reality, Merging Real and Virtual Worlds, Y. Ohta, H. Tamura, ed., Springer-Verlag, 1999, pp. 41-57

[37] T. Kanade, P. Rander, and P.J. Narayanan: “Virtualized Reality: Constructing Virtual Worlds from Real Scenes”, IEEE Multimedia, Immersive Telepresence, Vol. 4, No. 1, January, 1997, pp. 34-47

[38] NHK: “Free Viewpoint Video Representation System”, http://www.nhk.or.jp/strl/open2002/en/tenji/id08/08index.html

[39] O. Faugeras. Three-Dimensional computer vision: a geometric viewpoint. MIT Press, 1993

[40] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998

[41] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf tv cameras and lenses. IEEE Journal of Robotics and Automation, RA-3(4):323–344, August 1987

[42] O. Faugeras and G. Toscani. Camera calibration for 3D computer vision. In Proceedings of International Workshop on Machine Vision and Machine Intelligence, February 1987

[43] J. Heikkillä and O. Silvén. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1112, 1997

[44] O. Faugeras and G. Toscani. The calibration problem for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 15–20, 1986

[45] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the International Conference on Computer Vision, pages 666–673, 1999

[46] Z. Zhang. Camera calibration with one-dimensional objects. In Proceedings of the European Conference on Computer Vision, volume IV, page 161, 2002

[47] B. Kamgar-Parsi and R.D. Eastman. Calibration of a stereo system with small relative angles. CVGIP, 51(1):1–19, July 1990

[48] F. Pedersini, A. Sarti, and S. Tubaro. Accurate and simple geometric calibration of multi-camera systems. SP, 77(3):309–334, September 1999

[49] H.G. Maas. Image sequence based automatic multi-camera system calibration techniques. International Archives of Photogrammetry and Remote Sensing, 32, 1998

[50] P. Baker and Y. Aloimonos. Complete calibration of a multi-camera network. In Proceedings of the IEEE Workshop on Omnidirectional Vision, 2000

[51] Z. Zhang. Determining the epipolar geometry and its uncertainty: a review. International Journal of Computer Vision, 27(2):161–195, March 1998

[52] R. I. Hartley and A. Zisserman. Multiple view geometry. Cambridge University Press, 2000

[53] T. Moons. A guided tour through multiview relations. In Proceedings of SMILE Workshop, Lecture Notes in Computer Science, vol. 825, pg. 297-316, 1998

[54] S. J. Maybank and O. Faugeras. A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(2):123–151, 1992

[55] Q. T. Luong and T. Vieville. Canonical representations for the geometries of multiple projective views. Computer Vision and Image Understanding, 64(2):193–229, 1996

[56] A. Azarbayejani and A. P. Pentland. Recursive estimation of motion, structure and focal length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6):562–575, June 1995

[57] M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. In Proceedings of the International Conference on Computer Vision, pages 90–95, 1998

[58] A.Fusiello. Uncalibrated Euclidean reconstruction: A review. Image and Vision Computing, 18(6-7):555–563, 2000

[59] http://www.2d3.com/
[60] O. Schreer, N. Brandenburg, S. Askar and P. Kauff. Hybrid recursive matching and segmentation-based postprocessing in real-time immersive video conferencing. Proceedings of the Conference on Vision, Modeling and Visualization, 2001

[61] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence, European Conference on Computer Visions, 1994

[62] K. Muhlmann, D. Maier, J. Hesser and R. Manner. Calculating dense disparity maps from color stereo images, an efficient implementation. International Journal of Computer Vision, 47(1/2/3), pp. 79-88, 2002

[63] K. Konolige. The SRI Small Vision System: http://www.ai.sri.com/~konolige/svs

[64] J. W. Shade. Layered depth images. In Proceedings of SIGGRAPH, 1998

[65] J. Snyder and J. Lengyel. Visibility sorting and compositing without splitting for image layer decomposition. In Proceedings of SIGGRAPH, 1998

[66] E. Trucco, F. Isgro` and F. Bracchi. Plane detection in disparity space. Proceedings of the IEE International Conference on Visual Information Engineering, pg. 73-76, 2003

[67] http://www.ri.cmu.edu/labs/lab_62.html
[68] H. Fuchs, G. Bishop, K. Arthur, L. McMillan, R. Bajcsy, S.W. Lee, H. Farid and T. Kanade. Virtual space teleconferencing using a sea of cameras. Proceedings of the First International Symposium on Medical Robotics and Computer Assisted Surgery, Pittsburgh, PA, 1994

[69] J.K. Cheng and T.S. Huang. Image registration by matching relational structures. Pattern Recognition, 17(1):149–159, 1984

[70] R. Horaud and T. Skordas. stereo correspondence through feature grouping and maximal clique. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1168–1180, 1989

[71] S. Ullman. The interpretation of visual motion. MIT Press, 1989.
[72] D. Tell and S. Carlsson. Combining appearance and topology for wide baseline matching. In Proceedings of the European Conference on Computer Vision, volume I, pages 68–81, 2002

[73] M. Pilu. A direct method for stereo correspondence based on singular value decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 261–266, 1997

[74] A.Goshtasby, S.H. Gage, and J.F. Bartholic. A two stage cross correlation approach to template matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(3):374–378, 1984

[75] C.H. Chou and Y.C. Chen. Moment-preserving pattern matching. Pattern Recognition, 23(5):461–474, 1990

[76] M. Pilu and F. Isgrò. A fast and reliable planar registration method with applications to document stitching. In Proceedings of the British Machine Vision Conference, 2002

[77] J. Mulligan, V. Isler and K. Daniilidis. Performance evaluation of stereo for tele-presence. In Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pg. 558-565, 2001

[78] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7–42, April 2002

[79] A. Fusiello, E. Trucco and A. Verri. Efficient stereo with multiple windowing. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. Pg. 858-863, 1997

[80] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: theory and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9), pg. 920-932, 1994

[81] K. Lengwehasarit and A. Ortega. Probabilistic partial-distance fast matching algorithms for motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 11(2), pages 139-152, 2001

[82] K. Muhlmann, D. Maier, J. Hesser and R. Manner. Calculating dense disparity maps from color stereo images, an efficient implementation. International Journal of Computer Vision, 47(1/2/3), pp. 79-88, 2002

[83] M. Perez and F. Cabestaing. A comparison of hardware resources required by real-time stereo dense algorithms. Proceedings of the IEEE International Workshop on Computer Architecture for Machine Perception, 2003

[84] J. Mulligan, V. Isler and K. Daniilidis. Trinocular stereo: a real-time algorithm and its evaluation. International Journal of Computer Vision, 47(1/2/3), pp. 51-61, 2002

[85] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4), pg. 353-363, 1993

[86] S.B. Kang, R. Szeliski and J. Chai. Handling occlusions in dense multi-view stereo. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2001

[87] J. Matas, O. Chum, M. Urban and T. Pajdla. Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In Proceedings of the British Machine Vision Conference, 2002, pp. 384-393

[88] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 19(5), May, 1997

[89] A. Baumberg. Reliable feature matching across widely separated views. In Proceedings IEEE Int.Conf. on Comp. Vision and Pattern Recognition, 2000, vol I, pp. 774-781

[90] C. Tomasi and R. Manduchi. Stereo matching as nearest-neighbor problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):333–340, 1998

[91] S. Crossley, N.A. Thacker, and N.L. Seed. Benchmarking of bootstrap temporal stereo using statistical and physical scene modelling. In Proceedings of the British Machine Vision Conference, pages 346–355, 1998

[92] L. Matthies and M. Okutomi. Bootstrap algorithms for dynamic stereo vision. In Proceedings of the 6th Multidimensional Signal Processing Workshop, pages 12–22, 1989

[93] M. O’Neil and M. Demos. Automated system for coarse to fine pyramidal area correlation stereo matching. Image and Vision Computing, 14:225–136, 1996

[94] F. Isgrò, E. Trucco, and L.Q. Xu. Towards teleconferencing by view synthesis and large-baseline stereo. In Proceedings of the IAPR International Conference in Image Analysis and Processing, September 2001, pp. 198–203

[95] S. B. Kang. A survey of image-based rendering techniques. Technical Report 97/4, Digital Equipment Corporation, Cambridge Research Laboratory, 1997

[96] L. McMillan and G. Bishop. Plenoptic modeling: an image-based rendering system. In Proceedings of SIGGRAPH95, pg. 39-46, 1995

[97] Z. Zhang. Image-based geometrically-correct photorealistic scene/object modeling: a review. In Proceedings of Asian Conference on Computer Vision, pages 231–236, 1998

[98] T. Whitted. Overview of IBR: software and hardware issues. In Proceedings of the IEEE International Conference on Image Processing, volume 2, pages 1–4, 2000

[99] V. Popescu, A. Lastra, D. Alliaga and M. de Oliveira Neto. Efficient warping for architectural walkthroughs using layered depth images. In Proceedings of IEEE Visualization'98, pp. 211-215, 1998

[100] V. Popescu. Forward rasterization: a reconstruction algorithm for image-based rendering. PhD thesis, University of North Carolina at Chapel Hill, 2001

[101] J. Torborg and J. Kajiya. Talisman: commodity real-time 3D graphics for the PC. In Proceedings of SIGGRAPH, pg. 353-363, 1998

[102] S. E. Chen and L. Williams. View interpolation for image synthesis. In Proceedings of SIGGRAPH 93, pages 279–288, 1993

[103] S. E. Chen. Quicktime VR - an image-based approach to virtual environment navigation. In Proceedings of SIGGRAPH 95, 1995

[104] http://www.tnt.uni-hannover.de/project/eu/panorama/overview.html
[105] S. M. Seitz and C. R. Dyer. Physically-valid view synthesis by image interpolation. In Proceedings of the IEEE Workshop on Representations of Visual Scenes, pages 18–25, 1995

[106] L. McMillan and G. Bishop. Plenoptic modeling: an image-based rendering system. In Proceedings of SIGGRAPH95, pg. 39-46, 1995

[107] S.J. Gortler, R. Grzesczuk, R. Szeliski and M.F. Cohen. The lumigraph. In Proceedings of SIGGRAPH’96, pg. 43-54, 1996

[108] M. Levoy and P. Hanrahan. Light field rendering. In Proceedings of SIGGRAPH’96, pg. 31-42, 1996

[109] S. Pollard, M. Pilu, S. Hayes, and A. Lorusso. View synthesis by trinocular edge matching and transfer. In Proceedings of the British Machine Vision Conference, pages 770–779, 1998

[110] S. Laveau and O. Faugeras. 3-D scene representation as a collection of images. In Proceedings of the IAPR International Conference on Pattern Recognition, pages 689–691, 1994

[111] S. Avidan and A. Shashua. Novel view synthesis in tensor space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1034–1040, 1997

[112] A. Shashua. Illumination and view position in 3D visual recognition. In S.E. Moody, S.J. Hanson, and R.P. Lippman, editors, Advances in Neural Information Processing Systems, pages 404–411. Morgan Kaufmann Publishers, 1992

[113] R. Basri and D.W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2), pp. 218-233, 2003

[114] K.J. Dana, B. van Ginneken, S.K. Nayar, and J.J. Koenderink. Reflectance and texture of real-world surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 151–157, 1997

[115] Y. Yu, P. Debevec, J. Malik, and T. Hawkins. Inverse global illumination: recovering reflectance models of real scenes from photographs. In Proceedings of SIGGRAPH, 1999

[116] S. Boivin and A. Gagalowicz. Image-based rendering of diffuse, specular and glossy surfaces from a single image. In Proceedings of SIGGRAPH, 2001

[117] Z. Zhang. Modeling geometric structure and illumination variation of a scene from real images. In Proceedings of the International Conference on Computer Vision, pages 1041–1046, 1998

[118] Y. Mukaigawa, S. Mihashi, and T. Shakunaga. Photometric image-based rendering for virtual lighting image synthesis. In Proceedings of the IEEE and ACM International Workshop on Augmented Reality, 1999

[119] Y. Mukaigawa, H. Miyaki, S. Mihashi, and T. Shakunaga. Photometric image-based rendering for image generation in arbitrary illumination. In Proceedings of the International Conference on Computer Vision, 2001

[120] A. Fitzgibbon. Stochastic rigidity: Image registration for nowhere-static scenes. Proceedings of the International Conference on Computer Vision 2001, Volume 1, Pages 662-670, 2001

[121] M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Robust Registration of Texture-Mapped 3D Models. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 22(4), April, 2000

[122] S. Birchfield. Elliptical Head Tracking Using Intensity Gradients and Color Histograms. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, California, pages 232-237, June 1998

[123] G. Welch, G. Bishop, L. Vicci, S. Brumback, K. Keller and D. Colucci. The HiBall tracker: high-performance wide-area tracking for virtual and augmented environments. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pp. 20-22, 1999

[124] J. Shi and C. Tomasi. Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, 1994

[125] L. Bretzner and T. Lindeberg. Feature tracking with automatic selection of spatial scale. Computer Vision and Image Understanding, 71(3):385-391, 1998

[126] H. Gu, M. Asada and Y. Shirai. The optimal partition of moving edge segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, pages 367-372, 1993

[127] R. Deriche and O. Faugeras. Tracking line segments. In Proceedings of the European Conference on Computer Vision. Pg. 259-268, 1990

[128] H. Wang and M. Brady. Real-time corner detection algorithm for motion estimation. Image and Vision Computing. 13(9):695-705, 1995

[129] J. Shi and C. Tomasi. Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pg. 593-600, 1994

[130] S. Smith. Literature review on feature-based tracking approaches. In Cvonline, http://www.dai.ed.ac.uk/CVonline/motion.htm, 1999

[131] M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998

[132] K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press, 1993

[133] D. Lowe. Robust model-based motion tracking through the integration of search and estimation. International Journal of Computer Vision, 8:113–122, 1992

[134] M. Pilu, A. W. Fitzgibbon, and R. B. Fisher. Ellipse-specific least-squares fitting. In Proceedings of the International Conference on Pattern Recognition, 1996

[135] E. Marchand and G. D. Hager. Dynamic sensor planning in visual servoing. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 1, pages 1988–1993, 1998

[136] C. E. Smith and N.P. Papanikolopoulos. Grasping of static and moving objects using a vision-based control approach. Journal of Intelligent and Robotic Systems, 19:237–270, 1997

[137] A. Blake and M. Isard. Active Contours. Springer-Verlag, London, 1998
[138] Y. Ricquebourg and P. Bouthemy. Real-time tracking of moving persons by exploiting spatio-temporal image slices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):797–808, 2000

[139] L. Torresani and C. Bregler. Space-time tracking. In Proceedings of the European Conference on Computer Vision, vol. I, pp. 801-812, 2002

[140] M. Brand. Morphable 3D models from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol II, pg. 456-633, 2001

[141] S. Avidan. Support vector tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 283–310, 2001.

[142] M. Pontil and A. Verri. Object recognition with support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:637–646, 1998.

[143] T. Schoepflin, V. Chalana, D.R. Haynor, K. Yongmin. Video object tracking with a sequential hierarchy of template deformations. IEEE Transactions on Circuits and Systems for Video Technology, 11(11), pages 1171-1182, 2001

[144] R. Yang and Z. Zhang. Model based head tracking with stereo vision. Proceeding of the IEEE International Conference on Automatic Face and Gesture Recognition, pages 112-117, 2002

[145] S. Malik, C. McDonald and G. Roth. Hand tracking for interactive pattern-based augmented reality. Proceedings of the International Symposium on Mixed and Augmented Reality, pages 117-126, 2002

[146] C.C. Chang and W.H. Tsai. Vision-based tracking and interpretation of human leg movement for virtual reality applications. IEEE Transactions on Circuits and Systems for Video Technology, 11(1), pages 9-24, 2001

Francesco Isgrò obtained his Laurea in Mathematics from Università di Palermo (Italy) in 1994, and his PhD in Computer Science from Heriot-Watt University (Scotland) in 2001. From 2000 to 2002 he was a Research Associate at the Department of Computing and Electrical Engineering, Heriot-Watt University. He is now a Research Associate at the Dipartimento di Informatica e Scienze dell'Informazione, Università di Genova, and a part-time lecturer at Università di Palermo. His current research interests are image-based rendering and its applications to videoconferencing, 3D analysis, and image registration.

Emanuele Trucco obtained his BSc (1984) and PhD (1990) degrees from the University of Genoa, both in Electronic Engineering. He is now a Reader in the School of Engineering and Physical Sciences at Heriot-Watt University. His current interests are in multiview stereo, motion analysis, image-based rendering, and applications to videoconferencing, medical image processing, and subsea robotics. Dr Trucco has managed grants in excess of 1M pounds from the EU, EPSRC, various foundations (e.g., the Royal Society, the British Council) and industry. He has served on the professional, technical and organising committees of several conferences (e.g., BMVC'96, SIRS'98, SPIE'99, IEEE CVPR'99, IEEE SMC 2003, IEEE-IAPR ICIAP'03). He is an Honorary Editor of the IEE Proceedings on Vision, Image and Signal Processing. He has published more than 100 refereed publications and co-authored a book widely adopted by the international community. Press reports include New Scientist, the Financial Times, and an invited participation in the BBC Tomorrow's World Roadshow 2002.

Peter Kauff received his Diploma degree in Electrical Engineering and Telecommunication from the Technical University of Aachen, Germany, in 1984. He is head of the "Immersive Media & 3D Video" Group in the Image Processing Department at the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut (FhG/HHI), Berlin, Germany. He has been with the Heinrich-Hertz-Institut since 1984, involved in numerous German and European projects related to digital HDTV signal processing and coding and interactive MPEG-4-based services, as well as in a number of projects related to advanced 3D video processing for immersive telepresence and immersive media. He has been engaged in several European research projects, such as EUREKA 95, the RACE project FLASH, the ACTS project HAMLET, COST211ter&quad, the ESPRIT project NEMESIS, the IST projects VIRTUE and ATTEST, and the Presence Working Group of the IST-FET Proactive Initiative. Mr. Kauff is a reviewer for several IEEE and IEE publications.

Oliver Schreer graduated in Electronics and Electrical Engineering at the Institute of Measurement and Automation of the Technical University of Berlin in 1993. In November 1999, he completed his PhD in Electrical Engineering at the Technical University of Berlin. From 1993 until 1998, he was a teaching assistant at the Institute of Measurement and Automation, in the Faculty of Electrical Engineering, Technical University of Berlin, responsible for lectures and practical courses in the field of image processing and pattern recognition. His research interests have included camera calibration, stereo image processing, 3D analysis, and navigation and collision avoidance of autonomous mobile robots. Since August 1998, he has been working as a project leader in the "Immersive Media & 3D Video" Group, Image Processing Department, at the Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut (FhG/HHI), Berlin, Germany. In this context he is engaged in research on 3D analysis, novel view synthesis, real-time video conferencing systems and immersive TV applications. He was the responsible person at FhG/HHI within the European FP5 IST project VIRTUE and leader of its "Real-time" work package. Since autumn 2001, he has been an Adjunct Professor at the Faculty of Electrical Engineering and Computer Science, Technical University of Berlin.

Dr. Schreer is a guest editor for the IEEE Transactions on Circuits and Systems for Video Technology. He is a reviewer for several IEEE and IEE journals.