
Real Time 3D Avatar for Interactive Mixed Reality

Sang-Yup Lee*,†, Ig-Jae Kim*, Sang C. Ahn*, Heedong Ko*, Myo-Taeg Lim†, Hyoung-Gon Kim*
* Imaging Media Research Center, Korea Institute of Science and Technology

† School of Electrical Engineering, Korea University

Abstract

This paper presents the real-time reconstruction of a dynamic 3D avatar for interactive mixed reality. In computer graphics, one of the main goals is the combination of virtual scenes with real-world scenes. However, the views of real-world objects are often restricted to the views from the cameras. True navigation through such mixed reality scenes becomes impossible unless the components from real objects can be rendered from arbitrary viewpoints. Additionally, adding a real-world object to a virtual scene requires some depth information as well, in order to handle interaction. The proposed algorithm introduces an approach to generate 3D video avatars and to augment them naturally into a 3D virtual environment using the calibrated camera parameters and silhouette information. As a result, we can create photo-realistic live avatars from natural scenes, and the resulting 3D live avatar can guide and interact with participants in the VR space.

CR Categories: I.3.2 [Computer Graphics]: Picture/Image Generation - Viewing Algorithms

Keywords: image based rendering, 3D reconstruction, IR keying, interaction, mixed reality

1 Introduction

Over the last few years, various research activities on avatar generation have been reported. At first, CG-based avatars were proposed, but they have been used only in limited applications because of their lack of realism. To overcome this weakness, image-based avatars were developed, in which the texture of the avatar is segmented from the background image and then augmented into the virtual world. However, augmenting 2D video avatars into a virtual world looks unnatural, since 2D video avatars carry no 3D information. As a result, it is difficult to allow 2D video avatars to interact with real or virtual objects. To relieve these problems, 3D video avatars have been developed. The problem of reconstructing 3D objects from multiple images has been investigated for decades. Some methods, such as Depth from Stereo (DfS) and Shape from Silhouette (SfS), can be fully automated and are promising candidates for online 3D reconstruction systems.

e-mail: [email protected]

The SfS method has become quite popular recently because its result is robust compared to the noisy output of DfS. More importantly, a consistent 3D model can be obtained directly using SfS, while in the stereo case one must post-process the depth results from different image pairs in order to obtain a consistent 3D model. Recently, the SfS approach has been used successfully in real-time systems [Markus][Matusik00][Lok]. The reconstruction result of this approach is the visual hull [Laurentini], an approximate shell that envelopes the true geometry of the object. The visual hull allows real-time reconstruction and rendering, yet some improvements are still possible to obtain faster and better rendering results. However, most of these systems use background subtraction to extract the silhouette information, assuming that the background is static. Moreover, the silhouette used in the reconstruction process is approximated for fast rendering speed. The proposed algorithm introduces an approach to generate 3D video avatars and to augment them naturally into a 3D virtual environment using exact silhouette information. As a result, we can create photo-realistic live avatars from natural scenes that interact with virtual objects in mixed reality space.

2 Related Work

Several approaches have been proposed to insert a live actor into a virtual scene at interactive rates. 3D information can be retrieved using stereo cameras, which generate range images. This technique is able to reconstruct concave regions, and the Virtualized Reality system shows that it can work on large dynamic objects. Matusik computes a visual hull from multiple camera views using epipolar geometry and generates a 3D textured model (each camera retrieving a colored silhouette). Geometry information can also be retrieved by voxel coloring, which consists of testing the color consistency of voxels on the surface of the geometry among all the points of view. Another approach is voxel carving: a volume is discretized into voxels, which can be used as a step to fit a skeleton or to analyze human movements by fitting ellipsoids. An interesting application by Lok is to combine advanced interactions within the virtual world. These methods have some drawbacks, such as concave regions that are difficult to capture or the number of views that is needed, but combining them as in Li can reduce these problems and can improve the speed as well as the quality of the estimated geometry. All of these methods work at interactive rates, but not at 30 Hz on full VGA resolution images with standard cameras.

3 3D Avatar Generation

The reconstruction algorithm has three steps: image processing to label object pixels, calculating the volume intersection, and rendering the visual hull. Fixed cameras are positioned with each camera's frustum completely containing the volume to be reconstructed. Because the results of image segmentation are sensitive to shadows and lighting conditions, we use IR (infrared) keying to extract pixels of interest from the background.

The consequences of image segmentation errors are an enlarged visual hull for incorrectly labeled object pixels, and holes in the visual hull for incorrectly labeled background pixels. Applying the IR keying algorithm improves the accuracy and decreases the noise in the reconstructions.

3.1 Camera Calibration

Calibration is required to recover the positions, orientations, and internal parameters of the cameras involved in the reconstruction process. The novel view generation algorithm requires projection of a given point in 3D space into each of the cameras. Calibration data is gathered by presenting a large checkerboard to all of the cameras. The Camera Calibration Toolbox for Matlab® was used to detect all the corners on the checkerboard, and to calculate both a set of intrinsic parameters for each camera and a set of extrinsic parameters relative to the checkerboard's coordinate system. Currently, we plan to use 6 more cameras in the 3D reconstruction system. Note that it is difficult to position the checkerboard efficiently in this case, so we implement the following calibration method described in [Prince]. Where two cameras detect the checkerboard in the same frame, the relative transformation between the two cameras can be calculated. By chaining these estimated transforms together across frames, the transform from any camera to any other camera can be derived.

Figure 1. Camera Calibration Results
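As a rough illustration of the transform chaining described above, the following sketch composes pairwise camera transforms with homogeneous 4x4 matrices. It is a minimal sketch, assuming each camera's extrinsics are available as a 4x4 matrix mapping checkerboard coordinates to camera coordinates; the function names are ours and are not part of the Matlab toolbox.

```python
import numpy as np

def relative_transform(T_board_to_a, T_board_to_b):
    """Relative transform from camera b to camera a, given each camera's
    4x4 extrinsic matrix (checkerboard coordinates -> camera coordinates)."""
    # X_a = T_a X_board and X_b = T_b X_board, so X_a = T_a inv(T_b) X_b.
    return T_board_to_a @ np.linalg.inv(T_board_to_b)

def chain(transforms):
    """Compose a list of pairwise transforms (cam_k -> cam_{k-1}, ...)
    into a single transform from the last camera to the first one."""
    T = np.eye(4)
    for T_step in transforms:
        T = T @ T_step
    return T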

3.2 Silhouette Extraction

In general, SfS approaches assume that the background is static and can be learned as a priori information. They use the statistical texture properties of the background, observed over an extended period of time, to construct a model of the background, and use this model to decide which pixels in an input image do not fall into the background class. However, it is difficult to apply background subtraction when the background is not static, especially when it is a display screen in a CAVE-like immersive environment. In order to handle a dynamic background, the method in [Jaynes] does not capture an initial static reference image for background subtraction. For each frame projected by the projectors, the system needs to predict what the camera will observe when the screen is not occluded. To help with this prediction, a color calibration procedure estimates the color transfer function between each projector and the camera. This is done by projecting uniform color images of increasing intensity of red, green, or blue from a projector, and capturing them with the camera. For each frame in the projectors' frame buffers, an image is predicted for the camera. This predicted image is color-corrected using the color transfer functions obtained in the color calibration step. Then, the image actually observed by the camera is subtracted pixel-wise from the predicted image, and the result is used to update an alpha image. The limitation of this system, however, is that it is too slow to run in real time. Moreover, the results are not robust to lighting, camera, and projector conditions. In this paper, we present a fast and robust method for extracting silhouette images. Infrared (IR) light is used in our system to extract silhouette information. An important property of our IR keying system is that the intensity of IR reflected from most objects, including human skin and clothes, is lower than that reflected from the retroreflective screen. The following figure shows our IR keying system.

Figure 2. IR keying System
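A minimal sketch of the thresholding behind the binary silhouette shown in Figure 3(b), assuming the IR camera frame arrives as a single-channel 8-bit image and using OpenCV; the threshold value and the morphological cleanup are illustrative choices, not the ones tuned for the actual system.

```python
import cv2
import numpy as np

def extract_silhouette(ir_frame, threshold=128):
    """Label as 'object' every pixel whose IR intensity falls below the
    retroreflective-screen level (bright background, darker foreground)."""
    # Pixels darker than the threshold are foreground (the person),
    # because the retroreflective screen returns much more IR light.
    _, silhouette = cv2.threshold(ir_frame, threshold, 255,
                                  cv2.THRESH_BINARY_INV)
    # Optional: remove small speckle noise with a morphological opening.
    kernel = np.ones((3, 3), np.uint8)
    return cv2.morphologyEx(silhouette, cv2.MORPH_OPEN, kernel)
```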

Figure 3. IR keying results: (a) the captured image from the IR camera; (b) the thresholded binary image.

3.3 3D Reconstruction

Performing the intersection tests between the projection volumes is too expensive even for a modest setup. Numerical stability and robustness problems aside, scalability becomes an issue as the number of intersection tests grows rapidly with the number of cameras. We use the computational power of graphics hardware to compute a view-specific, pixel-accurate volume intersection in real time. In the naive volume intersection method, the 3D space is partitioned into small voxels, and each voxel is mapped onto each image plane to examine whether or not it is included in the object silhouette. Because of this computational complexity, we use the plane-based volume intersection algorithm instead. The proposed algorithm consists of the following processes:

a. Partition the 3D space into a set of equally spaced parallel planes with respect to the boundary region of the real objects.
b. Project the object silhouette observed by each camera onto the planes.
c. Compute the intersection of all silhouettes projected on each plane.

The processing stages of our 3D reconstruction are presented in Figure 4; a simplified CPU sketch of the per-plane intersection is given below.
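For illustration, the following CPU sketch replicates steps a-c with NumPy and OpenCV. It assumes that, for each camera, a 3x3 homography mapping its image coordinates onto the current sweep plane has already been derived from the calibration data; the real system performs the equivalent per-pixel counting on the GPU with projective textures and the stencil buffer, as described below.

```python
import numpy as np
import cv2

def plane_sweep_intersection(silhouettes, homographies, plane_size):
    """Intersect all camera silhouettes on one sweep plane.

    silhouettes  : list of binary masks (uint8, 0 or 255), one per camera.
    homographies : list of 3x3 matrices mapping camera-image coordinates
                   to the plane's pixel grid (one per camera).
    plane_size   : (width, height) of the plane's pixel grid.
    Returns a binary occupancy map of the plane: a pixel is occupied only
    if every camera projects an object pixel onto it (the counting done
    by the stencil buffer on the GPU, replicated here on the CPU)."""
    count = np.zeros((plane_size[1], plane_size[0]), np.uint16)
    for sil, H in zip(silhouettes, homographies):
        warped = cv2.warpPerspective(sil, H, plane_size)
        count += (warped > 0).astype(np.uint16)
    return (count == len(silhouettes)).astype(np.uint8) * 255

# One cross-section of the visual hull is obtained per plane; sweeping the
# plane through the working volume yields the full reconstruction.
```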

Figure 4. 3D reconstruction algorithm using the plane sweep method

We use projective textures and the stencil buffer to perform the intersection tests, and compute only the elements that could be contained in the final image. To determine whether a 3D point is in the visual hull, we render it n times, where n is the number of cameras. When rendering the point for the i-th time, the texture matrix stack is set to the projection-modelview matrix defined by camera i's extrinsic parameters. This generates texture coordinates that are a perspective projection of image coordinates from the camera's location. The camera's image is converted into a texture with the alpha of object pixels set to 1 and of background pixels set to 0. An alpha test that renders only texels with alpha = 1 is enabled. If all n cameras project an object pixel onto the point in 3D space, then the point is within the visual hull. The system uses the stencil buffer to accumulate the number of cameras that project an object pixel onto the point: the pixel's stencil value is incremented for each such camera. Once all n textures are projected, we change the stencil test and redraw the point. If the pixel's stencil value is less than n, there exists at least one camera for which the point did not project onto an object pixel, so the point is not within the visual hull, and the stencil and color elements for that pixel are cleared. If the pixel's stencil value equals n, the point is within the visual hull. To improve the speed of the reconstruction algorithm, the boundary information of the real objects is used to reduce the fill rate of the algorithm.

3.4 Texture Mapping

To achieve realistic rendering results, we use projective texture mapping, a method described by [Segal] that allows the texture image to be projected onto the scene as if by a slide projector. However, the current hardware implementation of projective texture mapping in OpenGL lets the texture pass through the geometry and be mapped onto all back-facing and occluded polygons along the path of the ray. Thus it is necessary to perform visibility pre-processing so that only polygons visible to a particular camera are texture-mapped with the corresponding image. Recently, [Sawhney] implemented a shadow-map-based visibility algorithm employing graphics hardware; [Ming] refers to this algorithm as opaque projective texture mapping. The basic opaque projective texture mapping algorithm can be implemented using multiple texture units. For each reference view, the RGBA color image is loaded into the first texture unit. The shadow map is generated as follows: the scene is rendered from the light's point of view (the reference view), and the light's depth buffer is used as a texture (the shadow map). The generated shadow map is loaded into the second texture unit. However, these methods require rendering the scene from each input camera viewpoint and reading back the z-buffer or the frame buffer. We present an alternative algorithm that computes the visible parts of the visual hull surface from each of the input cameras. The algorithm has the advantage that it is simple and can be computed in pre-processing.

Figure 5. The relation between R(j) and ej

Let us assume that we want to determine whether a 3D ray of the visual hull that lies on an edge in silhouette image R(i) is visible from image R(j). We observe that this ray must be visible from camera R(j) if the point s is visible from the epipole ej (the projection of the center of R(j) onto the image plane of R(i)). This effectively reduces the 3D visibility computation to a 2D visibility computation. We then use the alpha channel to label which parts of the edges are visible from which reference image. To compute color using the reference images, we employ view-dependent texture mapping using projective textures. Due to errors in the constructed visual hull, the textures of neighboring pixels are often extracted from images captured by different cameras, which introduces jitter in the rendered images. Moreover, since there are only a small number of reference images, there are few color samples for each point on the visual hull surface. For these reasons, we implement view-dependent texture mapping for rendering. The criterion used to choose the blending weight is the angular deviation between the viewing direction of the reference view and that of the desired view: a smaller angle gets a higher weight for high-fidelity texture mapping.
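The epipole ej used in this reduction can be computed directly from the calibration data. The sketch below is a minimal illustration, assuming the world-to-camera convention x_cam = R x_world + t; the helper name is ours.

```python
import numpy as np

def epipole(K_i, R_i, t_i, R_j, t_j):
    """Epipole e_j in image i: the projection of camera j's center onto
    the image plane of camera i (used to reduce the visibility test to 2D).

    K_i      : 3x3 intrinsic matrix of camera i.
    R_i, t_i : rotation (3x3) and translation (3,) of camera i.
    R_j, t_j : same for camera j."""
    # Camera j's optical center in world coordinates.
    C_j = -R_j.T @ t_j
    # Project it with camera i's projection matrix P_i = K_i [R_i | t_i].
    p = K_i @ (R_i @ C_j + t_i)
    return p[:2] / p[2]          # pixel coordinates of the epipole
```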

Figure 6. Visibility test results: red and blue lines represent the visible silhouette edges with respect to different epipoles.
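A minimal sketch of the angle-based blending weights described above. The paper does not specify the exact weighting function, so the clamped-cosine normalization here is an assumption, as are the function and parameter names.

```python
import numpy as np

def view_dependent_weights(desired_dir, reference_dirs):
    """Blending weights for view-dependent texture mapping.

    desired_dir    : unit viewing direction of the desired (virtual) view.
    reference_dirs : unit viewing directions of the reference cameras.
    A smaller angular deviation from the desired view yields a larger
    weight; here the clamped cosine of the angle is renormalized to one."""
    cosines = np.array([max(np.dot(desired_dir, d), 0.0)
                        for d in reference_dirs])
    if cosines.sum() == 0.0:
        return np.full(len(reference_dirs), 1.0 / len(reference_dirs))
    return cosines / cosines.sum()
```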

4 Interaction in Mixed Reality

The 3D reconstruction results can be used not only for rendering the desired scene in a virtual world but also for volumetric effects and interaction. In order to integrate the avatar into a virtual scene in a realistic way, we apply shadows. These shadows help viewers understand relative object positions in the virtual scene and increase immersion. We employ a projective shadow approach, which is fast and does not depend on geometry complexity. Shadow textures in the scene are precalculated from the relation between the reference cameras and the light. This is done by rendering a black-and-white shadow texture representing the avatar as seen from the light source and by projecting this image onto the scene geometry along the light direction. Another advantage of having 3D information is that we can detect collisions between the generated 3D avatar and the virtual environment. Therefore, complex interaction with virtual environments is enabled by the generated volumetric data (a minimal occupancy-based sketch is given below). An interesting application by [Lok] is to combine advanced interactions within the virtual world. Real-time 3D avatars of the user enable more natural interactions with elements in virtual environments. The new system allows virtual environments to be truly dynamic and enables completely new ways of user interaction.

5 Results

The flat-shaded visual hull is useful for visualizing which surfaces of the visual hull come from which silhouette cone, which provides a good way to test the visual hull reconstruction. In order to render a flat-shaded visual hull, we load the silhouette images as intensity textures into the texture units on the graphics card. In Figure 8, different colors - red, green, and blue - represent the different 3D cones generated by back-projection of the silhouette images. In this case, the visual hull is reconstructed and rendered from three reference images.
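As an illustration of the collision detection mentioned in Section 4, the sketch below tests whether any occupied voxel of the reconstructed avatar volume lies inside a virtual object's axis-aligned bounding box. The voxel-grid representation and all names are assumptions made for this sketch; the actual system operates on the plane-sweep cross-sections of Section 3.3.

```python
import numpy as np

def avatar_collides_with_box(occupancy, origin, voxel_size, box_min, box_max):
    """Check whether any occupied voxel of the avatar volume falls inside
    a virtual object's axis-aligned bounding box.

    occupancy  : 3D boolean array (the reconstructed avatar volume).
    origin     : world coordinates of voxel (0, 0, 0).
    voxel_size : edge length of one voxel (cubic voxels assumed).
    box_min, box_max : opposite corners of the object's bounding box."""
    idx = np.argwhere(occupancy)                  # indices of occupied voxels
    if idx.size == 0:
        return False
    centers = origin + (idx + 0.5) * voxel_size   # voxel centers in world space
    inside = np.all((centers >= box_min) & (centers <= box_max), axis=1)
    return bool(inside.any())
```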

Figure 7. Reference input images: (a) different colors are assigned to the input textures; (b) real color input textures are used.

Figure 8. View-dependent blending results (a)

Figure 9. View-dependent blending results (b)

Figure 10 shows the images reconstructed by the proposed algorithm when three video cameras are used. In this figure, the white wire frames represent the reference cameras. In our experiment, we use three IEEE 1394 cameras and a single Pentium 4 computer with a GeForce4 graphics card. The desired view direction is controlled with the mouse.

Figure 10. 3D reconstructed images using reference cameras

5.1 Heritage Alive!

We applied our 3D reconstruction algorithm to generate a 3D live avatar for a virtual heritage tour. Virtual heritage technology has become increasingly important in the preservation of cultural history and provides us with virtual exploration of heritage sites through immersive and interactive VR environments. In virtual heritage applications, the virtual tour guide gives information on the cultural site. The guide leads visitors and has both the technical skills to control the scenario and the educational skills to explain the history of the sites. Through the guide's avatar and streaming network audio connections, the guide can then interact with the visitors, pointing out special details and answering questions. Our scenario, named "Heritage Alive", enables interactive group exploration of cultural heritage in a tangible space. The three main players in Heritage Alive are learners who want to experience and learn about cultural heritage, content providers who generate virtual objects of cultural heritage through Tangible Agent technology, and education providers who guide the learning experience in Heritage Alive. A guide who is far away should be able to intervene naturally in the learners' environment. To provide both interactivity and visual realism, a 3D avatar using image-based rendering is applied in our scenario. A tour guide at the guide space can fully control the scenario and the learners' activity using a guide controller that communicates control commands to the graphic workstation. A view of the guide is encoded as MPEG video and streamed to the graphic workstation, which combines the video image in the virtual environment context. The guide also sees the learner space and the learners' responses using web-cams. Figure 11 shows our system configuration for Heritage Alive!

Figure 11. System Configuration

The learner space, located at the KIST Reality Studio, includes a spherical display system, a motion simulator, and a 3D sound system. The graphic workstation performs visual rendering for the stereographic multi-channel display and communicates with external modules such as the device, sound, simulation, and streaming servers in order to coordinate the various functionalities required by the target scenario. Summarizing the roles of the external modules: the device server takes charge of peripheral devices such as the motion simulator and input devices, while the sound server plays background music or generates 3-dimensional sound effects according to the scenario and the requests of the graphic workstation.

Figure 12. Processing flow of 3D reconstruction in Heritage Alive

Figure 13. Input images captured from different view positions

Figure 14. 3D reconstructed images with respect to the virtual camera direction

Figure 15. Augmented live avatar using billboarding

6 Conclusion

We presented an algorithm that generates a real-time, view-dependent sampling of the visual hull. Our system allows virtual environments to be truly dynamic and enables interaction between real and synthetic objects. The resulting reconstructions are used to generate a realistic 3D guide avatar, and complex interaction with virtual environments is enabled by the generated volumetric data. Moreover, we can generate complex scenes using sets of images or video streams stored in a database. To minimize any visual discrepancy between real and synthetic objects, shadows can be calculated by reconstructing the scene from the viewpoint of the light, and the resulting image can be used as a shadow texture on the scene geometry. In this way, real objects affect the synthetic scene. The proposed system promises to move virtual worlds one step closer to our life by enabling real-time 3D reconstruction of the user and other real objects in an immersive mixed environment.

References

DEBEVEC, P. E., BORSHUKOV, G., AND YU, Y. 1998. Efficient view-dependent image-based rendering with projective texture-mapping. In 9th Eurographics Rendering Workshop, 105-116.

MARKUS, G., STEPHAN, W., MARTIN, N., EDOUARD, L., CHRISTIAN, S., ANDREAS, K., ESTHER, K., TOMAS, S., LUC, V. G., SILKE, L., KAI, S., ANDREW, V. M., AND OLIVER, S. 2003. blue-c: A Spatially Immersive Display and 3D Video Portal for Telepresence. In Proceedings of ACM SIGGRAPH 2003.

MATSUYAMA, T., WU, X., TAKAI, T., AND NOBUHARA, S. 2003. Real-Time Generation and High Fidelity Visualization of 3D Video. MIRAGE 2003, 1-10.

MATUSIK, W., BUEHLER, C., RASKAR, R., GORTLER, S., AND MCMILLAN, L. 2000. Image-Based Visual Hulls. In Proceedings of ACM SIGGRAPH 2000, 369-374.

MATUSIK, W., BUEHLER, C., AND MCMILLAN, L. 2001. Polyhedral Visual Hulls for Real-Time Rendering. In Proceedings of the 12th Eurographics Workshop on Rendering, 115-125.

LI, M., MAGNOR, M., AND SEIDEL, H. 2003. Online Accelerated Rendering of Visual Hulls in Real Scenes. Journal of WSCG 2003, 11(2): 290-297.

LAURENTINI, A. 1994. The Visual Hull Concept for Silhouette-Based Image Understanding. IEEE Trans. Pattern Anal. Machine Intell., 16(2): 150-162.

LOK, B. 2001. Online Model Reconstruction for Interactive Virtual Environments. In Proceedings of the 2001 Symposium on Interactive 3D Graphics, 69-72.

SAWHNEY, H., ARPA, A., KUMAR, R., SAMARASEKERA, S., AGGARWAL, M., HSU, S., NISTER, D., AND HANNA, K. 2002. Video Flashlights: Real Time Rendering of Multiple Videos for Immersive Model Visualization. In Proceedings of the 13th Eurographics Workshop on Rendering.

SEGAL, M., KOROBKIN, C., WIDENFELT, R., FORAN, J., AND HAEBERLI, P. 1992. Fast Shadows and Lighting Effects Using Texture Mapping. In Proceedings of ACM SIGGRAPH 1992, 249-252.

SLABAUGH, G., SCHAFER, R., AND HANS, M. 2002. Image-Based Photo Hulls. In Proceedings of 3D Data Processing, Visualization, and Transmission.

PRINCE, S. J. D., WILLIAMSON, T., JOHNSON, N., CHEOK, A. D., FARBIZ, F., BILLINGHURST, M., AND KATO, H. 2002. 3D Live: Real Time Interaction for Mixed Reality. In Proceedings of Computer Supported Collaborative Work.

YANG, R., WELCH, G., AND BISHOP, G. 2002. Real-Time Consensus-Based Scene Reconstruction Using Commodity Graphics Hardware. In Proceedings of Pacific Graphics 2002.

WADA, T., WU, X., TOKAI, S., AND MATSUYAMA, T. 2002. Homography Based Parallel Volume Intersection: Toward Real-Time Volume Reconstruction Using Active Cameras. In Proceedings of Computer Architectures for Machine Perception, 331-339.