Connecting users to virtual worlds within MPEG-V standardization


Post on 01-Jan-2017






<ul><li><p>Contents lists available at SciVerse ScienceDirect</p><p>Signal Processing: Image Communication</p><p>Signal Processing: Image Communication 28 (2013) 97–113</p><p>Connecting users to virtual worlds within MPEG-V standardization</p><p>Seungju Han, Jae-Joon Han*, James D.K. Kim, Changyeong Kim</p><p>Advanced Media Lab, Samsung Advanced Institute of Technology, Yongin, Republic of Korea</p><p>* Corresponding author. Tel.: 82 31 280 9443; fax: 82 31 280 1711. E-mail address: (J.-J. Han).</p><p>0923-5965/$ - see front matter &copy; 2012 Elsevier B.V. All rights reserved.</p><p>http://d</p><p>Article info</p><p>Available online 17 November 2012</p><p>Keywords: MPEG-V; Virtual World; 3D Manipulation; Gesture Recognition</p><p>Abstract</p><p>Virtual worlds such as Second Life and 3D internet/broadcasting services have become increasingly popular. A life-scale presentation of the virtual world and intuitive interaction between users and virtual worlds would provide a more natural and immersive experience for users. The emergence of novel interaction technologies, such as facial-expression/body-motion tracking and remote interaction for virtual object manipulation, could provide a strong connection between users in the real world and avatars in the virtual world. For virtual worlds to be widely accepted and used, the various types of novel interaction devices should share a unified interaction format between the real world and the virtual world. Thus, MPEG-V Media Context and Control (ISO/IEC 23005) standardizes such connecting information. 
This paper provides an overview of MPEG-V from the real world to the virtual world (R2V), together with usage examples, focusing on interfaces for controlling avatars and virtual objects in the virtual world with real world devices. In particular, we investigate how the MPEG-V framework can be applied to facial animation and hand-based 3D manipulation using an intelligent camera. In addition, to manipulate objects intuitively in a 3D virtual environment, we present two interaction techniques using motion sensors: a two-handed spatial 3D interaction approach and a gesture-based interaction approach.</p><p>&copy; 2012 Elsevier B.V. All rights reserved.</p><p>1. Introduction</p><p>How we will interact with computers in the future is an exciting question. Some interaction technologies are already in place and accepted as natural interaction methods. For example, Nintendo's Wii motion controller uses accelerometers so that users can control virtual objects with natural motions [1].</p><p>In particular, virtual worlds, which are persistent online computer-generated environments such as Second Life, World of Warcraft and Lineage, have a potential need for such novel interaction technology, since people can interact in them, either for work or play, in a manner comparable to the real world. A strong connection between the real world and the virtual world would provide an immersive experience to users. Such a connection can be provided by a large-scale display on which the objects in the virtual world are shown at real-world life scale, and by natural interaction that uses the facial expression and body motion of a user to control an avatar, which is the user's representation of himself/herself or an alter ego in the virtual world. Recently, Microsoft introduced Xbox Kinect, which senses the full-body motion of users with a 3D-sensing camera. With the sensed full-body motion, users can control a character in a game according to their own body movements. 
Even more precise facial expression and motion sensing technology is expected to be developed. Virtual world services that adopt such precise and natural interaction technology would provide various experiences, such as a virtual tour that lets users travel back in time to a virtual ancient Rome, or a simulated astrophysical space exploration in which users walk or fly through enormous space.</p></li><li><p>Fig. 1. An example of virtual world service system architecture (MPEG-V from the real world devices to the virtual world).</p><p>These virtual world services require a strong connection between the virtual and the real worlds so that both worlds react simultaneously to any change in the environment. Making the interfaces between them efficient, effective and intuitive is of crucial importance for their wide acceptance and use. A standardized interface between the real world and the virtual world is needed to unify interface formats and to provide interoperability among virtual worlds [2]. Fig. 1 illustrates the need for such a standardized interface, which enables various virtual world services. 
Virtual world service providers should be able to exchange interoperable metadata of virtual world objects with the console, while the console also needs to adapt the signal received from any real world input device to the virtual world object metadata and send the adapted signal to the virtual world.</p><p>MPEG-V (ISO/IEC 23005) provides such an architecture and specifies the associated information representations to enable interoperability between virtual worlds (e.g., digital content providers of a virtual world, gaming, simulation, DVD) and with real world devices (e.g., sensors and actuators) [3].</p><p>In this paper, we focus on one of the standardization areas of MPEG-V, real world to virtual world adaptation (R2V adaptation). Specifically, it covers control information, interaction information and virtual world object characteristics, which are essential ingredients for controlling virtual world objects with real world devices. Real world devices such as motion sensors and cameras capture the motions, postures and expressions of humans and reflect them in the virtual world implicitly. The paper presents a 6-DOF motion sensor, which estimates 3D position and 3D orientation, as well as an intelligent camera, which is capable of recognizing feature points of the face and hand postures/gestures.</p><p>In addition, the paper presents how the recognized output of such devices can be adapted to the virtual world by the R2V adaptation engine. 
Four instantiations are presented: two use an intelligent camera, for facial expression cloning and hand-based interaction, respectively; the other two use motion sensors, for 3D manipulation and virtual music conducting, respectively.</p><p>This paper is organized as follows. Section 2 reviews the system architecture of MPEG-V R2V and the metadata of MPEG-V R2V systems, i.e., control information, interaction information, and virtual world object characteristics. Section 3 presents the architecture of the motion sensor based interaction and its instantiated examples, i.e., how the motion sensor can be adapted for the two instantiations. Section 4 presents the architecture of the intelligent camera based interaction and how to adapt the received information to specific virtual worlds. Finally, Section 5 concludes the paper.</p><p>2. System architecture and metadata of MPEG-V R2V</p><p>The system architecture of the MPEG-V R2V framework is depicted in Fig. 2(a). It comprises an adaptation RV engine and three standardization parts: Control Information (MPEG-V part 2), Interaction Information (MPEG-V part 5), and Virtual World Object Characteristics (MPEG-V part 4). The individual elements of the architecture have the following specific functions.</p><p>Control Information describes the capabilities of real world devices such as sensors and input devices. It conveys intrinsic information such as the accuracy, resolution, and range of the values sensed by the real world devices.</p><p>Interaction Information specifies the syntax and semantics of the data formats for interaction devices, called Sensed Information, to provide a common format for input commands or sensor data from any interaction device in the real world connected to the virtual world. It aims to provide data formats for industry-ready interaction devices (sensors).</p></li><li><p>Fig. 2. Use scenario with MPEG-V R2V Framework. 
(a) System architecture of MPEG-V R2V. (b) Body motion tracking with a motion sensor and facial expression with an intelligent camera.</p><p>Virtual World Object Characteristics describes a set of metadata that characterizes a virtual world object, making it possible to migrate a virtual object from one virtual world to another and to control a virtual world object in a virtual world with real world devices.</p><p>MPEG-V R2V conveys interaction information and control information from interaction devices to a virtual world for the purpose of controlling one or more entities in the virtual world. In particular, consider controlling the body motion and facial expression of an avatar in the virtual world. The motion of the avatar can be generated either by pre-recorded animation clips or by direct manipulation using motion capturing devices. Fig. 2(b) shows an example of a facial expression and body tracking application with an intelligent camera. The intelligent camera detects and tracks feature points of both the face and the body, and then analyzes the time series of the detected feature points to recognize a body gesture and/or a facial expression.</p><p>The detected feature points provide the body motion and facial expression information of the user in the real world. To control the motion of the avatar using such information, the avatar should have similar feature points for rendering. In the simplest case, the sensed feature points of the user and the feature points of the avatar are identical. Therefore, the description of the avatar should provide feature points for both the body and the face of the avatar [4,5].</p><p>To support direct manipulation of the avatar, virtual world object characteristics contain an animation element and a control feature element in the avatar characteristics. 
The animation element contains a description of animation resources, and the control feature element contains a set of descriptions for the body control and facial control of an avatar.</p><p>To render real world effects efficiently in the virtual world, MPEG-V R2V also provides an architecture to capture and understand the current status of the real world environment. For example, the virtual world acquires the sensed temperature or light level of a room in the real world through the sensed information and renders the same effect in the virtual world.</p></li><li><p>The adaptation RV engine receives the sensed information and the description of the sensor capability, and then interprets and adjusts the sensed information appropriately based on the sensor capability. For example, the offset, one of the attributes in the sensor description, can be added to the sensor value to obtain the correct value. The SNR (signal-to-noise ratio), another attribute, indicates how far the data can be trusted given the noise. With these attributes, the sensed information can be understood more precisely.</p><p>The current syntax and semantics of control information, interaction information and virtual world object characteristics are specified in [6–8], respectively. However, this paper provides an EBNF (Extended Backus–Naur Form)-like overview of them because of the lack of space and the verbosity of XML [9].</p><p>A. 
Control Information for sensors and input devices</p><p>The Control Information Description Language (CIDL) is a description tool that provides the basic structure, in XML Schema, for instantiations of control information tools, including sensor capabilities.</p><p>1) Sensor Capability Description</p><p>SensorCapabilityBaseType provides a base abstract type for a subset of the types defined as part of the sensor device capability metadata types.</p><p>It contains an optional Accuracy element and a sensorCapabilityBaseAttributes attribute. Accuracy describes how close a measured quantity is to its actual value. sensorCapabilityBaseAttributes defines a group of attributes for the sensor capabilities.</p><p>sensorCapabilityBaseAttributes may have several optional attributes, defined as follows: unit describes the unit of the sensor's measured value; maxValue and minValue describe the maximum and minimum values that the sensor can perceive, respectively; offset describes the value to be added to a base value in order to obtain the correct value; numOfLevels describes the number of value levels that the sensor can perceive between the minimum and maximum values; sensitivity describes the minimum magnitude of input signal required to produce a specified output signal, in the given unit; SNR describes the ratio of the signal power to the noise power.</p><p>2) Sensor Capability Vocabulary</p><p>The Sensor Capability Vocabulary (SCV) defines a clear set of actual sensor capabilities to be used with the Sensor Capability Description in an extensible and flexible way. 
That is, it can easily be extended with new capability information, or by derivation from existing capabilities, thanks to the extensibility of XML Schema.</p><p>Currently, the standard defines the capabilities of the following sensors: light, ambient noise, temperature, humidity, distance, atmospheric pressure, position, velocity, acceleration, orientation, angular velocity, angular acceleration, force, torque, pressure, motion sensor and intelligent camera. The main capabilities of all the sensors except the intelligent camera include maxValue and minValue, which describe the maximum and minimum values that the sensor can perceive in terms of the specified unit, respectively. In addition, the location element, which describes the location of the sensor in the global coordinate system, is included in the light, ambient noise, temperature, humidity, distance, and atmospheric pressure sensors.</p><p>Two examples of specific sensor capabilities, the motion sensor capability and the intelligent camera capability, are provided, since the paper mainly concerns those two sensors for RV interaction. The motion sensor is an aggregated sensor type that contains sensed information such as position, velocity, acceleration, orientation, angular velocity, and angular acceleration. It contains the base type as well as the capabilities of all the sensed information in the sensor.</p></li><li><p>Finally, the intelligent camera capability contains the base type; a description of whether the camera can capture feature points on the body and/or face; a description of whether the camera can recognize facial expressions and/or body gestures; the maximum number of detectable feature points; and the locations of the feature points.</p><p>FeatureTrackingStatus describes whether the feature tracking...</p></li></ul>
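
The way an adaptation engine might use the sensor capability attributes described above (offset, minValue, maxValue, numOfLevels) can be illustrated with a minimal Python sketch. It is not taken from the paper or from the MPEG-V schema: the `SensorCapability` class, the `adapt_sensed_value` function, and the example thermometer values are all hypothetical. Only the idea of adding the offset to the sensed value comes from the text; clamping to the perceivable range and snapping to evenly spaced levels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SensorCapability:
    """Hypothetical stand-in for a few sensorCapabilityBaseAttributes."""
    unit: str
    min_value: float
    max_value: float
    offset: float = 0.0
    num_of_levels: Optional[int] = None


def adapt_sensed_value(raw: float, cap: SensorCapability) -> float:
    """Adjust a raw sensed value using the declared sensor capability."""
    # Add the offset to obtain the corrected value (as described in the text).
    value = raw + cap.offset
    # Assumption: values outside the perceivable range are clamped to it.
    value = max(cap.min_value, min(cap.max_value, value))
    # Assumption: the sensor resolves num_of_levels evenly spaced levels
    # between min_value and max_value, so snap to the nearest level.
    if cap.num_of_levels is not None and cap.num_of_levels > 1:
        step = (cap.max_value - cap.min_value) / (cap.num_of_levels - 1)
        value = cap.min_value + round((value - cap.min_value) / step) * step
    return value


# Hypothetical temperature sensor: -10..50 degrees, 0.5 degree resolution.
thermometer = SensorCapability(
    unit="celsius", min_value=-10.0, max_value=50.0,
    offset=0.5, num_of_levels=121,
)
corrected = adapt_sensed_value(21.3, thermometer)
out_of_range = adapt_sensed_value(80.0, thermometer)
```

In this sketch the capability description travels separately from the sensed value, which mirrors the split between Control Information and Interaction Information; a real engine would read both from their XML descriptions rather than from hand-built objects.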

