

Using Multimodal Interaction to Navigate in Arbitrary Virtual VRML Worlds

Frank Althoff, Gregor McGlaun, Björn Schuller, Peter Morguet and Manfred Lang
Institute for Human-Machine-Communication

Technical University of Munich
Arcisstr. 16, 80290 Munich, Germany

{althoff,mcglaun,schuller,morguet,lang}@ei.tum.de

ABSTRACT
In this paper we present a multimodal interface for navigating in arbitrary virtual VRML worlds. Conventional haptic devices like keyboard, mouse, joystick and touchscreen can freely be combined with special Virtual-Reality hardware like spacemouse, data glove and position tracker. As a key feature, the system additionally provides intuitive input by command and natural speech utterances as well as dynamic head and hand gestures. The communication of the interface components is based on the abstract formalism of a context-free grammar, allowing the representation of device-independent information. Taking into account the current system context, user interactions are combined in a semantic unification process and mapped onto a model of the viewer's functionality vocabulary. To integrate the continuous multimodal information stream we use a straightforward rule-based approach and a new technique based on evolutionary algorithms. Our navigation interface has been extensively evaluated in usability studies, obtaining excellent results.

1. INTRODUCTION
The study of user interfaces (UIs) has long drawn significant attention as an independent research field, especially as a fundamental problem of current computer systems is that, due to their growing functionality, they are often complex to handle. In the course of time the requirements concerning the usability of UIs have considerably increased [15], leading to various generations of UIs, from purely hardware-based and simple batch systems over line-oriented and full-screen interfaces to today's widely spread graphical UIs. Being principally restricted to haptic interaction by mouse and keyboard, the majority of available application programs requires extensive learning periods and a high degree of adaptation by the user. More advanced interaction styles, such as the use of speech and gesture recognition as well as combinations of various input modalities, can thus far only be seen in dedicated, mostly research-motivated applications.


However, especially for average computer users, systems whose handling can be learned in a short time and that can be operated quickly, easily, effectively and, above all, intuitively would be particularly desirable. Therefore, current UI research endeavors aim at establishing computers as a common tool for a preferably large group of potential users.

1.1 Motivating Multimodal VR Interfaces
Human beings are able to process several interfering perceptions at a high level of abstraction so that they can meet the demands of the prevailing situation. Most of today's technical systems are still incapable of emulating this ability, although the information processing of a human being works at a plainly lower throughput than can be reached in modern network architectures. Therefore many researchers propose to apply multimodal system interfaces, as they provide the user with more naturalness, expressive power and flexibility [16]. Moreover, multimodal operating systems work more steadily than unimodal ones because they integrate redundant information shared between the individual input modalities. Furthermore, users have the possibility to freely choose among multiple interaction styles and can thereby follow the individual style of man-machine interaction that is most effective for them, in contrast to any pregiven interface functionality.

Virtual-Reality (VR) systems currently represent the top-level step in the development of man-machine communication. Being highly immersive and interactive and thus imposing enormous constraints on hard- and software, VR has evolved into a cutting-edge technology, integrating information, telecommunication and entertainment issues [3]. Since multimodal data handling provides the technical basis for coordinating the various elements of a VR system, research in multimodal systems and VR technology is often combined. With regard to the analysis of human factors, a fundamental task in designing VR interfaces consists in solving the problem of orientation, navigation and manipulation in three-dimensional (3D) space. In this way our work contributes to multimodal virtual reality research by providing an easy-to-use interface for navigating in arbitrary virtual VRML worlds. The user can navigate by freely combining keyboard, mouse, joystick and touchscreen with special VR hardware like spacemouse, data glove and position tracker as well as semantically higher-level and more intuitive input modalities like natural speech and dynamic head and hand gestures [7]. An impression of the multimedia working environment can be taken from the photo shown in figure 1.

From: Proc. of Workshop on Perceptive User Interfaces (PUI 2001), Association for Computing Machinery

Figure 1: Multimedia working environment

1.2 Related Work
Engelmeier et al. [5] introduced a system for the visualization of 3D anatomical data derived from Magnetic Resonance Imaging (MRI) or Computed Tomography (CT), enabling the physician to navigate through a patient's 3D scans in a virtual environment. They presented an easy-to-use multimodal human-machine interface combining natural speech input, eye tracking and glove-based hand gesture recognition. Using the implemented interaction modalities, they found that speed and efficiency of the diagnosis could be considerably improved.

In the course of the SGIM project, Latoschik et al. [10] developed an interface for interacting with multimedia systems by evaluating both the user's speech and gestural input. The approach is motivated by the idea of overcoming the physical limitations of common computer displays and input devices. Large-screen displays (wall projections, workbenches, caves) enable the user to communicate with the machine in an easier and more intuitive way.

Pavlovic [18] demonstrated the benefits of a multimodal user interface in a militarily motivated planning domain. He especially focused on combining the modalities speech and gesture in a complementary way, achieving a more natural form of communication without the need to deal with special VR hardware. As in many other systems for integrating multimodal information contents, a frame-based fusion technique [22] is used, coupled with an adapted problem-specific state machine. The three projects mentioned above represent typical examples of the tendency towards applying multimodal strategies in VR interfaces.

2. NAVIGATION INTERFACE
Our navigation interface is mainly based on the functionality of common VRML browsers fulfilling the VRML97 specification standard [9]. Although technically possible, in our system we do not apply a third-person view, i.e. we explicitly do not include an avatar in the scene. Thus the user of the VR interface only experiences what the virtual avatar would see; he feels directly set into the virtual scenario. The user can freely use the continuous spectrum of both translational and rotational movements. Changing the current location and orientation is equal to moving a virtual camera through the scene. In VRML, objects and interactions are defined with reference to a fixed world coordinate system (wcs). Navigating in this world is equal to transforming an avatar coordinate system within this system.

Figure 2: Screenshot of the navigation interface

For example, looking to the right is mapped onto a rotation of the camera around the vertical axis of the viewing plane.
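As an illustration of this mapping, the following minimal Python sketch (not part of the original system) applies an incremental yaw rotation to a camera direction expressed in the world coordinate system; the function names and the step size are hypothetical.

import numpy as np

def yaw_rotation(angle_rad: float) -> np.ndarray:
    """Rotation matrix about the vertical (y) axis of the viewing plane."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[ c, 0.0,  s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0,  c]])

def look_right(direction: np.ndarray, step_rad: float = np.radians(5)) -> np.ndarray:
    """Map the 'look right' command onto a camera rotation (hypothetical step size)."""
    return yaw_rotation(-step_rad) @ direction

# usage: rotate a camera that initially looks down the negative z axis
camera_dir = np.array([0.0, 0.0, -1.0])
camera_dir = look_right(camera_dir)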

Handling the display of the current VRML scene context, the browser component represents the major visual frontend of the navigation interface. To cope with different target platforms, we have integrated both a VRML viewer running under the Linux operating system and another viewer running under Microsoft Windows. These two systems and the audio-visual feedback component are briefly described in the following. A screenshot of the overall navigation interface using FreeWRL is shown in figure 2.

2.1 FreeWRL viewer frontend
Our first viewer component is based on FreeWRL [21]. Besides the three navigation modes WALK, FLY and EXAMINE, the browser additionally provides functions for basic interaction with objects in the virtual scene, moving to predefined viewpoints, changing the display quality (light, shading, rendering, etc.) as well as basic forms of managing multiuser interaction. In the context of a multimodal navigation interface, we slightly changed and extended the original FreeWRL functionality. Among these changes, we introduced discrete incremental movements that are split into rotational and translational parts. This was needed as continuous navigation by speech is hard to realize due to missing direct feedback. Furthermore, we introduced a step-size mechanism regulating the amount of change, as well as a repeat and an n-stage undo function.

2.2 Blaxxun Contact viewer frontend
The second viewer frontend, Blaxxun Contact [8], is a multimedia communication client supporting numerous business and entertainment applications on the internet. Compatible with standard WWW browsers, the viewer can simply be used as a plugin and can also be integrated into other customizable applications. The display module directly supports 3D hardware-accelerated graphics chipsets and thus facilitates fluent movements in the virtual scenarios, suggesting a high level of immersion. In contrast to FreeWRL, the Blaxxun viewer fully implements the VRML97 standard, among others providing gravitation and collision detection. Additional non-standard features include the display of an avatar, effective display of text messages, separately rendered scenes and special effects like fire, snowfall and particle systems. The technology for interfacing both browser modules to the individual input devices of our multimodal interface is described later in the text in the context of multimodal data handling.

2.3 Audiovisual Feedback
Arranged directly beneath the viewer module, our interface provides a fully integrated feedback component presenting feedback and status information in both visual and acoustical form. The feedback window contains text fields as well as self-explanatory graphical icons informing the user about the current navigation mode, the history of the last system interactions, and the status of the multimodal integration process. Changing with reference to the current status of the interface, the feedback window is continuously updated. Additional acoustical feedback to each user interaction is given over the integrated loudspeaker system in the form of mode-specific audio signals. The various parameters of the system feedback, e.g. the level of information and status messages, can be modified online according to individual user needs and preferences.

3. INPUT DEVICES
To navigate in arbitrary virtual VRML worlds, our system provides various input devices, which can be classified into haptic modules, special VR hardware modules as well as speech recognition and gesture recognition modules. If technically possible, the individual input devices support the full range of possible browser functionalities, i.e. they are not restricted to device-specific interaction forms in general. The individual devices are briefly described in the following.

3.1 Haptic Modules
As shown in figure 3, the various haptic modules are classified into discrete and continuous input devices, with some devices covering nearly the complete spectrum. Keyboard use can be combined with button interactions in the feedback window and direct manipulation in the viewer window. As keyboard interactions and window buttons are either mapped onto discrete movements or directional changes, realistic continuous navigation is currently only possible in mouse and touchscreen mode.

The keyboard navigation mode can be compared to classical computer action game navigation. Hotkeys are used for mode setting (walk, fly, etc.) and directional keys for indicating the direction of movement (left, right, up, down, etc.). Any keypress initiating a movement corresponds to certain incremental changes of the camera location and orientation. Although being highly effective, this kind of navigation requires intensive training periods. Therefore, it is most suitable for those groups of users who are very familiar with both keyboard interaction and complex navigation patterns. On the contrary, button interfaces can be used immediately because the graphical icons speak for themselves, illustrating their functionality, i.e. indicating directions by arrow symbols. Compared to keyboard interaction, using the buttons is by far not as effective, but much easier to comprehend. Therefore, buttons also provide a kind of fallback modality. In our interface, keyboard and button interactions currently represent strictly discrete input devices.
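A minimal sketch of how such a keyboard mapping could look, assuming a hypothetical key layout; the actual hotkey assignment of the system is not specified in the paper.

# Hypothetical mapping of directional keys onto words of the functionality
# grammar described in section 4.1 (the key choices are assumptions).
KEYMAP = {
    "w": "walk trans forward",
    "s": "walk trans backward",
    "a": "walk trans left",
    "d": "walk trans right",
    "q": "walk rot left",
    "e": "walk rot right",
}

def key_to_command(key):
    """Translate a single keypress into a discrete grammar command, if one is defined."""
    return KEYMAP.get(key)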

The most intuitive and at the same time highly efficient form of navigation is given by the third class of haptic input devices: direct mouse and touchscreen interaction. Areas of the screen can be touched or interacted with via the mouse, and the scene view is modified with reference to the current navigation mode.

Figure 3: Spectrum of available haptic input devices

Depending on the current context, mouse and touchscreen can be used both as a discrete input device, i.e. by dereferencing objects in the scene for manipulation or target points for a BEAM-TO navigation, and as a continuous device, i.e. by using a drag mechanism. Finally, joystick support has been integrated to alternatively provide an easy-to-use input device which, in comparison to mouse and keyboard, is explicitly preferred by those users who are very familiar with playing computer games.

3.2 Special VR Hardware Modules
Also belonging to the class of haptic devices, specially designed VR hardware modules provide additional system interaction which can be used in parallel and synergistically with the conventional haptic devices discussed above and the semantically higher-level modalities like speech and dynamic hand and head gestures.

Comprising both translational and rotational movements in a single compact device, the spacemouse represents a consequent advancement of the classical mouse. As every degree of freedom for navigating in 3D space can be simulated, it is also referred to as a 6DOF mouse (six degrees of freedom). Additional buttons facilitate the control of various system parameters, e.g. navigation mode, light conditions, etc. As with keyboard interaction, using the spacemouse can be extremely efficient if the user is well trained. Reducing the interaction to two degrees of freedom each for translation and rotation results in easier handling of the device, also for non-experienced users.

Moreover, our interface provides navigation input by a Polhemus position tracker and a standard data glove. This combination has already proved its usability in numerous applications (among others in [10][5][19]). In contrast to the 6DOF mouse, the tracker allows for intuitive navigation in 3D space with only minimal training effort. Users can simply control the system by moving their hand. The data glove additionally provides a set of static gestures which can be employed to trigger specific interface commands.

3.3 Speech Modules
The task of navigating in arbitrary virtual worlds is cognitively complex. The goal of enabling any class of users to operate such a system requires the design of an interface that is as intuitive as possible. By introducing speech as a new input modality, our navigation interface provides a natural form of human-machine communication [20]. Speech moreover allows navigation without losing eye focus on the scenario by glancing at haptic input devices, and it leaves one's hands free for further operations.

Figure 4: Architecture for recognizing speech

With reference to special navigation scenarios, different users show a massively varying linguistic extent of verbosity. In this respect, both command-based speech and natural spontaneous speech utterances are supported.

Two different approaches providing different specializations have been implemented. As shown on the left in figure 4, the first recognition module is a semantic one-pass stochastic top-down decoder, topped by an intention layer. Based on a system developed by Muller and Stahl [14], the acoustic layer is completely integrated in the recognition process, extending the search in the semantic-syntactic (SeS) layer by acoustic and phonetic sub-layers. This principle demands a longer computing time, but delivers robust speech-signal interpretation results given complex utterances without unknown vocabulary. One further restriction is the need for manual segmentation.

The second recognition module, shown on the right-hand side, is a two-pass decoder topped by the same intention interpreter. On the acoustic stages we use a commercial speech recognition software [6] delivering recognized word chains plus additional single-word confidence measures. The generic idea of not allowing world-specific vocabulary strongly forces us to cope with out-of-vocabulary occurrences (OOVs). This is realized only in this second solution by neglecting uncertain semantic units in an intermediate stage on the basis of acoustic confidence measures. On the semantic-syntactic layer we apply an algorithm that operates basically rule-based and strictly signal-driven. Ambiguities, however, are resolved by the use of static probabilities. Regarding these as the product of first-order dependencies, the a-priori probability for an assumed sub-intention is set dynamically in a next step, smoothly constraining the rival alternatives. This can be achieved by integrating additional knowledge bases like a user model or by intention prediction considering the actual system state and dialogue history. The approach in general proves most suitable: a recognition rate on a word-chain basis of 96.3% can be achieved with a corpus including 318 OOVs. When integrating the acoustic layers, we achieved very fast real-time recognition even with automatic segmentation at a 67.7% signal-interpretation rate.
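As an illustration of the confidence-based rejection of uncertain semantic units, the following Python sketch filters a recognized word chain by a per-word confidence threshold; the threshold value and data layout are assumptions, not figures from the paper.

# Hypothetical word chain as delivered by the acoustic stage: (word, confidence) pairs.
word_chain = [("walk", 0.92), ("towards", 0.31), ("the", 0.88), ("fountain", 0.27), ("forward", 0.85)]

CONFIDENCE_THRESHOLD = 0.5   # assumed value; the paper does not state the actual threshold

def reject_uncertain(chain, threshold=CONFIDENCE_THRESHOLD):
    """Drop semantic units whose acoustic confidence is too low (likely OOV or misrecognized)."""
    return [word for word, conf in chain if conf >= threshold]

print(reject_uncertain(word_chain))   # ['walk', 'the', 'forward']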

Using both decoders together, competing against each other in a pre-integrating unit, yields very robust recognition results. The effective recognition rate thereby exceeds 80%, as similar recognition results are mapped onto the same commands for controlling the VRML viewer frontend. The principle of single-modal input being interpreted by different instances additionally allows for a confidence measure even on the intention level.

Figure 5: Architecture for recognizing gestures

3.4 Gesture Modules
Gestures often support speech in interpersonal communication. In various domains, e.g. when dealing with spatial problems, information can be communicated more easily and precisely using gestures instead of describing certain circumstances by speech. To make a step towards natural human communication behavior, our interface thus provides navigation input by dynamic hand and head gestures in addition to spoken utterances.

The gesture module is mainly based on a system for the recognition of dynamic hand gestures in a visual dialogue system [12]. Figure 5 gives a coarse overview of the underlying recognition process. The central design principle consists in using Hidden Markov Models (HMMs) for the temporal segmentation and classification of the pictures in a continuous video stream. Video images are grabbed by a standard CCD camera module and transformed to YUV format. For separating the hand from the background, a color-histogram-based spatial segmentation method is used that calculates binary images solely containing the hand shape. Defined lighting conditions combined with additional low-level image processing guarantee optimal segmentation results. Afterwards the spatio-temporal image sequences are transformed into feature vectors, performing best when calculating Hu moments. For the classification we use semi-continuous HMMs with 256 prototypes and 25 states per model. The temporal segmentation is done with a single-level, highly robust approach, however at enormous computational cost [13]. Basically, the classification principle consists in continuously observing the output scores of the HMMs at every time stamp. Peaks appearing in the individual HMM output scores allow to determine which image sequence and which gesture occurred at what time.
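The following Python sketch illustrates the described peak-based decision over continuously observed model scores; the score values, peak criterion and stream layout are illustrative assumptions and not taken from [12] or [13].

def detect_gesture_peaks(score_streams, min_score=0.8):
    """For each gesture model, report time stamps where its score forms a local peak
    above a minimal value (illustrative criterion; the real system observes HMM scores)."""
    detections = []
    for gesture, scores in score_streams.items():
        for t in range(1, len(scores) - 1):
            if scores[t] > min_score and scores[t] > scores[t - 1] and scores[t] > scores[t + 1]:
                detections.append((t, gesture))
    return sorted(detections)

# usage with two hypothetical gesture models observed over six frames
streams = {
    "wave_right": [0.1, 0.4, 0.9, 0.5, 0.2, 0.1],
    "point":      [0.2, 0.3, 0.4, 0.6, 0.95, 0.3],
}
print(detect_gesture_peaks(streams))   # [(2, 'wave_right'), (4, 'point')]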

In total, 41 different hand gestures can be recognized at a rate of over 95%. With an image size of 192x144 pixels, the system is able to handle 25 images per second, meeting the realtime requirements of the multimodal interface. Usability studies clearly demonstrate that visual interaction using hand gestures is highly accepted, especially by non-expert computer users, and gets even better ratings when gestural input can be combined with speech.

The same technique used for classifying hand gestures is employed for the interpretation of dynamic head gestures, currently differentiating between ten different gestures. Although the module is still in a prototypical state, already more than three-quarters of the gestures in the training material are classified correctly. Moreover, if clusters of similar gestures are taken into account, the recognition rate can be further improved. As a special feature, our head gesture module integrates multimodal context knowledge and therefore provides a goal-oriented classification scheme. Various tests [1] showed that head gestures are selectively used to support the primary mode of communication and thus can be evaluated to obtain better recognition results for the overall user intention.

4. MULTIMODAL DATA HANDLING
In this work we present a generic approach for handling multimodal information in a VRML browser, which can be generated by arbitrary and also multiple input devices. Therefore, special VR devices like data gloves, position trackers and spacemouse systems or even higher-level components like speech and gesture recognition modules can synergistically be used in parallel to conventional haptic devices like keyboard, mouse, joystick and touchscreen.

4.1 Communication Formalism
For controlling the browser we use an abstract communication formalism that is based on a context-free grammar (CFG). Completely describing the various functionalities of the browser, the grammar model facilitates the representation of domain- and device-independent multimodal information contents [2]. User interactions and system events are combined in a semantic unification process and thereby mapped onto an abstract model of the functionality vocabulary. Thus, for example, natural speech utterances, hand gestures and spacemouse interaction can all be described in the same formalism. The browser module operates solely on the formal model of the grammar.

A single word of this grammar corresponds to a single command or an event of the interface. Multiple words form a sentence denoting a sequence of actions, e.g. a whole session. The language defined by the grammar represents the multitude of all potential interactions. A part of a typical context-free grammar used in our projects is given below. It is described in Backus-Naur form (BNF) and demonstrates a small part of the possible actions in the WALK mode. According to this grammar, a valid command of the functionality vocabulary would be: "walk trans forward". Messages and events created by the various low- and high-level devices, from simple keystrokes to complex natural speech utterances, are mapped onto words of this grammar.

<CMD>  ::= <CONTROL> | <WALK> | <FLY> | ...
<WALK> ::= walk <WSEQ> | startwalk <WSEQ> | stop
<WSEQ> ::= <W> | <W> <WSEQ>
<W>    ::= trans <ALLD> | rot <LRD>
<ALLD> ::= <ALL> | <ALL> <ALLD>
<ALL>  ::= <LR> | <FB> | <UD> | <DIAG>
<UD>   ::= up | down
<LR>   ::= left | right
<FB>   ::= forward | backward
<DIAG> ::= lfwd | rfwd | lbwd | rbwd
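As an illustration of how a device wrapper could check its output against this grammar, the following Python sketch encodes the excerpt above and tests membership of a command. It is a simplified check written for this excerpt only (assuming, for instance, that <LRD> expands to left/right), not the system's actual implementation.

ALL_DIRS = {"left", "right", "forward", "backward", "up", "down", "lfwd", "rfwd", "lbwd", "rbwd"}
LR_DIRS  = {"left", "right"}   # assumed expansion of <LRD>, which is not shown in the excerpt

def is_valid_walk_command(tokens):
    """Check a token list against the WALK fragment of the grammar excerpt
    (simplified: <WSEQ> and <ALLD> are matched greedily)."""
    if tokens == ["stop"]:
        return True
    if not tokens or tokens[0] not in ("walk", "startwalk"):
        return False
    rest = tokens[1:]
    if not rest:
        return False
    while rest:                                   # <WSEQ> ::= <W> | <W> <WSEQ>
        head, rest = rest[0], rest[1:]
        if head == "trans":                       # <W> ::= trans <ALLD>
            if not rest or rest[0] not in ALL_DIRS:
                return False
            while rest and rest[0] in ALL_DIRS:   # <ALLD> ::= <ALL> | <ALL> <ALLD>
                rest = rest[1:]
        elif head == "rot":                       # <W> ::= rot <LRD>
            if not rest or rest[0] not in LR_DIRS:
                return False
            rest = rest[1:]
        else:
            return False
    return True

print(is_valid_walk_command("walk trans forward".split()))   # True
print(is_valid_walk_command("walk rot up".split()))          # False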

The set of grammar commands can be subdivided into various command clusters. In general, three major information blocks can be identified.

Figure 6: Modified VRML browser

Movement commands indicate direct navigation information, position-oriented commands denote movements to specific locations, and control commands describe changes of the various browser parameters and the feedback flow.

The key feature of our approach of using the CFG module as a meta-device interfacing the various input devices to the browser is that it provides high-level access to the functionality of the individual recognition modules. By changing the underlying context-free grammar, the interaction behavior of the target application, the VRML browser, can easily be modified even by non-experts, without having to deal with the technical specifications of the input devices. The commands of the grammar are comprehensible in a straightforward way, as they are inherently meaningful. An additional advantage of this approach is that arbitrary new information sources can easily be integrated into the interface, too.

4.2 Controlling the Browser Module
Regarding the event handling of a standard browser implementation, we find a rigid event routing between input events like mouse movements and key strokes and the input handling that realizes the navigation. In the framework described in this paper, we use the mechanism of ROUTEs to break up this rigid event routing. Additional devices can plug into the event dispatcher module of the browser and dispatch their raw input to a DeviceSensor node in the VRML scene graph. In addition, a device can trigger certain browser actions by using the CFG formalism. Therefore, a TCP/IP socket backport has been implemented, watching the network for relevant browser commands.

Figure 6 shows the structure of our modified browser module. At VRML scope, the newly defined DeviceSensor node gives access to all kinds of conceivable user input. The Camera node gives superb control over the camera used for rendering. Moreover, by setting velocity vectors, the virtual camera can be animated, computing the distance vector for the current frame in consideration of previous frame times and the results of the collision detection. Besides a 6DOF navigation mode, the camera node supports an EXAMINE and a BEAMTO mode, thus covering all conceivable navigation modes while hiding the nasty details of 3D math from the VRML content author.

However, in the absence of a Camera node, a default built-in navigation should encompass a minimal navigation feature set allowing all basic navigation actions such as the WALK, SLIDE and EXAMINE modes. Detailed information concerning the concept and the implementation of the browser extensions can be found in [2].
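The frame-wise camera animation described above can be illustrated with the following Python sketch, which integrates a velocity vector over the elapsed frame time and skips the update when a (stubbed) collision test fails; all names and the collision stub are assumptions for illustration.

def advance_camera(position, velocity, dt, collides):
    """Move the camera by velocity * dt, unless the target position would collide.

    position, velocity: 3-component lists (world coordinate system)
    dt: time elapsed since the previous frame, in seconds
    collides: callable returning True if a position is blocked (stand-in for the
              browser's collision detection)
    """
    target = [p + v * dt for p, v in zip(position, velocity)]
    return position if collides(target) else target

# usage with a trivial collision test (a wall at z = -10, assumed for the example)
pos = [0.0, 1.6, 0.0]
vel = [0.0, 0.0, -2.0]                    # moving forward at 2 units per second
pos = advance_camera(pos, vel, dt=0.04, collides=lambda p: p[2] < -10.0)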

4.3 System Architecture
In our research endeavors, we are especially interested in designing a generic multimodal system architecture which can easily be adapted to various conditions and applications and thus serve as a reusable basis for the development of multimodal interaction systems. Belonging to the class of systems working on a late semantic fusion level [17], our design philosophy is to merge several competing modalities (e.g. two or more speech modules, dynamic and static gesture modules as well as various haptic modules) that are not necessarily known in advance.

The individual components of our interface communicate via TCP/IP sockets in a blackboard architecture. For the implementation we propose a classical client-server approach. Physically, the integrator functions as a central broadcast server, sending incoming messages to each of the connected clients, i.e. recognition modules, viewer and feedback applications and other modules integrated in the interface. Our architecture explicitly facilitates the integration of various agents in a shared environment, bound neither to any specific software nor operating system. Figure 7 gives a structural overview of our system.

The input modules provide recognition results which are translated by CFG wrappers with regard to the grammar formalism described above. Afterwards they are sent to the integrator, which interprets the continuous data stream of the individual components and generates commands matching the user's intention. Finally, commands are sent to the viewer and feedback components. The components of our interface communicate bilaterally with the integrator, i.e. the viewer sends back status messages and potential internal error states. To improve recognition results, the integrator shares its global context knowledge with the recognition modules.

All messages obey the same format, including a unique target id, source id, timestamp and message content. Each module has its own parser component, checking whether incoming messages on the socket are correct and designated for the module. Thus the parser also functions as a message filter, as not all messages are of interest to every module.
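A minimal sketch of this message format and the per-module filtering, assuming a simple delimiter-separated wire encoding; the field order and separator are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Message:
    target_id: str      # module the message is designated for
    source_id: str      # module that produced the message
    timestamp: float    # seconds at the sender, e.g. time.time()
    content: str        # e.g. a CFG command such as "walk trans forward"

def parse_message(raw: str) -> Message:
    """Parse one wire-format line 'target|source|timestamp|content' (assumed encoding)."""
    target, source, ts, content = raw.split("|", 3)
    return Message(target, source, float(ts), content)

def accept(msg: Message, own_id: str) -> bool:
    """Filter: a module only processes messages addressed to it (or broadcast to all)."""
    return msg.target_id in (own_id, "all")

# usage
msg = parse_message("viewer|speech1|1004563200.25|walk trans forward")
print(accept(msg, "viewer"))   # True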

We have already successfully ported the underlying architecture to other domains. One of these projects, carried out in cooperation with major automobile industry partners, deals with the multimodal operation of an MP3 player, radio and telephone interface in an automotive environment.

4.4 Integration Process
The integrator module interprets the incoming multimodal information stream. It consists of different cooperating components, implemented as independent threads communicating via shared memory. By following an object-oriented approach, the design of the integrator system is flexible, extensible, easy to maintain, and above all highly robust.

The parser checks incoming and outgoing messages for correctness, deciding whether they are valid in the current system context. Containing meta-knowledge of the application, the recognition modules and the integration unit, the state machine provides the database for the integration process.

Figure 7: Overview of the system architecture

The most important component within the integrator is the intention decoder, generating commands for the viewer. Sentinels help to supervise the data stream, watching for redundant, complementary and concurrent incoming data. Additionally, a dialog management module initiates status and error-resolving dialogs.

To integrate the various information contents of the individual input modalities, we mainly use a classical straightforward rule-based approach, well established in various research applications (e.g. in [4][22]). The semantic unification is carried out with reference to the CFG, modeling device-independent user interactions. Single elements, words and subwords of the grammar, are migrated and combined by evaluating a problem-specific look-up table, facilitating the multimodal integration process in realtime. Complementary information contents are easiest to handle, as the individual grammatical elements support each other to create a valid browser command. Redundancy is handled by supervising the multimodal information stream and associating various events that occur within a context-sensitive timing window with the same system command. Moreover, concurrent (competing) information contents are detected when conflicts arise in the look-up table, indicated by negative scores for the overall user intention.
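The following Python sketch illustrates the timing-window idea behind this rule-based fusion: events from different modalities that fall within a common window are merged into one command, with identical (redundant) fragments collapsing to a single occurrence. The window length, event layout and merging rule are simplifications assumed for the example, not the system's look-up-table mechanism.

TIME_WINDOW = 1.5   # seconds; assumed length of the context-sensitive timing window

def fuse(events):
    """Group (timestamp, modality, fragment) events into one command per timing window.

    Redundant fragments (identical text) collapse to one occurrence; complementary
    fragments are concatenated in temporal order to form a grammar command.
    """
    events = sorted(events)
    commands, group, group_start = [], [], None
    for t, modality, fragment in events:
        if group and t - group_start > TIME_WINDOW:
            commands.append(" ".join(dict.fromkeys(group)))   # dedupe, keep order
            group, group_start = [], None
        if group_start is None:
            group_start = t
        group.append(fragment)
    if group:
        commands.append(" ".join(dict.fromkeys(group)))
    return commands

# usage: speech sets mode and function, a button press supplies the direction redundantly
events = [(0.10, "speech", "walk trans"), (0.60, "button", "forward"), (0.70, "speech", "forward")]
print(fuse(events))   # ['walk trans forward']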

As a second approach to integrating multimodal user input, we introduce a new, stochastically motivated technique. Employing biologically inspired evolutionary strategies, the core algorithm comprises a population of individual solutions to the current navigation context and a set of operators defined over the population itself. Various integration results compete against each other. According to evolutionary theory, only the best-suited elements in a population are likely to survive and generate offspring, transmitting their heredity to new generations and thus leading to stable and robust solutions, i.e. resulting in the correct interpretation of the user intention.

Figure 8: Distribution of unimodal (left) and multimodal (right) system interactions

Following a technique described in [11], our genetic algorithm is composed of an abstract data type and non-standard recombination operators that are specially designed to suit the problem at hand. A potential solution consists of three chromosome parts. The administration chromosome contains information about the pre-command (with exact time and certainty), alternative pre-commands of previous integration steps, mutation possibilities and potential following commands, as well as matching partners for complementary interactions. The command chromosome contains the generated command, i.e. a word of the grammar formalism with a specific confidence measure and information about necessary additional commands. Finally, a number of data chromosomes, one for each input device, comprise detailed information on the recognized semantic units, recognition times, certainties as well as lists of complementary, supplementary and competing information contents. Both crossover and mutation techniques are applied to obtain optimal results. The fitness of an individual, measuring the certainty and the confidence of a generated command, is calculated according to a statistically weighted scheme including the various information resources.
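To make the chromosome layout concrete, the sketch below models the three chromosome parts as plain Python data classes and computes a fitness value as a weighted combination of command confidence and per-device certainties; the weights and any fields beyond those named in the text are assumptions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class AdministrationChromosome:
    pre_command: str = ""                 # previous command (simplified: time and certainty kept separately)
    pre_command_time: float = 0.0
    pre_command_certainty: float = 0.0
    alternatives: List[str] = field(default_factory=list)
    complementary_partners: List[str] = field(default_factory=list)

@dataclass
class CommandChromosome:
    command: str = ""                     # a word of the grammar, e.g. "walk trans forward"
    confidence: float = 0.0

@dataclass
class DataChromosome:
    device: str = ""                      # one data chromosome per input device
    semantic_units: List[str] = field(default_factory=list)
    certainty: float = 0.0

@dataclass
class Individual:
    admin: AdministrationChromosome
    command: CommandChromosome
    data: List[DataChromosome]

    def fitness(self, w_cmd: float = 0.6, w_dev: float = 0.4) -> float:
        """Weighted combination of command confidence and mean device certainty
        (weights are assumed, not taken from the paper)."""
        dev_cert = sum(d.certainty for d in self.data) / max(len(self.data), 1)
        return w_cmd * self.command.confidence + w_dev * dev_cert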

The key feature of this new technique is that the fundamental aspects of multimodal integration (redundancy, complementarity and concurrency) do not have to be handled as special cases, because they are implicitly modeled by the structure of the evolutionary algorithm. As a big advantage, the approach facilitates an easy integration of additional input modules without having to modify the integration algorithm itself. Finally, the algorithm is extremely robust with regard to changing boundary conditions.

5. USER STUDY
We extensively evaluated the multimodal navigation interface in usability studies [1]. A total of 40 test persons participated in the test, which was separated into two series with 17 persons using a standard mouse as the primary pointing device and 23 persons using touchscreen interaction instead. A small control group of three persons participated in both test series. The individual test subjects were divided into three groups (beginners, normal computer users and experts) according to a weighted classification scheme of various answers given in the first questionnaire. In the course of the test, the participants were asked to, but not forced to, use combinations of mouse, touchscreen, keyboard, speech as well as hand and head gestures. The test was separated into five blocks evaluating the performance of four prescribed combinations and free interaction.

The user study has partly been realized according to the 'Wizard-of-Oz' (WOO) test paradigm. In contrast to the haptic interactions, which are directly transcribed by the system, the hand, head and speech recognition has been simulated by a human supervisor observing the test subjects via audio and video signals and generating the appropriate browser commands. Although we had already implemented and tested the individual input modules, the WOO scenario was chosen to cope with differing running times and potential failures of the recognition modules. As we were mostly interested in natural interaction forms, we did not want the test results to be dominated by these factors.

The distribution of the various system interactions is related to the navigation tasks, which, in our case, were quite simple. Commands merely consisted of navigation modes (walk, fly, examine), functions (trans, rot, twist, etc.) and values (left, up, forward, etc.). Beginners showed significantly more full commands, containing all information slots, than the other groups. Concerning the distribution of unimodal commands (shown on the left-hand side of figure 8), all users clearly favored keyboard input. As the next best choice, experts and normal users clearly preferred the mouse, whereas beginners employed speech and hand gestures.

Although only applied in about one fifth of all interactions, multimodal commands represented the core interaction style, as they were particularly used to change navigation contexts, i.e. switching from translational to rotational movements. For all groups, the use of speech dominated the distribution of multimodal interactions (shown on the right-hand side of figure 8). Detailed analysis clearly showed that the use of multimodal interaction increased with growing task complexity. Regarding the distribution of redundant, competing and complementary interactions, the latter occurred most often, in the case of beginners very often coupled with additional redundancy. When combining multiple input modalities, beginners especially favored gestures and speech. The other two groups strongly combined speech (for setting mode and functions) with haptic devices to indicate directions.

A very interesting result was that, although people generally stated that they did not like head gestures very much, more than 80% unconsciously used their heads to support their primary communication modality. In general, people highly enjoyed having the possibility to freely choose among multiple input devices. The results of the usability study provided the basis for adapting the individual recognition modules and designing appropriate integration algorithms.

6. FUTURE WORK
For the near future we plan to further improve our interface. First and foremost, we plan to integrate a balanced strategy to switch between discrete and continuous navigation patterns. Handling both input streams in separate units, synchronized by a periodic update cycle, promises major improvements in system usability. Second, the current communication formalism is to be extended to better support the individual advantages of the various modalities. Third, we want to improve the underlying communication architecture by porting the system to a CORBA-based communication message layer, supporting platform-independent integration and recognition modules. Moreover, we want to integrate additional input devices, as the usability studies clearly demonstrated that different people show strongly varying interaction patterns, and with more available input possibilities each user has the freedom to apply those devices that are most intuitive and effective for him. Finally, we plan to integrate error management and adaptive help strategies to also support non-experienced users.

7. CONCLUSION
In this paper we proposed a multimodal interface for navigating in arbitrary virtual worlds. Meeting realtime conditions, the user can navigate by keyboard, mouse, touchscreen and joystick, as well as by special VR hardware like 6DOF mouse, data glove and position tracker and, as a key feature, by natural and command speech and dynamic head and hand gestures. We provided a generic architecture for the development of multimodal interaction systems. Based on the abstract formalism of a context-free grammar, the individual system modules communicate with each other via a central integrator comprising meta-knowledge of the application, the recognition modules and the integration unit itself. To handle the multimodal information stream we used a rule-based approach and additionally introduced a new technique employing evolutionary strategies. Finally, we presented some results of a usability study and identified potential extensions to our multimodal interface.

8. REFERENCES
[1] F. Althoff, G. McGlaun, and M. Lang. Combining multiple input modalities for VR navigation - A user study. In 9th Int. Conf. on HCI, August 2001.
[2] F. Althoff, T. Volk, G. McGlaun, and M. Lang. A generic user interface framework for VR applications. In 9th Int. Conf. on HCI, New Orleans, August 2001.
[3] H. J. Bullinger. Virtual reality as a focal point between new media and telecommunication. VR World 1995 - Conference Documentation, IDG 1995.
[4] A. Cheyer and L. Julia. Designing, developing and evaluating multimodal applications. In Workshop on Pen/Voice Interfaces (CHI 99), Pittsburgh, 1999.
[5] K.-H. Engelmeier et al. Virtual reality and multimedia human-computer interaction in medicine. IEEE Workshop on Multimedia Signal Processing, pages 88-97, Los Angeles, December 1998.
[6] Lernout & Hauspie Speech Products N.V. Lernout & Hauspie - Software Developers Kit, 1998.
[7] Details of the MIVIS system (October 2001). Internet publication, http://www.mivis.de.
[8] Developer site of blaxxun interactive (July 2001). http://www.blaxxun.com/developer/contact/3d.
[9] Specification of VRML 97. ISO/IEC 14772-1:1997, http://www.web3d.org/technicalinfo/specifications/vrml97/index.html, July 2001.
[10] M. Latoschik et al. Multimodale Interaktion mit einem System zur virtuellen Konstruktion. Informatik '99, 29. Jahrestagung der Gesellschaft für Informatik, Paderborn, pages 88-97, October 1999.
[11] Z. Michalewicz. Genetic Algorithms and Data Structures. Springer-Verlag, New York, 1999.
[12] P. Morguet. Stochastische Modellierung von Bildsequenzen zur Segmentierung und Erkennung dynamischer Gesten. PhD thesis, Technical University of Munich, Germany, January 2001.
[13] P. Morguet et al. Comparison of approaches to continuous hand gesture recognition for a visual dialog system. Proc. of ICASSP 99, pages 3549-3552, 1999.
[14] J. Muller and H. Stahl. Speech understanding and speech translation in various domains by maximum a-posteriori semantic decoding. In Proc. EIS 98, pages 256-267, La Laguna, Spain, 1998.
[15] J. Nielsen. Usability Engineering. Morgan Kaufmann Publishers Inc., 1993.
[16] S. L. Oviatt. Ten myths of multimodal interaction. Communications of the ACM, 42(11):74-81, 1999.
[17] S. L. Oviatt. Multimodal interface research: A science without borders. Proc. of 6th Int. Conference on Spoken Language Processing (ICSLP 2000), 2000.
[18] V. Pavlovic, G. Berry, and T. Huang. BattleView: A multimodal HCI research application. In Workshop on Perceptual User Interfaces (PUI 98), November 1998.
[19] I. Poupyrev, N. Tomokazu, and S. Weghorst. Virtual Notepad: Handwriting in immersive VR. Proc. of IEEE Virtual Reality Annual International Symposium '98 (VRAIS '98), pages 126-132, 1998.
[20] B. Schuller, F. Althoff, G. McGlaun, and M. Lang. Navigating in virtual worlds via natural speech. In 9th Int. Conf. on HCI, New Orleans, August 2001.
[21] J. Stewart. FreeWRL homepage. Internet publication, http://www-ext.crc.ca/FreeWRL, June 2001.
[22] A. Waibel, M. T. Vo, P. Duchnowski, and S. Manke. Multimodal interfaces. Artificial Intelligence Review, 10(3-4):299-319, August 1995.