
Noname manuscript No. (will be inserted by the editor)

Generation and Evaluation of Communicative Robot Gesture

Maha Salem · Stefan Kopp · Ipke Wachsmuth · Katharina Rohlfing · Frank Joublin

Received: date / Accepted: date

Abstract How is communicative gesture behavior in robots perceived by humans? Although gesture is crucial in social interaction, this research question is still largely unexplored in the field of social robotics. Thus the main objective of the present work is to shed light onto how gestural machine behaviors can ultimately be used to design more natural communication in social robots. The chosen approach is twofold. Firstly, the technical challenges encountered when implementing a speech-gesture generation model on a robotic platform are tackled. We present a framework that enables the Honda humanoid robot to flexibly produce synthetic speech and co-verbal hand and arm gestures at run-time, while not being limited to a predefined repertoire of motor actions. Secondly, the achieved flexibility in robot gesture is exploited in controlled experiments.

M. Salem, Research Institute for Cognition and Robotics, Bielefeld, Germany. E-mail: [email protected]

S. Kopp, Sociable Agents Group, Bielefeld University, Germany. E-mail: [email protected]

I. Wachsmuth, Artificial Intelligence Group, Bielefeld University, Germany. E-mail: [email protected]

K. Rohlfing, Emergentist Semantics Group, Bielefeld University, Germany. E-mail: [email protected]

F. Joublin, Honda Research Institute Europe, Offenbach, Germany. E-mail: [email protected]

To gain a deeper understanding of how communicative robot gesture might impact and shape human perception and evaluation of human-robot interaction, we conducted a between-subjects experimental study using the Honda humanoid robot in a joint task scenario. We manipulated the non-verbal behaviors of the humanoid robot in three experimental conditions, so that it would refer to objects by utilizing either (1) unimodal (i.e. speech-only) utterances, (2) congruent multimodal (i.e. semantically matching speech and gesture) utterances, or (3) incongruent multimodal (i.e. semantically non-matching speech and gesture) utterances. Our findings reveal that the robot is evaluated more positively when non-verbal behaviors such as hand and arm gestures are displayed along with speech, even if they do not semantically match the spoken utterance.

Keywords Multimodal Interaction and Conversational Skills · Non-verbal Cues and Expressiveness · Social Human-Robot Interaction · Robot Companions and Social Robots

1 Introduction

One of the main objectives of social robotics research is to design and develop robots that can engage in social environments in a way that is appealing and familiar to human interaction partners. However, interaction is often difficult because inexperienced users do not understand the robot's internal states, intentions, actions, and expectations. Thus, to facilitate successful interaction, social robots should provide communicative functionality that is both natural and intuitive. The appropriate level of such communicative functionality strongly depends on the appearance of the robot and the attributions thus made to it. Given the design of humanoid robots, they are typically expected to exhibit human-like communicative behaviors, using their bodies for non-verbal expression just as humans do.


Representing an important feature of human communication, co-verbal hand and arm gestures are frequently used by human speakers to illustrate what they express in speech [24]. Crucially, gestures help to convey information which speech alone cannot provide, as in referential, spatial or iconic information [11]. At the same time, human listeners have been shown to be well attentive to information conveyed via such non-verbal behaviors [7]. Moreover, providing multiple modalities helps to dissolve the ambiguity typical of unimodal communication and, as a consequence, to increase the robustness of communication. Thus it appears reasonable to equip humanoid robots that are intended to engage in natural and comprehensible human-robot interaction with speech-accompanying gestures.

1.1 Gesture in Human Communication

Gesture is a phenomenon of human communication that has been studied by researchers from various disciplines for many years. A multiplicity of hand, arm and body movements can all be considered to be gestures, and although definitions and categorizations vary widely, much gesture research has sought to describe the different types of gesture, e.g. [24,15]. McNeill [24], for example, categorizes four main types of gesture based on semiotics: (1) iconics, i.e. gestures representing images of concrete entities and/or actions; (2) metaphorics, i.e. gestures whose pictorial content presents abstract ideas rather than concrete objects; (3) deictics, i.e. pointing gestures; and (4) beats, i.e. hand movements performed along with the rhythmical pulsation of speech without conveying semantic information. In his later work, however, McNeill [25] claims that the search for categories actually seems misled: since the majority of gestures are multifaceted, it is more appropriate to think in terms of combinable dimensions rather than categories. In this way, dimensions can be combined without the need for a hierarchy. Unlike task-oriented movements like reaching or object manipulation, human gestures are partly derived from an internal representation of 'shape' [17], which particularly applies to iconic or metaphoric gestures. Such characteristic shape and dynamical properties enable humans to distinguish gestures from subsidiary movements and to perceive them as meaningful [42].

In this paper, we use the term gesture to refer specifically to representational gestures [12], i.e. movements that co-express the content of speech either by pointing to a referent in the physical environment (deictic gestures) or by depicting a referent with the motion or shape of the hands (iconic gestures).

Other types of gesture, such as beat gestures (movements that emphasize the prosody or structure of speech), emblems (movements that convey conventionalized meanings) and turn-taking gestures (movements that regulate interaction between speakers), fall outside the scope of the present work.

1.2 Gesture Behavior for Artificial Communicators

Endowing a humanoid robot with communicative co-verbal gestures requires a large degree of flexible control, especially with regard to the shape properties of the gesture. At the same time, adequate timing and a natural appearance of these body movements are essential to add to the impression of the robot's liveliness. Since the challenge of multimodal behavior realization for artificial humanoid bodies has already been explored in the context of virtual conversational agents, our approach builds upon an existing solution from this research area [42,35]. The Articulated Communicator Engine (ACE) [18] implements the speech-gesture production model that was originally designed for the virtual human agent Max and is now used as the underlying action generation framework for the Honda humanoid robot (Fig. 1). Based on the implementation of such a speech and gesture production model for humanoid robot gesture [36], we exploit the achieved flexibility in communicative robot behavior in a controlled experimental study to investigate how humans experience a humanoid robot that performs gestures during interaction. In this way, we try to shed light onto human perception and understanding of gestural machine behaviors and how these can be used to design more natural communication in social robots.

The rest of this paper is organized as follows. We first discuss related work in Section 2, showing that not much research has focused on the generation and evaluation of robot gesture.

Fig. 1 The goal of the present work is to realize speech and non-verbal behavior generation for the physical Honda humanoid robot (right) by transferring an existing virtual agent framework as employed for the agent Max (left), and to subsequently evaluate it in controlled experiments of human-robot interaction.


In Section 3, we describe our multimodal behavior realizer, the Articulated Communicator Engine (ACE), which implements the speech-gesture production model originally designed and implemented for the virtual human agent Max and is now used for the Honda humanoid robot (Fig. 1). We then describe our approach to a robot control architecture employing ACE for producing gestural hand and arm movements for the humanoid robot in Section 4. Subsequently, gesture representations realized in our controller framework are presented, evaluated and discussed in Section 5. We further describe the empirical study conducted to evaluate robot gesture in a human-robot interaction scenario and present an evaluation and discussion of the results in Section 6. Finally, we conclude and give an outlook on future work in Section 7.

2 Related Work

Two research areas are relevant to the present work: firstly, in the area of computer animation, researchers have developed frameworks to realize multimodal communication behavior in virtual conversational agents; secondly, in the field of robotics, researchers have explored various approaches to generating non-verbal behaviors along with speech in humanoid robots. The challenges are similar in that both research areas demand a high degree of control and flexibility so that human-like motion can be adapted to a system with non-human kinematics. The levels of complexity encountered in each field, however, are not equivalent. Although the range of different body types found in virtual embodied agents is manifold and hence challenging, character animation is less restricted in its motion than even the most state-of-the-art humanoid robots [33]. For example, the animation of virtual agents reduces or even eliminates the problems of handling joint and velocity limits; in a robot body, however, these have to be explicitly addressed given real physical restrictions.

2.1 Virtual Agents

In contrast to the research field of robotics, the challenge of generating speech and co-verbal gesture has already been tackled in various ways within the domain of virtual human agents. Some of the earliest work includes that of Cassell et al., who presented the REA system [5], in which a conversational humanoid agent operates as a real estate salesperson. A more recent approach is that of the interactive expressive system Greta [31], which is able to communicate using verbal and non-verbal modalities. Even in the domain of virtual conversational agents, however, most existing systems simplify matters by using lexicons of words and canned non-verbal behaviors in the form of pre-produced gestures [9].

In contrast, the ACE framework underlying the virtual agent Max [18] builds upon an integrated architecture in which the planning of both content and form across both modalities is coupled [16], thereby taking into account the meaning conveyed in non-verbal utterances. For this reason, our proposed approach benefits from transferring a sophisticated multimodal behavior scheduler from a virtual conversational agent to a physical robot.

In addition to the technical contributions presented in the area of embodied conversational agents, there has also been active work on evaluating complex gesture models for the animation of virtual characters. Several studies have investigated and compared the human perception of traits such as naturalness in virtual agents. In one such study [19], the conversational agent Max communicated either by utilizing a set of co-verbal gestures alongside speech, typically self-touching or movements of the eyebrows, or by utilizing speech alone without any such accompanying gestures. Human participants were then invited to rate their perception of Max's behavioral-emotional state, for example, its level of aggressiveness, its degree of liveliness, etc. Crucially, the results of the study suggested that virtual agents are perceived in a more positive light when they are able to produce co-verbal gestures alongside speech (rather than acting in a speech-only modality). In [2], Bergmann et al. modeled the gestures of Max based on real humans' non-verbal behavior and subsequently assessed the communicative quality of these models with human participants. The main finding was that Max was perceived as more likable, competent and human-like when gesture models based on individual speakers were applied, as opposed to combined gestures of a collection of speakers, random gestures, or no gestures.

2.2 Robotics

Although much robotics research has been dedicated to the area of gesture recognition and analysis, only a few approaches have pursued both the generation of humanoid robot gesture and the investigation of human perception of such robot behavior. Within the few existing approaches that are actually dedicated to gesture synthesis, the term "gesture" has widely been used to denote object manipulation tasks rather than non-verbal communicative behaviors. For example, Calinon and Billard [4] refer to the drawing of stylized alphabet letters as gestures in their work. Many researchers have focused on the translation of human motion for gesture generation in various robots, usually aiming at the imitation of movements captured from a human demonstrator, e.g. [3].


Miyashita et al. [27] and Pollard et al. [33] present further techniques for limiting human motion of upper body gestures to movements achievable by a variety of different robotic platforms. These models of gesture synthesis, however, mainly focus on the technical aspects of generating robotic motion that fulfills little or no communicative function. In addition, they are limited in that they do not combine the generated non-verbal behaviors with further output modalities such as speech.

Only a few approaches in robotics incorporate both speech and gesture synthesis; however, in most cases the robots are equipped with a set of pre-recorded gestures that are not generated on-line but simply replayed during human-robot interaction, as seen in [8] or [40]. Moreover, a majority of approaches focusing on gesture synthesis for humanoid robots are limited to the implementation and evaluation of a single type of gesture, typically deictic (e.g. [41], [32]) or emblematic gestures (e.g. [13]), instead of providing a general framework that can handle all types of gesture. The communication robot presented in [1] is one of the few systems in which different types of gesture are actually generated on-line. These mainly consist of arm movements and pointing gestures performed synchronously with eyes, head, and arms, and are accompanied by speech to make the robot appear livelier. However, all aforementioned approaches are realized on platforms with less complex robot bodies which, for example, comprise fewer degrees of freedom (DOF), have limited mobility, and perform body movements in a rather jerky fashion (as seen in [1]). Moreover, many of these robots exhibit few or no humanoid traits.

As stated in [26], however, the appearance of a robot can be just as important as its behavior when evaluating the experience of human interaction partners. In other words, the robot's design is crucial if we are to eventually study the effect of robot gesture on humans. MacDorman and Ishiguro [22] have researched human perception of robot appearance based on different levels of embodiment, with android robots representing the most anthropomorphic form. Although an innovative approach, android robots only feature certain hard-coded gestures and thus still lack any real-time gesture-generating capability. Moreover, findings presented in [38] suggest that the mismatch between the highly human-like appearance of androids and their mechanical, less human-like movement behavior may lead to increased prediction error in the brain, possibly accounting for the 'uncanny valley' phenomenon [28]. Thus a major advantage of using the Honda humanoid robot as a research platform lies in its humanoid, yet not too human-like appearance and its smooth, yet not completely natural movement behavior.

Although the Honda robot cannot mimic any facial expression, it is favorable for us to use such a robot, as the focus of the present work lies on hand and arm gestures. This way, the perception of the robot's gestural arm movements can be assessed as the primary non-verbal behavior.

2.3 Evaluation of Robot Gesture

Despite the interesting implications of the evaluation studies conducted with virtual agents, we must be cautious when transferring the findings from the domain of animated graphical characters to the domain of social robots. Firstly, the presence of real physical constraints can alter the perceived level of realism. Secondly, given the greater degree of embodiment that is possible in a real-world system, interaction with a robot is potentially richer; human participants could, for example, walk around or even touch a real robot. This makes the interaction experience more complex and is naturally expected to affect the outcome of the results.

One of the few models that resembles our approach in that it attempts to generate and evaluate a multitude of gesture types for the Honda humanoid robot was presented by Ng-Thow-Hing et al. [30]. Their proposed model reconstructs the communicative intent through text and parts-of-speech analysis to select appropriate gestures. The evaluation of the system, however, was merely undertaken using several video-based studies.

We argue that, in order to obtain a representative assessment of robot gesture and the human perception thereof, it is necessary to evaluate such non-verbal behavior in actual interaction scenarios. As gesture scope and space can only be accurately observed and assessed in a true interaction, we decided to conduct an experimental study using our speech-gesture synthesis model implemented on the Honda humanoid robot. Since the evaluation of the effects and acceptance of communicative robot gesture is still largely unexplored, we attempt to investigate whether multimodal robot behavior, i.e. displaying gesture along with speech, is desired by human interaction partners and favored over unimodal communication.

3 Integrated Model of Speech-Gesture Production

Computational approaches to synthesizing multimodal behavior can be modeled as three consecutive tasks [34] (Fig. 2): firstly, determining what to convey (i.e. content planning); secondly, determining how to convey it (i.e. behavior planning); finally, conveying it (i.e. behavior realization).


Fig. 2 Behavior generation pipeline adapted from Reiter and Dale [34,36].

Addressing the third task of this behavior generation pipeline, the Articulated Communicator Engine (ACE) operates at the behavior realization layer, yet the overall system used by the virtual agent Max also provides an integrated content planning and behavior planning framework [16]. The present work focuses on ACE, which forms the starting point for an interface endowing the humanoid robot with similar multimodal behavior.
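To make the division of labor in this pipeline concrete, the following minimal sketch chains the three stages in Python. All class and function names are illustrative placeholders, not part of ACE's actual API; ACE corresponds to the realization stage only.

```python
# Minimal sketch of the three-stage behavior generation pipeline described above
# (content planning -> behavior planning -> behavior realization).

from dataclasses import dataclass

@dataclass
class CommunicativeIntent:          # what to convey
    referent: str
    location: str

@dataclass
class UtterancePlan:                # how to convey it (speech text + gesture form)
    text: str
    gesture_features: dict

def plan_content(goal: str) -> CommunicativeIntent:
    # Content planning: decide what to convey.
    return CommunicativeIntent(referent="thermos flask", location="upper cupboard, right side")

def plan_behavior(intent: CommunicativeIntent) -> UtterancePlan:
    # Behavior planning: map the intent onto coordinated speech and gesture form features.
    return UtterancePlan(
        text=f"Please take the {intent.referent} and place it on the {intent.location}.",
        gesture_features={"type": "deictic", "direction": "up-right"},
    )

def realize_behavior(plan: UtterancePlan) -> None:
    # Behavior realization: the layer covered by ACE (speech synthesis + motor control).
    print("SPEECH :", plan.text)
    print("GESTURE:", plan.gesture_features)

realize_behavior(plan_behavior(plan_content("instruct placement")))
```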

3.1 Utterance Specification

Within the ACE framework, utterance specifications can be described in two different ways using the Multimodal Utterance Representation Markup Language (MURML) [20]. Firstly, verbal utterances together with co-verbal gestures can be specified as feature-based descriptions in which the outer form features of a gesture (i.e. the posture of the gesture stroke) are explicitly described. Gesture affiliation to dedicated linguistic elements is determined by matching time identifiers. Fig. 3 illustrates an example of a feature-based MURML specification for speech-gesture production.

Fig. 3 A feature-based MURML specification for multimodal utterances.

Secondly, gestures can be specified as key-frame animations in which each key-frame specifies a part of the overall gesture movement pattern describing the current state of each joint. Speed information for the interpolation between every two key-frames and the corresponding affiliation to parts of speech is obtained from assigned time identifiers. Key-frame animations in ACE can be defined either manually or derived from motion capture data of a human demonstrator, allowing the animation of virtual agents in real-time. In our present work we focus on the generation of feature-based utterance descriptions, although key-frame animations – and therewith captured human motion – can also be realized on the robot using the same interface.
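The following sketch summarizes, in simplified Python data structures, the kind of information the two specification styles carry (feature-based stroke descriptions versus key-frame sequences, both linked to speech via time identifiers). The layout is a deliberately simplified stand-in and does not reproduce the actual MURML schema or ACE's internal data model.

```python
# Illustrative stand-in for the two utterance specification styles supported via MURML.

from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class FeatureBasedGesture:
    affiliate_start: str                 # time identifier linking to a word in the utterance
    affiliate_end: str
    stroke_features: Dict[str, str]      # outer form of the stroke, e.g. hand shape, wrist location

@dataclass
class KeyFrame:
    time_id: str                         # time identifier for affiliation and interpolation speed
    joint_angles: Dict[str, float]       # state of each joint at this key-frame (radians)

@dataclass
class UtteranceSpec:
    text_with_time_ids: str              # e.g. "And now take <t1> this bowl <t2>."
    feature_gestures: List[FeatureBasedGesture] = field(default_factory=list)
    keyframe_gestures: List[List[KeyFrame]] = field(default_factory=list)

spec = UtteranceSpec(
    text_with_time_ids="And now take <t1> this bowl <t2>.",
    feature_gestures=[FeatureBasedGesture(
        affiliate_start="t1", affiliate_end="t2",
        stroke_features={"hand_shape": "open", "wrist_location": "front-right", "palm": "up"},
    )],
)
print(spec)
```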

3.2 Gesture Motor Control

Gesture motor control is realized hierarchically in ACE: during higher-level planning, the motor planner is provided with timed form features as annotated in the MURML specification. This information is then passed on to independent motor control modules. The idea behind this functional-anatomical decomposition of motor control is to break down the complex control problem into solvable sub-problems. ACE [18] provides specific motor planning modules for the arms, the wrists, and the hands which, in turn, instantiate local motor programs (LMPs). These are used to animate the required sub-movements. LMPs operate within a limited set of DOF and over a designated period of time. For the motion of each limb, an abstract motor control program (MCP) coordinates and synchronizes the concurrently running LMPs, gearing towards an overall solution to the control problem. The top-level control of the ACE framework, however, does not attend to how such sub-movements are controlled. To ensure an effective interplay of the LMPs involved in an MCP, the planning modules arrange them in a controller network which defines their potential interdependencies for mutual (de-)activation. LMPs are able to transfer activation between themselves and their predecessors or successors to allow for context-dependent gesture transitions. Thus they can activate or deactivate themselves at run-time depending on feedback information on current movement conditions.
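A strongly reduced sketch of this decomposition is shown below: one MCP per limb coordinates several LMPs, and an LMP hands activation over to its successor when it finishes. The class names mirror the paper's terminology, but the control logic is an illustrative reduction, not ACE's actual implementation.

```python
# Simplified LMP/MCP sketch: activation transfer between local motor programs.

class LocalMotorProgram:
    def __init__(self, name, duration, successor=None):
        self.name = name
        self.duration = duration      # designated period of time (seconds)
        self.successor = successor    # LMP that may take over activation
        self.active = False

    def activate(self):
        self.active = True
        print(f"LMP '{self.name}' active for {self.duration:.2f}s")

    def finish(self):
        # Transfer activation to the successor, enabling fluent gesture transitions.
        self.active = False
        if self.successor is not None:
            self.successor.activate()

class MotorControlProgram:
    """Coordinates the concurrently running LMPs of one limb."""
    def __init__(self, limb, lmps):
        self.limb = limb
        self.lmps = lmps

    def run(self):
        self.lmps[0].activate()
        for lmp in self.lmps:
            lmp.finish()              # in reality triggered by movement feedback

stroke = LocalMotorProgram("stroke", 0.8)
prep = LocalMotorProgram("preparation", 0.5, successor=stroke)
MotorControlProgram("right_arm", [prep, stroke]).run()
```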

3.3 Speech Synthesis

Speech output is generated using the open-source text-to-speech synthesis system MARY (Modular Architecture for Research on speech sYnthesis) [39].


It features a modular design and an XML-based internal data representation. Numerous languages, including English and German, are supported. A number of settings allow for the adjustment of various voice features. For further details on MARY see [39].

3.4 On-line Scheduling of Multimodal Utterances

The concept underlying the multimodal production model acts on an empirically suggested assumption [24] referred to as a segmentation hypothesis [18]. It claims that the production of continuous speech and gesture is organized in successive segments. Each of these segments represents a single idea unit referred to as a chunk of speech-gesture production. A chunk, in turn, consists of an intonation phrase and a co-expressive gesture phrase, concertedly conveying a prominent concept. Levelt [21] defines intonation phrases as the units over which the phonological structure of continuous speech is organized. With respect to gestures, Kendon [14] describes gesture phrases as units of gestural movement comprising one or more subsequent phases: preparation, stroke, retraction, hold.

Accordingly, in our model the incremental production of successive coherent chunks is realized by processing each chunk on a separate 'blackboard' running through a sequence of states (Fig. 4). The timing of gestures is achieved on-line by the ACE engine as follows. Within a chunk, synchrony is generally achieved by adapting the gesture to the structure and timing of speech. To do this, the ACE scheduler retrieves timing information about the synthetic speech at the millisecond level and defines the start and the end of the gesture stroke accordingly. These temporal constraints are automatically propagated down to each single gesture component. A more detailed overview of the internal planning process within ACE can be found in [18].

Fig. 4 Blackboards running through a sequence of processing states for the incremental production of multimodal chunks [18].

The second aspect of scheduling, namely the decision to skip preparation or retraction phases, results from the interplay of motor programs at run-time. Motor programs monitor the body's current movement state and are autonomously activated to realize the planned gesture stroke as scheduled. Whenever the motor program of the following gesture takes over control of the effectors from the preceding program, the retraction phase turns into a transition into the next gesture. Such on-line scheduling results in fluent and continuous multimodal behavior.
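The within-chunk synchronization idea can be illustrated with a short sketch: the gesture stroke is anchored to the speech timing of its affiliated words, and the preparation phase is scheduled backwards from the stroke onset. The timings and the function are invented for illustration; ACE's actual scheduler is considerably richer.

```python
# Sketch: align the gesture stroke with the affiliated words' speech timing.

def schedule_chunk(word_timings_ms, affiliate_words, prep_duration_ms=400):
    """Return (prep_start, stroke_start, stroke_end) in milliseconds."""
    stroke_start = min(start for word, (start, _) in word_timings_ms.items()
                       if word in affiliate_words)
    stroke_end = max(end for word, (_, end) in word_timings_ms.items()
                     if word in affiliate_words)
    # Preparation is scheduled backwards from the stroke onset; at run-time it may
    # be skipped if the preceding gesture's motor program is still active.
    prep_start = max(0, stroke_start - prep_duration_ms)
    return prep_start, stroke_start, stroke_end

# Hypothetical word-level timings (start, end) in ms, aggregated from phoneme timing:
timings = {"please": (0, 320), "take": (330, 560), "the": (570, 650),
           "thermos": (660, 1080), "flask": (1090, 1450)}
print(schedule_chunk(timings, affiliate_words={"thermos", "flask"}))
# -> (260, 660, 1450): preparation starts before the affiliate, stroke spans it.
```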

4 Robot Control Architecture

In an effort to enable a humanoid robot to flexibly produce speech and co-verbal gesture at run-time, a given robot control architecture needs to combine the conceptual representation and planning provided by ACE with motor control primitives for speech and arm movements of the robot. This, however, poses a number of challenges, including the capacity to adequately account for certain physical properties, e.g. motor states, maximum joint velocity, strict self-collision avoidance, and variation in DOF. In light of ACE being originally designed for a virtual rather than a physical platform, these challenges must be met when transferring the ACE framework to the Honda humanoid robot, whose upper body comprises a torso with two 5 DOF arms and 1 DOF hands, as well as a 2 DOF head [10].

Although ACE provides movement descriptions in joint space to animate the body of a virtual agent, we only extract task space information when generating the corresponding robot trajectory. This allows us to circumvent the correspondence problem [29], which arises due to body dissimilarity when mapping movements from one agent's body to a different agent's body. The information obtained at the task space level includes the wrist position and orientation as well as the designated hand shape, which is forwarded to the robot motion control module to instantiate the actual robot movement. Problematically, given the small number of DOF in its hands, the humanoid robot is more limited in performing single finger movements than a virtual character. We counter this limitation by specifying three basic hand shapes that can be utilized by the robot. A variety of finger constellations derived from the ACE body model can then be mapped onto them. Hand gestures in which the hands are open or closed, as well as pointing gestures, are directly transferable. Any hand gesture employing more than the index finger is modeled using an open hand shape. Fig. 5 displays the three different hand shapes used for hand gesture generation on the Honda humanoid robot.
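The reduction of detailed finger configurations to the robot's three basic hand shapes can be expressed as a simple rule, sketched below. The thresholding rule is a plausible reading of the description above, not the exact mapping used in the implementation.

```python
# Sketch: collapse an ACE finger configuration onto the robot's three hand shapes.

def map_hand_shape(finger_extension):
    """finger_extension: dict finger name -> extension in [0.0 closed, 1.0 extended]."""
    extended = {f for f, e in finger_extension.items() if e > 0.5}
    if not extended:
        return "closed"
    if extended == {"index"}:
        return "pointing"
    # Any gesture employing more than the index finger is modeled with an open hand.
    return "open"

print(map_hand_shape({"thumb": 0.1, "index": 0.9, "middle": 0.1, "ring": 0.1, "pinky": 0.1}))  # pointing
print(map_hand_shape({"thumb": 0.8, "index": 0.9, "middle": 0.9, "ring": 0.8, "pinky": 0.7}))  # open
print(map_hand_shape({"thumb": 0.1, "index": 0.2, "middle": 0.1, "ring": 0.0, "pinky": 0.0}))  # closed
```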

The problem of inverse kinematics (IK) of the arm is solved on the velocity level using the robot's whole body motion (WBM) controller framework [6].


Fig. 5 Different hand shapes used for hand gesture generation on the Honda humanoid robot [36].

The WBM framework allows control of all DOF of the humanoid robot based on given end-effector targets, providing a flexible method for controlling upper body movement by selectively specifying only the relevant task dimensions in real-time, while still generating smooth and natural movement. Redundancies are optimized with regard to joint limit avoidance and self-collision avoidance. For more details on WBM control for the Honda humanoid robot see [6].

After solving the inverse kinematics for the internal body model provided for WBM control, the joint space description of the designated trajectory is applied to the real robot. A bi-directional interface using both efferent actuator control signals and afferent sensory feedback is used to monitor possible deviations of the actual robot motor states from the kinematic body model provided by ACE. This is realized by a feedback loop that updates the internal model of the robot in the WBM controller as well as the kinematic body model coupled to ACE at a sample rate r. This process synchronizes two competing sample rates so that successful integration can ensue: firstly, that of the ACE engine, and secondly, that of the WBM software controlling the robot. For this purpose, a number of alternative mapping strategies could be employed:

1. Sampling only at target positions: ACE sends only the end positions or orientations of movement segments and delegates the robot movement generation entirely to the robot's WBM controller;

2. Sampling at each n-th frame: ACE sends control parameters at a fixed rate to the robot's WBM controller;

3. Adaptive sampling rate: ACE "tethers" WBM using different sampling rates, ranging from one sample per frame to taking only the end positions, depending on the complexity of the trajectory.

If the trajectory is linear, we can expect that strategy 1 above would serve as the best mechanism, since only distance information would likely be required.

If, on the other hand, the trajectory is complex, we can expect that strategy 2 would be optimal, since a sequence of small movement vectors would likely be required to guide the robot controller. If, however, the gesture is formed from different types of sub-movements, as is possible in our framework (e.g. a linear trajectory for gesture preparation combined with a curved trajectory for the stroke), we can expect that the combined approach of strategy 3, using an adaptive sampling rate, would be optimal.

In our current set-up, we employ the second strategy with the maximal sampling rate, i.e. each successive frame of the movement trajectory is sampled and transmitted to the robot controller (n = 1). Given a frame rate of 20 frames per second (flexibly adjustable within ACE), this can result in a large number of sample points which, in turn, ensures that the robot closely follows the possibly complex trajectory planned by ACE. The results presented in the following section were obtained with this method. Alternatively, the third strategy would allow for adjusting the sampling rate depending on the trajectory's complexity, which may well vary from simple straight movements (e.g. for gesture preparation) to complex curved shapes for the gesture stroke phase. Whether or not this strategy leads to improved results for the generation of robot gesture in combination with ACE is a point of future investigation.
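The fixed-rate streaming used in the current set-up (strategy 2 with n = 1) essentially amounts to the loop sketched below: every frame of the planned wrist trajectory is forwarded to the WBM controller, and the robot's actual state is fed back to keep both internal body models in sync. All interfaces shown (ace.next_frame, wbm.track, wbm.read_state, ace.update_body_model) are hypothetical placeholders, not the real ACE or WBM APIs.

```python
# Sketch of fixed-rate task-space streaming with a state feedback loop.
import time

FRAME_RATE_HZ = 20                      # ACE frame rate used in the experiments
PERIOD = 1.0 / FRAME_RATE_HZ

def stream_gesture(ace, wbm, duration_s):
    """Forward task-space targets frame by frame and feed the robot state back."""
    n_frames = int(duration_s * FRAME_RATE_HZ)
    for _ in range(n_frames):
        target = ace.next_frame()       # wrist position/orientation + hand shape
        wbm.track(target)               # WBM controller solves IK on the velocity level
        state = wbm.read_state()        # afferent feedback: actual motor states
        ace.update_body_model(state)    # keep ACE's kinematic body model consistent
        time.sleep(PERIOD)              # fixed sampling period (50 ms at 20 fps)

# Minimal mock objects so the sketch can be executed stand-alone:
class _MockACE:
    def next_frame(self): return {"wrist": (0.3, 0.1, 0.9), "hand": "open"}
    def update_body_model(self, state): pass

class _MockWBM:
    def track(self, target): pass
    def read_state(self): return {"joints": {}}

stream_gesture(_MockACE(), _MockWBM(), duration_s=0.2)   # streams 4 frames
```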

A main advantage of our approach to robot control is the trajectory formulation in terms of effector targets and their respective orientations in task space. On this basis, it is fairly straightforward to derive a joint space description for the Honda humanoid robot by using the standard WBM controller. Alternatively, joint angle values could be extracted from ACE and directly mapped onto the robot body model. However, being a virtual agent application, ACE does not entirely account for physical restrictions such as collision avoidance, which may lead to joint states that are not feasible on the robot. Therefore, by solving the IK using the robot's internally implemented WBM controller, we ensure a safer generation of robot postures. Furthermore, studies in which subjects' gaze was eye-tracked while observing hand and arm movements provide evidence that humans mostly track the hand or end-point, even if the movement is performed with the entire arm [23]. Thus the form and meaning of a gesture can be conveyed even with a deviation from the original joint angles.

Having implemented an interface that couples ACE with the perceptuo-motor system of the Honda robot, the control architecture outlined in Fig. 6 is now used as the underlying action generation framework for the humanoid robot. It combines conceptual representation and planning with motor control primitives for speech as well as hand and arm movements of a physical robot body. Further details of the implementation are presented in [35] and [36].


Fig. 6 Robot control architecture for the realization of multimodal behavior.

5 Technical Results

Results were produced in a feed-forward manner whereby commands indicating the wrist position and hand orientation of the ACE body model were transmitted in real-time to the robot at a sample rate of 20 frames per second. Fig. 7 illustrates the multimodal output generated in our current framework using the MURML utterance presented in Fig. 3. The robot is shown next to a panel which displays the current state of the internal robot body model and the ACE kinematic body model, respectively, at each time step. In addition, the speech output is transcribed to illustrate the words spanning different segments of the gesture movement sequence, indicating the temporal synchrony achieved between the two modalities. The physical robot is able to perform a generated gesture fairly accurately, but with some inertial delay compared to the internal ACE model. This observation is supported by Fig. 8, in which each dimension of the wrist position for the ACE body model and the robot is plotted against time. Further results illustrating the difference in motion speed between the two platforms, as observed during the performance of various gestures, are presented in [35].

Despite the general limitation in motion speed, these findings substantiate the feasibility of the proposed approach. Arbitrary MURML-based speech-gesture representations – as well as key-frame animation descriptions of gestures, optionally derived from human motion capture data – can be realized using the current framework. Extensive tests with various gesture representations (including both one-armed and two-armed movements) performed on the robot further revealed that neglecting the joint angle information generated in ACE does not impair the overall shape of a gesture. Hence, controlling the robot via task space commands turns out to be an adequate and safe way to generate arm movements for the robot.

Although Fig. 7 suggests acceptable temporal synchrony between both output modalities, the synchronization of speech and gesture does not yet appear to be optimal. Tests using long sentences as well as utterances with the speech affiliate situated at the beginning of the sentence revealed that movement generation tends to lag behind the spoken language output. Consequently, we need to explore ways to handle the difference in time required by the robot's physically constrained body in comparison to the kinematic body model in ACE.


Fig. 7 Example of a multimodal utterance realized with the current framework from the specification given in Fig. 3; for comparison, the physical robot, the internal robot body model, and the kinematic ACE body model are shown (left to right, top-down, sampled every four frames (0.16 s)) [36].

Fig. 8 Plots of the x-, y- and z-coordinates of the wrist positions of the ACE body model (solid) and the physical robot (dotted) during gesture execution [36].

Our idea for future work is to tackle this challenge by extending the cross-modal adaptation mechanisms provided by ACE with a more flexible multimodal utterance scheduler. This will allow for a finer mutual adaptation between robot gesture and speech. In the current implementation, the ACE engine achieves synchrony within a chunk mainly by adapting the gesture to the structure and timing of speech, obtaining absolute gesture time information at the phoneme level. Improved synchronization requires the incorporation of a forward model to predict the estimated time needed by the robot for gesture preparation. Additionally, the predicted values must be checked at run-time and, if necessary, adjusted based on constantly updated feedback information on the robot state.
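As a toy illustration of the forward-model idea, the sketch below estimates the robot's preparation time from the task-space distance to the stroke start and derives a delay for the speech onset. The velocity-limit model and all numbers are illustrative assumptions, not measurements from the Honda robot or part of the proposed scheduler.

```python
# Toy forward model: predict preparation time and shift speech onset accordingly.
import math

MAX_WRIST_SPEED = 0.6      # assumed task-space speed limit in m/s
LATENCY = 0.08             # assumed fixed actuation/communication latency in s

def predict_preparation_time(current_wrist, stroke_start_wrist):
    """Predict seconds needed to move the wrist to the stroke start position."""
    distance = math.dist(current_wrist, stroke_start_wrist)
    return LATENCY + distance / MAX_WRIST_SPEED

def adjusted_speech_delay(planned_prep_time, current_wrist, stroke_start_wrist):
    """Delay speech onset if the robot needs longer than originally planned."""
    predicted = predict_preparation_time(current_wrist, stroke_start_wrist)
    return max(0.0, predicted - planned_prep_time)

# Resting hand at the side vs. stroke start in front of the chest:
delay = adjusted_speech_delay(0.4, (0.0, -0.2, 0.8), (0.3, 0.1, 1.1))
print(f"delay speech by {delay:.2f} s")
```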

6 Empirical Evaluation: Unimodal versus Multimodal Robot Behavior in HRI

In order to gain a deeper understanding of how communicative robot gesture may impact and shape human experience in human-robot interaction (HRI), we conducted a between-subjects experimental study using the Honda humanoid robot. For this purpose, we designed a suitable scenario for gesture-based HRI and identified benchmarks to empirically evaluate the developed framework. The study scenario comprised a joint task that was to be performed by a human participant in collaboration with the Honda humanoid robot.


ison of participants’ reported experiences. In the giventask, the robot referred to various objects by utilizingeither unimodal (speech only) or multimodal (speechand gesture) utterances, based on which the partici-pant was expected to perceive, interpret and performan according action.

6.1 Hypothesis

Based on findings from gesture research in human-human as well as human-agent interaction, we developed the following hypothesis for gesture-based human-robot interaction:

Subjects who are presented with multimodal instructions by the robot (using speech and gesture) will evaluate the robot more positively than those who are presented with unimodal information by the robot (using only speech).

6.2 Experimental Design

The experiment was set in a kitchen environment in which the humanoid played the role of a household robot. Participants were told that they were helping a friend move house and were tasked with emptying a cardboard box of kitchen items, each of which had to be placed in its designated location. The box contained nine kitchen items whose storage placement is not typically known a priori (unlike plates, for example, which are usually piled on top of each other). Specifically, they comprised a thermos flask, a sieve, a ladle, a vase, an eggcup, two differently shaped chopping boards and two differently sized bowls. The cardboard box containing the kitchen items used in the experiment is displayed in Fig. 9.

Fig. 9 Cardboard box containing the kitchen items used in the experimental study.

Fig. 10 Sketch of the experimental set-up in the lab. Legend: R = robot, P = participant, KC = kitchen cupboard, T = table, B = box with kitchen items, C1-C3 = video cameras; a floor label marks the minimum distance (approx. 1 meter) from the robot, and curtains separate the operator area from the lab.

The objects were to be removed from the box and arranged in a pair of kitchen cupboards (an upper and a lower cupboard with two drawers). For this, the participant was allowed to move freely in the area in front of the robot, typically walking between the cardboard box with the items and the kitchen cupboards. Given the participant's non-familiarity with the friend's kitchen environment, the robot assisted the human with the task by providing information on where each item belongs. A table situated beside the kitchen cupboard was provided for the case that the participant did not understand where an item had to be placed. A sketch of the experimental set-up is shown in Fig. 10.

Conditions

We manipulated the robot’s non-verbal behavior in threeexperimental conditions:

– In Condition 1, the unimodal (speech-only) condition, the robot presented the participant solely with a set of nine verbal instructions to explain where each object should be placed. The robot did not move its body during the whole interaction; no gesture or gaze behaviors were displayed.

– In Condition 2, the congruent multimodal (speech-gesture) condition, the robot presented the participant with the identical set of nine verbal instructions used in condition 1. In addition, the instructions were accompanied by a total of 21 corresponding gestures explaining where each object should be placed. Speech and gesture were semantically matching, e.g. the robot said "put it up there" and pointed up. Simple gaze behavior supporting the hand and arm gestures (e.g. looking right when pointing right) was displayed during the interaction.

– In Condition 3, the incongruent multimodal (speech-gesture) condition, the robot again presented the participant with the identical set of nine verbal instructions used in condition 1.


Fig. 11 Example of a multimodal two-chunk utterance delivered by the robot during interaction.

Fig. 12 Example of a multimodal three-chunk utterance delivered by the robot during interaction. Three different types of gesture are used (left to right): an iconic gesture illustrating the shape of the vase; a pantomimic gesture conveying the act of opening the cupboard; a deictic gesture pointing at the designated position [37].

As in condition 2, these instructions were accompanied by a total of 21 gestures, of which 47.6% semantically matched the verbal instruction, while the remaining 52.4% of gestures were semantically non-matching, e.g. the robot occasionally said "put it up there" but pointed downwards. The reason for combining semantically non-matching gestures with matching ones in this condition was to avoid a complete loss of the robot's credibility after a few utterances. Simple gaze behavior supporting the hand and arm gestures (e.g. looking right when pointing right) was displayed during the interaction.

Verbal Utterances

In order to keep the task solvable under all three conditions, we decided to design the spoken utterances in a self-sufficient way, i.e. the gestures used in the multimodal conditions contained redundant information that was also conveyed via speech.

Each instruction presented by the robot typically consisted of two or three continuously connected utterance chunks. Based on the definition provided in [18], each chunk refers to a single idea unit represented by an intonation phrase and, optionally in a multimodal utterance, by an additional co-expressive gesture phrase. The verbal utterance chunks were based on the following syntax:

– Two-chunk utterance:
  <Please take the [object]>
  <and place it [position+location].>
  Example: Please take the thermos flask and place it on the right side of the upper cupboard.

– Three-chunk utterance:
  <Please take the [object],>
  <then open the [location]>
  <and place it [position].>
  Example: Please take the eggcup, then open the right drawer and place it inside.

Examples of a multimodal two-chunk and a three-chunk utterance delivered by the robot are illustrated in Fig. 11 and Fig. 12, respectively.
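The fixed instruction syntax above can be thought of as two sentence templates filled with object, position and location slots, as the short sketch below illustrates. The helper function and the item dictionaries are examples constructed from the scenario description, not part of the study software.

```python
# Sketch: fill the two-chunk / three-chunk instruction templates from slot values.

TWO_CHUNK = "Please take the {object} and place it {position} of the {location}."
THREE_CHUNK = "Please take the {object}, then open the {location} and place it {position}."

def build_instruction(item):
    template = THREE_CHUNK if item.get("needs_opening") else TWO_CHUNK
    return template.format(**item)

print(build_instruction({"object": "thermos flask",
                         "position": "on the right side",
                         "location": "upper cupboard"}))
print(build_instruction({"object": "eggcup",
                         "position": "inside",
                         "location": "right drawer",
                         "needs_opening": True}))
```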


Table 1 Dependent measures used to evaluate the quality of presentation.

Measure | Questionnaire Item | Scale
Gesture Quantity | "The amount of gestures performed by the robot were..." | 1 = too few, 5 = too many
Gesture Speed | "The execution of gestures was..." | 1 = too slow, 5 = too fast
Gesture Fluidity | "The execution of hand and arm movements was fluid." | 1 = not appropriate, 5 = very appropriate
Speech-Gesture Content | "The robot's speech and gesture were semantically matching (content)." | 1 = not appropriate, 5 = very appropriate
Speech-Gesture Timing | "The robot's speech and gesture were well synchronized (timing)." | 1 = not appropriate, 5 = very appropriate
Naturalness | "The combined use of speech and gesture appeared..." | 1 = artificial, 5 = natural

Gestures

In the multimodal conditions, the robot used three different types of gesture along with speech to indicate the designated placement of each item:

– Deictic gestures, e.g. to indicate positions and locations

– Iconic gestures, e.g. to illustrate the shape or size of objects

– Pantomimic gestures, e.g. hand movements imitating the use of a ladle or the opening of cupboard doors

Examples of the three gesture types are displayed in Fig. 12.

Robot control and behavior

During the study, the Honda humanoid robot was partly controlled using a Wizard-of-Oz technique to ensure minimal variability in the experimental procedure. The experiment room was partitioned with a curtain such that the robot and the kitchen environment were located at one end and the wizard operating the control computer was located at the other end, outside the participant's field of view.

The robot's speech was identical across conditions. It was generated using the text-to-speech system MARY [39] set to a neutral voice. To avoid uncertainties, neither speech recognition nor active vision was used during the experiment. Instead, the experimenter initiated the robot's interaction behavior from a fixed sequence of pre-determined utterances. Once triggered, a given utterance was generated autonomously at run-time. The ordering and generation of this sequence remained identical across conditions and experimental runs.

The robot delivered each two-chunk or three-chunk instructional utterance as a single one-shot expression without any significant breaks in the delivery process. Successive chunks indicating object, position and location were delivered contiguously in the manner of natural speech. Moreover, in the co-verbal gesture conditions, the gestures were confluent with the utterance process.

Participants were instructed to indicate when they had finished placing an item and were ready for the following item by saying "next".

6.3 Dependent Measures

Based on the participants' answers to a post-experiment questionnaire using a five-point Likert scale for each item, we investigated two main aspects of the reported interaction experience: firstly, the perceived quality of presentation was measured using six questionnaire items; secondly, the perception of the robot was assessed based on eight characteristics covered by additional questionnaire items. Tables 1 and 2 give an overview of the dependent measures, questionnaire items and scales used for each evaluation category.

Table 2 Dependent measures used to evaluate the perception of the robot.

Measure | Questionnaire Item | Scale
sympathetic, competent, lively, active, engaged, friendly, communicative, fun-loving | "Please assess to which extent the following characteristics apply to the robot: [...]" | 1 = not appropriate, 5 = very appropriate

6.4 Participation

A total of 60 subjects (30 female, 30 male) participated in the experiment, ranging in age from 20 to 62 years (M = 31.12, SD = 10.21). All subjects were native German speakers who were recruited at Bielefeld University and had never before participated in a study involving robots. Based on five-point Likert scale ratings (1 = very little, 5 = very much), participants were identified as having negligible experience with robots (M = 1.22, SD = 0.45), while their computer and technology know-how was moderate (M = 3.72, SD = 0.90).


Participants were randomly assigned to one of the three experimental conditions (i.e. 20 participants per condition), while maintaining gender- and age-balanced distributions.

6.5 Experimental Procedure

Participants were first given a brief written scenario and task description to read outside the experimental lab. They were then brought into the experiment room, where the experimenter verbally reiterated the task description to ensure the participants' familiarity with it. Participants were given the opportunity to ask any clarifying questions. The experimenter then left the participant to begin the interaction with the robot. At the beginning of the experiment, the robot greeted the participant and gave a verbal introduction to the task. It then presented the participant with individual utterances as described in the experimental design, each of which was triggered by the experimenter sitting at a control terminal. The participant attempted to follow the uttered instructions by placing each item into its designated location. At the end of the interaction, the robot thanked the participant for helping and bid them farewell.

In the unimodal (speech-only) condition, all utterances including the greeting and farewell were presented verbally; in the multimodal (speech-gesture) conditions, all utterances including the greeting and farewell were accompanied by co-verbal gestures.

After completing the task, participants filled out a post-experiment questionnaire that recorded their demographic background and, based on a five-point Likert scale, measured their affective state, their evaluation of the task and interaction, and their perception of the robot. Upon completion of the questionnaire, the participants were debriefed and received a chocolate bar as a thank-you. The questionnaire data was collated and analyzed; the results are presented and discussed in the following.

6.6 Results and Discussion

The questionnaire data was analyzed regarding the effect of the experimental conditions on the assessment of presentation quality and robot perception.

Quality of Presentation

We investigated the perceived quality of presentation with regard to gesture, speech, and content. Mean values and standard deviations are summarized in Table 3.

Table 3 Mean values for the rating of presentation quality in the three conditions (standard deviations in parentheses).

Measure | Cond. 1: Unimodal | Cond. 2: Congruent | Cond. 3: Incongruent
Gesture Quantity | 1.90 (.99) | 2.80 (.62) | 3.00 (.56)
Gesture Speed | – | 2.85 (.37) | 2.95 (.22)
Gesture Fluidity | – | 3.25 (.97) | 3.95 (1.05)
Speech-Gesture Content | – | 3.65 (1.04) | 3.30 (1.26)
Speech-Gesture Timing | – | 3.90 (.79) | 4.05 (1.10)
Naturalness | – | 3.20 (1.06) | 3.30 (1.13)

Note that for condition 1 (unimodal) only gesture quantity was measured, since participants in this condition were not presented with any non-verbal behavior by the robot and thus could not rate the quality of the robot's gestures.

With regard to gesture quantity, the overall mean value for the two gesture conditions was M = 2.90 (SD = 0.59). This means that participants were quite satisfied with the gesture rate. For the unimodal condition, participants rated gesture quantity as rather low (M = 1.90, SD = 0.99), which can be attributed to the lack of non-verbal behavior displayed by the robot.

For the multimodal conditions, gesture quality was further measured based on five attributes (overall mean values and standard deviations for the two gesture conditions in parentheses): gesture speed (M = 2.90, SD = 0.30), gesture fluidity (M = 3.60, SD = 1.06), semantic matching of speech and gesture (M = 3.48, SD = 1.14), temporal matching of speech and gesture (M = 3.97, SD = 0.95), and naturalness (M = 3.25, SD = 1.08). In both gesture conditions, the five quality attributes were rated with mean values between 2.8 and 4.1 on five-point Likert scales, indicating that participants were generally satisfied with the quality of the gestures performed by the robot.

Perception of the Robot

We assessed how the humanoid robot was perceived by participants using eight characteristics. To test our hypothesis, we conducted independent-samples t-tests with 95% confidence intervals as follows: first, we compared questionnaire data from condition 1 with condition 2 (unimodal vs. congruent multimodal); second, we compared data from condition 1 with condition 3 (unimodal vs. incongruent multimodal). Mean values for the robot perception scales in the three different conditions are listed together with their standard deviations in Table 4 and are visualized in Fig. 13. Items showing statistically significant effects in the multimodal gesture conditions compared to the unimodal speech-only condition are marked with an asterisk (*).
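For readers less familiar with this analysis step, the sketch below shows an independent-samples t-test comparing per-participant ratings of one characteristic between two conditions (20 participants each). The rating arrays are fabricated placeholders for illustration only; they are not the study data.

```python
# Sketch: independent-samples t-test on Likert ratings of one robot characteristic.
from scipy import stats

# Hypothetical 5-point Likert ratings of 'lively' (20 participants per condition):
unimodal   = [2, 3, 2, 3, 2, 1, 3, 2, 3, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 3]
multimodal = [3, 4, 3, 3, 4, 2, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 2, 4, 3, 3]

t, p = stats.ttest_ind(unimodal, multimodal)   # two-sided, equal variances assumed
print(f"t({len(unimodal) + len(multimodal) - 2}) = {t:.2f}, p = {p:.3f}")
```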


Table 4  Mean values for the rating of robot perception in the three conditions based on a 5-point Likert scale (standard deviations in parentheses); + = p ≤ 0.10, * = p ≤ 0.05, ** = p ≤ 0.01, *** = p ≤ 0.001.

                   Cond. 1:       Cond. 2:        Cond. 3:
                   Unimodal       Congruent       Incongruent
  sympathetic      3.60 (1.05)    4.20 (.95)+     4.15 (1.09)
  competent        3.85 (.93)     4.26 (.87)      3.75 (1.16)
  lively           2.52 (.84)     3.12 (.97)*     3.32 (.76)**
  active           2.35 (.88)     3.20 (1.11)**   3.45 (.76)***
  engaged          3.25 (1.29)    3.60 (1.35)     4.15 (.88)*
  friendly         4.15 (1.04)    4.35 (1.31)     4.60 (.68)
  communicative    3.00 (1.08)    3.15 (1.31)     3.60 (1.05)+
  fun-loving       1.95 (.83)     2.65 (1.23)*    2.70 (1.30)*

[Fig. 13 appears here: a grouped bar chart plotting the mean value (1–5) of each dependent measure (sympathetic, competent, lively, active, engaged, friendly, communicative, fun-loving) for the three conditions (Unimodal, Congruent Multimodal, Incongruent Multimodal), annotated with the significance levels reported in Table 4 and in the text below.]

Fig. 13 Mean values of the dependent measures rating participants’ perception of the robot.

On average, all qualities were rated higher, i.e. more positively, in the multimodal gesture conditions, with the lower mean value for ‘competent’ in the incongruent speech-gesture condition being the only exception. These results support our hypothesis and suggest that the inclusion of gestural behavior casts the robot in a more positive light than the speech-only condition does.

Comparing condition 1 (unimodal) with condition 2 (congruent multimodal), the three characteristics ‘lively’ (t(38) = -2.09, p = 0.044), ‘active’ (t(38) = -2.70, p = 0.01), and ‘fun-loving’ (t(38) = -2.12, p = 0.041) are rated significantly higher in the congruent with-gesture condition than in the unimodal condition using speech only. In addition, the comparison of the characteristic ‘sympathetic’ between conditions 1 and 2 is significant at the 10% level (t(38) = -1.90, p = 0.065), with higher mean values in the congruent multimodal condition.

When comparing condition 1 (unimodal) with condition 3 (incongruent multimodal), the four characteristics ‘lively’ (t(38) = -3.17, p = 0.003), ‘active’ (t(38) = -4.25, p < 0.001), ‘engaged’ (t(38) = -2.58, p = 0.014), and ‘fun-loving’ (t(32.16) = -2.18, p = 0.037) are rated significantly higher in the multimodal condition. In addition, the comparison of the characteristic ‘communicative’ between conditions 1 and 3 shows a significant effect at the 10% level (t(38) = -1.79, p = 0.082), with higher mean values in the incongruent multimodal condition.

An additional comparison of data from condition 2 with condition 3 (congruent vs. incongruent multimodal) showed no significant effect of experimental condition. However, with the exception of the dependent measures ‘sympathetic’ and ‘competent’, our analyses indicated a trend towards higher mean values in the incongruent multimodal condition.

The significantly higher ratings of ‘lively’ and ‘active’ in the two multimodal conditions can be attributed to the robot’s gestural movements, since the robot appears comparatively stiff in the speech-only condition. The ratings of the characteristics ‘fun-loving’, ‘engaged’, ‘sympathetic’, and ‘communicative’ suggest that human-like non-verbal behaviors including gestures actually trigger a more positive response in the human participant. The results further reveal that even a robot that occasionally makes incorrect gestures is still evaluated more favorably than one that performs no hand and arm gestures at all. In fact, on average the robot is evaluated as more lively, active, engaged, friendly, communicative, and fun-loving in the incongruent speech-gesture condition than in the congruent condition. This suggests that a robot’s non-verbal communicative behavior can trigger an even stronger positive response in the human participant when it is not ‘perfect’. Overall, the results demonstrate that co-verbal gestures performed by a humanoid robot lead to an enhanced human-robot interaction experience, i.e. the robot is generally rated more positively when it displays non-verbal behaviors. These findings support our approach of endowing social robots with communicative gestural behavior.

7 Conclusion and Future Work

We presented a robot control architecture which enables the Honda humanoid robot to generate gestures and synchronized speech at run-time, while not being limited to a pre-defined repertoire of motor actions. The present framework builds upon a speech and gesture production model for virtual human agents. Representing a sophisticated multimodal scheduler, the Articulated Communicator Engine (ACE) allows for the on-line production of flexibly planned behavior representations. Our framework combines conceptual, XML-based representation and planning with motor control primitives for speech and arm movements.
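To give a flavor of what such a conceptual, XML-based multimodal specification might look like, here is a small Python sketch; the element and attribute names are hypothetical (loosely MURML-inspired, not the framework’s actual schema), and the utterance content is invented for illustration.

```python
# Hypothetical sketch (illustrative element names, not the actual MURML
# schema): a conceptual multimodal utterance specification in which a
# gesture stroke is anchored to a time marker inside the spoken text.
import xml.etree.ElementTree as ET

utterance = ET.Element("utterance")

speech = ET.SubElement(utterance, "speech")
ET.SubElement(speech, "text").text = "This object goes into the drawer."
ET.SubElement(speech, "mark", id="t1", after="drawer")  # anchor point in speech

gesture = ET.SubElement(utterance, "gesture", affiliate="t1", hand="right")
ET.SubElement(gesture, "handshape", value="open-palm")
ET.SubElement(gesture, "pointing", target="drawer-location")

# The conceptual layer would hand such a specification to the motor layer,
# which maps it onto arm motion primitives and synthesized speech.
print(ET.tostring(utterance, encoding="unicode"))
```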

Meeting strict temporal synchrony constraints will remain a central challenge for our framework in the future. Evidently, the generation of finely synchronized multimodal utterances proves more demanding when realized on a robot with a physically constrained body than for an animated virtual agent, especially when communicative signals must be produced at run-time. Currently, ACE achieves synchrony mainly by adapting the gesture to the structure and timing of speech, obtaining absolute time information at the phoneme level. To tackle this new dimension of requirements, however, the cross-modal adaptation mechanisms applied in ACE have to be extended to allow for a finer mutual adaptation between robot gesture and speech. For this, afferent feedback provided by our robot control architecture needs to be integrated into a more sophisticated scheduler.
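A minimal sketch of the adaptation principle described above, under stated assumptions (all names, data structures, and numbers are hypothetical; this is not the ACE scheduler): word onsets derived from phoneme-level TTS timing determine when the gesture stroke must occur, and a measured (afferent) preparation duration could be fed back to refine the next scheduling estimate.

```python
# Hypothetical sketch of phoneme-level speech/gesture synchronization;
# not the ACE scheduler, just an illustration of the principle.
from dataclasses import dataclass

@dataclass
class WordTiming:
    word: str
    onset: float   # seconds, from TTS phoneme-level timing
    offset: float

def schedule_stroke(words: list[WordTiming], affiliate: str,
                    estimated_preparation: float) -> tuple[float, float]:
    """Return (preparation_start, stroke_start) so that the gesture
    stroke coincides with the onset of its affiliated word."""
    stroke_start = next(w.onset for w in words if w.word == affiliate)
    return max(0.0, stroke_start - estimated_preparation), stroke_start

def update_estimate(old_estimate: float, measured: float, gain: float = 0.5) -> float:
    # Afferent feedback: blend the measured duration of the last executed
    # preparation phase into the estimate, so timing improves over trials.
    return (1.0 - gain) * old_estimate + gain * measured

words = [WordTiming("this", 0.00, 0.18), WordTiming("drawer", 0.95, 1.40)]
prep_start, stroke = schedule_stroke(words, affiliate="drawer",
                                     estimated_preparation=0.6)
```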

In order to investigate how humans perceive representational hand and arm gestures performed by the robot during a task-related interaction, we evaluated our technical framework in an experimental study using the Honda humanoid robot. Our findings reveal that the robot is perceived and evaluated more positively when it displays non-verbal behaviors in the form of co-verbal gestures along with speech. This also holds for hand and arm gestures that do not semantically match the information content conveyed via speech, suggesting that a humanoid robot that generates gestures, even if some of them are semantically ‘incorrect’, is still evaluated more favorably than one that performs no gestures at all. In fact, on average the robot is evaluated as more lively, active, engaged, friendly, communicative, and fun-loving in the incongruent speech-gesture condition than in the congruent condition. This suggests that the robot’s non-verbal communicative behavior triggers a stronger positive response in the human participant when it is not ‘perfect’ and thus potentially less predictable. These implications should be further elucidated in subsequent studies to point the way for future social robotics research dedicated to the design of acceptable behaviors for artificial communicators.

In the study presented, the robot’s gaze behavior was modeled in a very simplistic way in the multimodal conditions; robot gaze in the speech-only condition was static throughout the interaction. These design choices were made deliberately to direct the participants’ attention to the hand and arm movements performed by the robot in the speech-gesture conditions. As a consequence, however, the robot’s gaze behavior did not appear very natural during the interaction, since the robot did not follow the human interaction partner with its gaze. In future studies, it will be desirable to investigate the impact and interaction of the robot’s gaze in combination with gestural hand and arm movements.

Despite these limitations, our results nonetheless suggest that a robot presenting social cues in the form of co-verbal hand and arm gestures, as generated with our framework, is perceived more positively than a robot whose sole means of communication is limited to a single modality, namely speech. These findings contribute to an advancement of human-robot interaction and give new insights into human perception and understanding of gestural machine behaviors. Specifically, they shed light on how humans perceive and interpret utterances in relation to different communication modalities. Our findings suggest that human-like behavior in a humanoid robot has a positive impact on the way humans perceive the robot in an interaction. Ultimately, these results will help us design and build better artificial communicators in the future.

Acknowledgements  The work described was supported by the Honda Research Institute Europe.


8 Vitae/Biography

Maha Salem is a Ph.D. student at the Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, Germany. Her research interests lie in the field of social human-robot interaction, with special focus on multimodal interaction and non-verbal expressiveness in humanoid robots.

Stefan Kopp is director of the Sociable Agents Group of the Center of Excellence ‘Cognitive Interaction Technology’ (CITEC) at Bielefeld University, Germany. He is heading projects in the Collaborative Research Center ‘Alignment in Communication’ (CRC 673) as well as the CoR-Lab.

Ipke Wachsmuth is a professor of Artificial Intelligence at Bielefeld University, Germany. He is co-initiator and coordinator of the Collaborative Research Center ‘Alignment in Communication’ (CRC 673).

Katharina Rohlfing received her Ph.D. in Linguistics from Bielefeld University, Germany, in 2002. Since 2008, she has been head of the Emergentist Semantics Group within the Center of Excellence ‘Cognitive Interaction Technology’ (CITEC), Bielefeld University, Germany. She is interested in multimodal learning processes.

Frank Joublin is Principal Scientist at the Honda Research Institute Europe, Offenbach, Germany, and since 2008 a board member of the CoR-Lab Graduate School, Bielefeld University, Germany.