
Fluency and Embodiment for Robots Acting with Humans

Guy Hoffman

October 27, 2006

Thesis Proposal for the Degree of Doctor of Philosophy at the

Massachusetts Institute of Technology

Thesis Advisor:

Cynthia Breazeal
Associate Professor of Media Arts and Sciences

Massachusetts Institute of Technology

Thesis Reader:

Alex Pentland
Professor of Media Arts and Sciences

Massachusetts Institute of Technology

Thesis Reader:

Lawrence Barsalou
Professor of Psychology

Emory University


Contents

1 Introduction
  Aim
  State of the Art
  Approach and Contributions
  Implementations

2 Theoretical Background
  Embodied Cognition
  Joint Action
  Acting Method
  Related Work: Synthetic Theatrical Agents

3 Prior Work

4 Research Architecture
  Perception-Based Memory
  Action-Perception Activation Networks
  Prediction and Anticipated Action
  Theatrical Scene Analysis

5 Research Derivatives

6 Demonstrations
  Human-robot Teamwork
  Human-robot Theatrical Performance
  Interim Demonstrations

7 Resources

8 Timeline

References


1 Introduction

The thesis proposed herein is concerned with the formulation of a physically grounded cognitive architecture for robotic agents, enabling them to perform[1] fluidly with humans.

Aim

The central goal of the proposed work is to steer away from the stop-and-go rigidity present in virtually all human-robot interaction to date. It aims to investigate the notion of fluent interaction, and to uncover some of the mechanisms that allow two humans to display an impressive level of coordination, flexibility, and adaptation. When humans perform together, and even more so when they are well-accustomed to the task and to each other, their timing is precise and efficient, they alter their plans and actions appropriately and dynamically, and this behavior often emerges without exchanging much verbal information. I hope to uncover and emulate some of these phenomena in human-robot joint action.

Given the above, a natural second goal of this work is to investigate the function of nonverbal behavior in joint action, and its implications and utility for human-robot joint action. I argue that the emergence of nonverbal behavior from its role in the cognitive functioning of teamwork could lead to its use in communicating tasks, common ground, and timing in a fluid human-robot team.

A third interest of the proposed work is to model the effect of practice on human-robot joint action, and similarly that of rehearsal on a human-robot joint performance. It is likely that the mechanisms that govern the passage of cognitive capabilities from a deliberate and flexible, yet comparatively slower, system to a faster, sub-intentional, and more rigid one are crucial to fluent joint action in well-rehearsed teams and ensembles.

State of the Art

Robots acting together with humans can be classified according to their degree of autonomy, ranging from full autonomy, through goal-based (mixed initiative) operation, via waypoint and heuristic specification, and down to full teleoperation [Goodrich et al., 2001]. This work is specifically concerned with an architecture supporting a fully or nearly autonomous robot, rather than with another area of research also coined "human-robot collaboration", but which is concerned with questions of control of a remote or local machine. Also, while some of the conclusions from this thesis (for example — anticipatory action and practice) could well be incorporated into robotic teleoperation research, such as that of [Fong et al., 2003], I am mainly concerned with the emergence of fluent interaction between a human and a robot at a shared location, making use of the co-located partners' nonverbal behavior to achieve joint action fluency.

[1] 'perform' is intentionally left vague, denoting both the execution of actions and the undertaking of theatrical performance.


Within this definition, most interactive robotic systems to date restrict their joint action with a human counterpart to a rather slow and heavily turn-based framework, making the pursuit of fluency a worthwhile endeavor.

For example, HERMES is a recent humanoid service robot for office environments [Bischoff and Graefe, 2004]. Its interaction with humans suffers from very structured turn-taking requirements including many pauses. Nonverbal behavior is practically absent on both the perceptual and the generative side. In other systems, even when the importance of a nonverbal channel of interaction is taken into account (such as [Sidner et al., 2003, Sidner et al., 2006]), the dialog between the human and the robot is piecewise and feels more delayed than fluent.

The systems described are also mainly concerned with human-robot dialog, and many similar systems abound (and are usually afflicted by comparable flaws). Robots that act jointly with humans, however, are few and far between.

Shared-location collaborative robots often include a robotic arm sitting vis-a-vis a human solving a task [Kimura et al., 1999]. In a recent case, a robot assists a human in an assembly task, and intervenes in one of three ways: taking over for the human, disambiguating a situation, or executing an action simultaneously with a human. This happens when the human asks for assistance or seems in need of help based on perceptual input [Sakita et al., 2004]. While relying on some nonverbal symbols, the result is still an interaction which is slow, heavily delayed, and stepwise.

More recently, humans and robots have interacted in a soccer game setting, but these are very early efforts, in which the robot is given very little autonomy compared to the human team member [Argall et al., 2006].

In the field of robots for theater performance, most work has dealt with fully automated or extremely simple behavior on one end of the spectrum (for a good review, see [Dixon, 2004]) or fully teleoperated robots, such as the recent production of "Heddatron" [Les Freres Corbusier, 2006], on the other. Throughout the gamut, the focus of the work is the conceptual relationship between man and machine, rather than the interaction between an autonomous robot and a human. In other work, robots have been pitted against each other on stage without the inclusion of a human scene partner [Bruce et al., 2000], but it is safe to say that fluent theatrical dialog between an autonomous robot and a human scene partner is still an unattained aim.

Approach and Contributions

This thesis argues that a core shortfall of existing systems is their inherent separation between perception, cognition, and action, whereas it is becoming increasingly clear that human perception and action are not mere input and output channels to an abstract symbol processor or rule-generating engine, but that instead thought, memory, concepts, and language are inherently grounded in our physical presence [Lakoff and Johnson, 1999, Barsalou, 1999, Wilson, 2001, Pecher and Zwaan, 2005].


Therefore, drawing from a growing body of knowledge in psychology, neuroscience, and theater acting theory, the cognitive robotic architecture presented herein diverges from much of the traditional work in artificial intelligence in that it emphasizes the embodied aspects of cognition [Wilson, 2002], and offers an integrated view.

In parallel to an increasing interest in embodiment, psychologists are also beginning to assess the neuro-cognitive mechanisms of social interaction and in particular those of joint action, investigating what enables us to coordinate our actions, and respond to each other with speed and precision [Barsalou et al., 2003, Sebanz et al., 2006]. Unsurprisingly, many of the ingredients underlying joint action rest in embodied aspects of the team members' cognitive processes, such as motor-based simulation and prediction, mirror neural systems, and top-down perceptual anticipation.

It is my belief that these lessons can and should be transferred to robotic cognition as well, in particular if we are to design robots that act in natural and fluid dialog with a human partner.

Furthermore, the role of practice and rehearsal of a repetitive task has to date not been addressed in the context of human-robot joint action. This thesis wishes to investigate the relationship between a multi-layered embodied cognitive architecture and the effects of practice on the increased fluency of a joint task performance.

Finally, as part of this work I hope to examine a number of interrelated ideas as they stem from this approach and are relevant to human-robot joint activity; among them: (a) attention, (b) routine vs. intentional action, (c) flexible resolution, and (d) anticipatory action.

Implementations

Two implementations and experiments will serve as platforms for the demonstration and evaluation of the proposed approach:

Human-robot shared location teamwork In current research, human-robot teamwork falls short of the goal of fluid shared location collaboration. Robots working with humans are often teleoperated, programmed (using at times natural interfaces), or granted limited autonomy under tight supervision. A heavy verbal component, as well as rigid turn-taking, is present in virtually all collaborative robots. This work aims to demonstrate tightly coupled joint human-robot activity in a collaborative task setting, overcoming some of these limitations.

Human-robot theatrical dialogue While machines have played various parts in a number of stage performances over the past decades, none of them can be considered embodied, autonomous robotic actors. A theatrical scene fluidly enacted between a human and a robot, taking advantage of non-verbal behavior and emerging from an integrated perception-action cognitive framework, is still very much uncharted territory and will be explored in this thesis. This will be achieved through the combined insights from human psychology and human acting theory and practice.

2 Theoretical Background

Embodied Cognition

During its first few decades of existence, artificial intelligence has not only drawn from theories of cognitive psychology, but to a wide extent also shaped notions of amodal, symbolic information processing. According to this view, information is translated from perceptual stimuli into nonperceptual symbols, later used for information retrieval, decision making, and action production. This view also corresponds to much of the work currently done in robotics, where sensory input is translated into semantic symbols, which are then operated upon for the production of motor control.

An increasing body of recent findings challenges this view and suggests instead that concepts and memory are phenomena grounded in modal representations utilizing many of the same mechanisms used during the perceptual process. A prominent theory explaining these findings is one of "simulators", siting memory and recall in the very neural modules that govern perception itself, subsequently used by way of "simulation" or "imagery" [Barsalou, 1999, Kosslyn, 1995]. Perceptual symbols are organized in cross-modal networks of activation which are used to dynamically reconstruct and produce knowledge, concepts, and decision making. This view is supported by evidence of inter-modal behavioral influences, as well as by the detection of perceptual neural activation when a subject is using a certain concept in a non-perceptual manner (e.g. [Martin, 2001]).

Thus, when memory or language are invoked to produce behavior, the underlying perceptual processes elicit many of the same neural patterns and behaviors normally used to regulate perception [Spivey et al., 2005]. To name but a few examples, it has been shown that reading a sentence that has an implied orientation reduces response time on image recognition that is similarly oriented [Stanfield and Zwaan, 2001]; memory recall impairment is found to match speech impediments in children (mistaking rings for wings in children that pronounce 'r's as 'w's) [Locke and Kutz, 1975]; and comparing visually similar variations of a word (e.g. a pony's mane and a horse's mane) is faster than comparing visually distinct variations (e.g. a lion's mane) [Solomon and Barsalou, 2001].

In parallel to a perception-based theory of cognition lies an understanding that cognitive processes are equally interwoven with motor activity. Evidence in human developmental psychology shows that motor and cognitive development are not parallel but highly interdependent. For example, research has shown that artificially enhancing 3-month-old infants' grasping abilities (through the wearing of a sticky mitten) equated some of their cognitive capabilities to the level of older, already grasping[2], infants [Somerville et al., 2004].

[2] In the physical sense.

A related case has been made with regard to hand signals, which are viewed by some as foremost instrumental to lexical lookup during language generation [Krauss et al., 1996], a view supported by findings of redundancy in head movements [McClave, 2000] and facial expression [Chovil, 1992] during speech generation.

A large body of work points to an isomorphic representation between perception and action, leading to mutual and often involuntary influence between the two [Wilson, 2001]. Some researchers speak of specific 'privileged loops' from perception to action, for example between speech and auditory perception, or visual perception and certain motor activity, indicating these action systems' roles in the perceptual pathway [McLeod and Posner, 1984].

In addition, neurological findings indicate that—in some primates—observing an action and performing it causes activation in the same cerebral areas, a phenomenon labeled "mirror neurons" [Gallese and Goldman, 1996]. Similarly, listening to speech has been shown to activate motor areas related to speech production [Wilson et al., 2004], and in pianists, listening to a tonal sequence triggers neural activation in areas associated with finger movement — both for the current tone, and for the next tone in cases where the subject is familiar with the melody. This common coding is thought to play a role in imitation and in relating the behavior of others to our own, which is considered a central process in the development of a Theory of Mind [Meltzoff and Moore, 1997], and possibly underlies the rapid and effortless adaptation to a partner that is needed to perform a joint task [Sebanz et al., 2006]. For a thorough review of these mirror neurological phenomena and the related connection between perception and action as it pertains to human-robot interaction, see [Mataric, 2002].

Several computational insights emanate from this evidence. One is that perceptual processing is not a strictly bottom-up analysis of raw available data, as it is often modeled in robotic systems. Instead, simulations of perceptual processes prime the acquisition of new perceptual data, motor knowledge is used in sensory parsing, and intentions, goals, and expectations all play a role in the ability to parse the world into meaningful objects. This seems to be particularly true for the parsing of human behavior in a goal-oriented and anticipatory manner, a vital component of joint action.

Experimental data supports this hypothesis, finding perception to be predictive (for a review, see [Wilson and Knoblich, 2005]). In vision, information is sent both upstream and downstream, and object priming triggers top-down processing, biasing lower-level mechanisms in sensitivity and criterion [Kosslyn, 1995]. Similarly, visual lip-reading affects the perception of auditory syllables, indicating that the sound signal is not processed as a raw, unknown piece of data [Massaro and Cohen, 1983][3]. High-level visual processing is also involved in the perception of human figures from point light displays, enabling subjects to identify gender and identity from very sparse visual information [Thornton et al., 1998].

The idea of top-down processing has been investigated and utilized in the past, most notably in object and gesture recognition systems. Bregler demonstrated a convincing application of this concept and applied it to action recognition in the visual field [Bregler, 1997]. Wren and Pentland created a robust human dynamics recognition and classification system by feeding likelihood data from high-level HMM procedures to pixel-level classifiers [Wren and Pentland, 1997, Wren et al., 2000]. Similarly, Hamdan et al. classified gesture sequences using Continuous Density Hidden Markov Models [Hamdan et al., 1999]. Several collaborative planning systems integrated similar concepts in amodal settings (e.g. [Rich et al., 2001]). However, the application of an integrated top-down implementation in an action-driven robotic behavior system is yet to be attempted.

Another crucial lesson stemming from the psychological evidence is that concepts are not abstractions separated from their modal representation, but instead rely on the dynamic interaction of modus-specific activation within and — perhaps more importantly — between modalities. In a computational system, this claim has received strong support through the work of Roy and Pentland, showing a significant improvement in robustness of concept learning when multiple modalities were used [Roy and Pentland, 2002].

This thesis will collectively use the term "embodied cognition" to relate to the effect and interrelation of the above-mentioned elements: (a) perceptual symbol systems, (b) integration between perception and action, and (c) top-down processing. I use this overarching term to denote an approach that views mental processes not as amodal semantic symbol processors with perceptual inputs and motor outputs, but as integrated psycho-physical systems acting as indivisible wholes.

It is worthwhile to note that similar convictions have long been held by theorists and teachers of stage acting, who stress a psycho-physical unity underlying acting methodology. As early as the late 19th century, Francois DelSarte noted that a "perfect reproduction of the outer manifestation of some passion, the giving of the outer sign, will cause a reflex feeling within." [Stebbins, 1887] With the rise of the Stanislavskian method, this interrelation becomes even more prominent, and most teachers would probably agree with Boal in saying that "ideas, emotions and sensations are all indissolubly interwoven. A bodily movement 'is' a thought and a thought expresses itself in corporeal form." [Boal, 2002]

This thesis proposes to take a similar approach to designing a cognitive architecture for robots acting with human counterparts, be they teammates or scene partners. It is founded on the hope that grounding cognition in perception and action can hold a key to socially appropriate behavior in a robotic agent, as well as to contextually and temporally precise human-robot collaboration, enabling hitherto unattained fluidity in this setting.

[3] As cited in [Wilson, 2001].


Joint Action

Humans are exceptionally good at working in teams, ranging from the seemingly trivial (e.g. jointly moving a table through a doorway) to the complex (as in sports teams or corporations). What characteristics must a member of a team display to allow for shared activity? What rules govern the creation and maintenance of teamwork? And how does a group of teammates form individual intentions aimed to achieve a joint goal, resulting in a shared activity?

Joint action can best be described as doing something as a team where the participants share the same goal and a common plan of execution. The workings of this inherently social behavior have been of increasing interest to researchers in many fields over the past decade. Grosz—among others—has pointed out that collaborative plans do not reduce to the sum of the individual plans, but consist of an interplay of actions that can only be understood as part of the joint activity [Grosz, 1996].

For example, if we were to move a table jointly through a doorway, your picking up one side of the table and starting to walk through the door does not make sense outside our joint activity. Even the sum of both our picking-up and moving actions would not amount to the shared activity without the existence of a collaborative plan that both of us are sharing (namely to move the table out the door), as well as the presence of tight action coordination before and during task execution.

The conceptual relationship between individual intentions and joint intentions is not straightforward, and several models have been proposed to explain how joint intentions relate to individual intentions and actions. Searle argues that collective intentions are not reducible to individual intentions of the agents involved, and that the individual acts exist solely in their role as part of the common goal [Searle, 1990].

In Bratman's detailed analysis of Shared Cooperative Activity (SCA), he defines certain prerequisites for an activity to be considered shared and cooperative [Bratman, 1992]. He stresses the importance of mutual responsiveness, commitment to the joint activity, and commitment to mutual support. His work also introduces the idea of meshing singular sub-plans into a joint activity.

The Bratman prerequisites guarantee the robustness of the joint activity under changing conditions. In the table-moving example, mutual responsiveness ensures that our movements are synchronized; a commitment to the joint activity reassures each teammate that the other will not at some point drop his side; and a commitment to mutual support deals with possible breakdowns due to one teammate's inability to perform part of the plan. Bratman shows that activities that do not display all of these prerequisites cannot necessarily be viewed as teamwork.

Supporting Bratman’s guidelines, Cohen and Levesque propose a formal approach tobuilding artificial collaborative agents [Cohen and Levesque, 1991]. Their notion of jointintention is viewed not only as a persistent commitment of the team to a shared goal, butalso implies a commitment on part of all its members to a mutual belief about the state of

9

Page 10: Fluency and Embodiment for Robots Acting with Humansguy/publications/Thesis... · 2007. 2. 14. · Human-robot shared location teamwork Incurrentresearch,human-robotteamworkfalls

Fluency and Embodiment for Robots Acting with Humans

the goal. Teammates are committed to inform the team when they reach the conclusionthat a goal is achievable, impossible, or irrelevant. In our table-moving example, if oneteam member reaches the conclusion that the table will not fit through the doorway, it isan essential part of the implicit collaborative “contract”, to have an intention to make thisknowledge common. In a collaboration, agents can count on the commitment of othermembers, first to the goal and then—if necessary—to the mutual belief of the status of thegoal. For a more detailed description, see [Cohen and Levesque, 1991].

Joint Intention Theory predicts that an efficient and robust collaboration scheme in a changing environment requires an open channel of communication. Sharing information through communication acts is critical given that each teammate often has only partial knowledge relevant to solving the problem, different capabilities, and possibly diverging beliefs about the state of the task.

The above suggests that a central feature of any collaborative interaction is the establishment and maintenance of common ground, defined by Clark as "the sum of [...] mutual, common, or joint knowledge, beliefs, or suppositions" [Clark, 1996]. Common ground is necessary with respect to the objects of the task, the task state, and the internal states of the team members.

Common ground about a certain proposition p is believed by Clark to rely on a shared basis b, which both suggests p and is itself common knowledge. People sharing a common ground must be mutually aware of this shared basis and assume that everyone else in the community is also aware of this basis. A shared basis can be thought of as a signal, both indicating the proposition p and being mutually accessible to all agents involved. Clark coins this idea the principle of justification, indicating that members of a shared activity need to be specifically aware of what the basis for their common ground is [Clark, 1996].

This leads to the introduction of so-called coordination devices that serve as a shared basis, establishing common ground. Coordination devices often include jointly salient events, such as gestural indications, obvious activities of one of the agents, or salient perceptual events (such as a loud scream, or a visibly flashing light).

Finally, an important form of reaching common ground inherent to teamwork is the principle of joint closure. Joint closure on a sub-activity is needed to advance a shared activity. Until joint closure is achieved, the sub-activity cannot be regarded as complete by a team. Joint closure is described as "participants in a joint action trying to establish the mutual belief that they have succeeded well enough for current purposes" [Clark, 1996].

To achieve joint closure, Clark argues that a contribution, i.e. a signal successfully understood, is usually made up of a presentation phase and an acceptance phase. For the contribution to be concluded, the presenter needs evidence of understanding on the interlocutor's part.

Both the principle of shared basis and the principle of joint closure and justification call for the importance of mechanisms of shared attention between participants of a shared task. Shared attention is predominantly a nonverbal process. Thus, a robot working together with a human and following the principles of common ground must be able to understand nonverbal signals indicating a shared object of attention, as well as establish joint attention with the human through its own nonverbal behavior.

Another key capability for establishing correctly meshed joint action is the correct anticipation of a team member's actions. This anticipation occurs on several levels in human-human joint action. On the activity level, humans are biased to use an intention-based psychology to interpret other agents' actions [Dennett, 1987]. Moreover, it has been shown in a variety of experimental settings that from an early age we interpret intentions and actions based on intended goals rather than specific activities or motion trajectories [Woodward et al., 2001, Gleissner et al., 2000, Baldwin and Baird, 2001]. A person opening a door, once by using her right hand and once by using her left hand, is considered to have performed the same action, regardless of the different motion trajectories. But this person is doing a distinctly different action if she is opening a tap, even though the motion might be very similar to the one used for opening the door. To a human spectator, the goal of the action is often more salient than its physical attributes.

In teamwork, goals provide common ground for communication and interaction. Teammates' goals need to be understood and evaluated, and joint goals coordinated. When co-planning with a human team member, a robotic agent needs to take goals and their achievement into account when anticipating what the human's next action is likely to be.

On an action level, successful coordinated action has been linked to the formation of expectations of each partner's actions by the other [Flanagan and Johansson, 2003], and the subsequent acting on these expectations [Knoblich and Jordan, 2003]. In neural terms, motor resonance (the activation of motor areas when perceiving a similar motion from a conspecific) seems to be involved in motor prediction. A suggested mechanism for this is one of emulators, which can be thought of as equivalent to the simulators mentioned in the previous section, but operating one step ahead of current perceptual processing [Wilson and Knoblich, 2005]. Naturally, these emulators also play a role in the top-down perceptual processes discussed above.

Perhaps surprisingly, while the existence and complexity of joint action has been acknowledged for decades, the neuro-cognitive mechanisms underlying it have only received sparse attention, predominantly over the last few years (for a review, see [Sebanz et al., 2006]). Much of the innovative work being done in the exploration of two humans sharing a task focuses on embodied structures of cognition, and recent work in neural imaging has repeatedly shown evidence for anticipatory structures which come into play during joint action between two humans [Sebanz et al., 2006]. This thesis proposes to adopt a similar approach for human-robot joint action.


Acting Method

One of the goals of this thesis is to develop a framework for a joint human-robot theatrical performance, which will display some of the same tightly meshed joint action characteristics discussed in this proposal. To that end, this work turns to acting theory and practice according to the prevalent school today, the Stanislavsky system, also known as the Stanislavsky method, or simply "the method" [Moore, 1968].

Similar to the underlying belief of embodied cognition coupled with a motivation-based intentional action selection framework laid out in this thesis, "method" acting places an emphasis on body-based memory and objective-centric action selection. In this section, I will outline the elements most pertinent to the proposed computational architecture.

A written theatrical scene is analyzed in preparation by an actor with respect to the character's complete story arc and background information. A character is said to have a superobjective, also known as a character objective, in any given play.

In each scene, a character follows a scene objective, which is derived from the overarching goal and drives every one of their actions. This objective is either satisfied or not satisfied at the end of the scene. A scene objective is said to be immediate, personal, and selfish. It is read through the character's reaction to the scene partner's actions with respect to the character's scene objective.

Within a scene, the character attempts to achieve the scene objective by a series of actions, which are transitive, tactical, and always pertain to the scene partner. Each action advances the scene objective through its own linear objective.

For example, a possible action to achieve the scene objective of "attaining forgiveness from x" could be "flattering x" or "humiliating oneself".

Scene actions are often not successful, due to obstacles, which can stem from the character themselves, from the environment, or from the scene partner. This requires the character to try another action (such as "threatening x") to achieve the scene objective. Naturally, the selection of actions is highly dependent on the character's scene partner's actions [Meisner and Longwell, 1987].

The final breakdown is from actions to activities, which are the physical manifestations (verbal or nonverbal) of the current action.

Scene actions are closely tied to the character's — and the actor's — embodied memory, evoking the appropriate nonverbal behavior to accompany a scene's delivery of a certain line.

Also, the pursuit of scene objectives and actions does not happen only on cue, but is instead continuous even between cues [Moore, 1968, Hoffman, 2005b], and synchronized to pre-cue impulses [Meisner and Longwell, 1987] which are in response to the scene partner's actions.


Related Work: Synthetic Theatrical Agents

As noted, there has been little work using autonomous robots in theatrical performance. In the past decade, however, significant efforts have been made in the field of artificial theatrical agents and synthetic performance characters.

The Oz project developed tools to construct simulated worlds including characters which displayed goal-directed behavior, emotions, and a sensory model [Bates et al., 1994]. Extensions of this work included the handling of plots, music, and scenery modeled as agents as well [Reilly, 1995].

A few years later, perhaps the most significant work of its time dealing with the authoring of a theatrical performance was presented with IMPROV, which included tools to script parametric layered animation, action, and behavior systems, and resulted in an improvisational environment which could sustain long-running performances thanks in part to a probabilistic action selection mechanism [Perlin and Goldberg, 1996]. However, this system still falls short of defining a cognitive architecture for synthetic characters, and also did not deal with the issues of action timing and duration.

Several researchers at the MIT Media Lab addressed some of these shortcomings. Pinhanez used an extension of interval algebra to stage synthetic character performances including sophisticated interrelations between the characters' action durations and timing, as well as methods to parse a human scene partner's actions into a scripted scene [Pinhanez, 1999, Pinhanez and Bobick, 2002]. Some of these ideas have been extended by Brooks and incorporated into work shaping a robot's performance with regard to the timing and affective qualities of action selection and execution [Brooks, 2006].

Becker, Wren, Pentland, and others have used the above-mentioned top-down gesture recognition techniques to create multimodally perceptive and expressive virtual environments [Becker and Pentland, 1996, Wren et al., 1999], among them TheaterSpace, a perceptive dance and theater stage for single performers.

Blumberg and his collaborators devised a fuller cognitive architecture for synthetic characters which were used — among others — to produce interactive performances [Maes et al., 1995, Blumberg, 1999, Burke et al., 2001]. Portions of this work have been conceived and extended by Downie and, in recent years, supported a large number of interactive musical, dance, and abstract graphical performances [Downie, 2005].

In recent years, a number of other works have dealt with various aspects of creating synthetic performances, among them efforts to create human-computer interfaces enabling the design and shaping of synthetic characters' performances [Dontcheva et al., 2003]; "adverb"-like modulation of animations [Burke et al., 2001, Vala et al., 2003]; tools allowing a director to control an interaction based on a pre-authored plot under varying user behavior [Magerko and Laird, 2003]; and other systems for authoring goal-driven interactive characters [Loyall et al., 2004, Geigel and Schweppe, 2004]. Some aspects of this work have also been used for simulated characters outside the theatrical domain, for example in military training scenarios [Swartout et al., 2005].


3 Prior Work

In prior work, my colleagues and I have viewed the interaction between human and robot teammates as a balanced partnership [Breazeal et al., 2004, Hoffman and Breazeal, 2004], informed by the theoretical framework provided by Collaborative Discourse Theory and Shared Plans [Grosz, 1996], Joint Intention Theory [Cohen and Levesque, 1991], and Shared Cooperative Activity [Bratman, 1992]. We demonstrated human-robot teamwork on a hierarchical joint task by utilizing an array of non-verbal mechanisms, such as gaze, posture, gestures, and head movements, to facilitate interaction and assist error correction. Our architecture design was based on findings in social and behavioral psychology, which — as outlined in previous sections — emphasize the maintenance of common ground for the successful execution of shared tasks [Clark, 1996]. As part of this work, we investigated turn-taking mechanisms, mutual support, and the dynamic meshing of subplans between members of a team. We have also demonstrated the effect of non-verbal behaviors on human-robot interaction when the robot interacts with untrained human subjects [Breazeal et al., 2005] (Figure 1).

Figure 1: In human-robot teamwork experiments, gaze is used to maintain common ground, and gestures to coordinate turn taking and mutual support.

While this work tackles some of the social and non-verbal aspects of human-robot teamwork, the interaction in these systems is still very much turn-based.

In later work, we investigated the improvements in efficiency and fluidity in synchronous human-robot teamwork, and the effects anticipatory action has on these [Hoffman and Breazeal, 2006b]. We used a simulated environment in which human subjects collaborated with a robot on a cart-building task (Figure 2). The roles were divided such that the human carried the cart parts while the robot fetched the appropriate tools to join the parts.


Over the course of repetitive executions, the robot increasingly adapted to the human's performance and applied a cost-based anticipatory action selection policy to mesh its actions with those of the human. We found significant improvement in efficiency, as well as in perceived fluidity and commitment on the part of the robot, when studying untrained humans working with a simulated version of our robot [Hoffman and Breazeal, 2006a].

Figure 2: Simulation environment used to test the effect of anticipatory action on perception of a robotic team member's fluency and commitment to a shared task with a human collaborator.

As part of the research architecture proposed below, I have begun implementing a perception-based associative memory model which supports top-down biasing of lower-level perceptual models. In the example depicted in Figure 3, an auditory percept (the sound "Elmo") activates a canonical visual memory of the figure, which — using the same pathway utilized in visual perception — detects the dominant color of the image. This color is then used as a bias affecting the low-level visual buffer, shifting it towards detection of similarly colored areas, eventually priming the system to detect the Elmo puppet in the visual field more easily. This alternative interpretation of the concept of anticipatory behavior illustrates the potential perceptual capabilities of a simulator-based memory model.


Figure 3: Top-down processing and cross-modal activation in a perception-based associative memory model.
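To make the priming pathway concrete, the following is a minimal sketch of the Figure 3 example, assuming a hypothetical dominant-color summary of the stored visual memory and a simple color-distance bias over the visual buffer; the class and function names are illustrative and not the implemented system.

    import numpy as np

    # Illustrative sketch only: a stored visual memory exposes its dominant color,
    # which is then used to bias a low-level visual buffer toward similarly
    # colored regions.
    class VisualMemory:
        def __init__(self, name, canonical_image):
            self.name = name
            self.canonical_image = canonical_image        # H x W x 3 array in [0, 1]

        def dominant_color(self):
            # Mean color of the stored snapshot stands in for a richer color model.
            return self.canonical_image.reshape(-1, 3).mean(axis=0)

    def color_biased_saliency(frame, bias_color, strength=4.0):
        """Bias a visual buffer: pixels close to bias_color receive high saliency."""
        distance = np.linalg.norm(frame - bias_color, axis=-1)
        saliency = np.exp(-strength * distance)
        return saliency / (saliency.max() + 1e-9)

    # Hearing "Elmo" activates the stored (red) visual memory, which in turn
    # primes the visual buffer toward red regions, easing detection of the puppet.
    auditory_to_visual = {"elmo": VisualMemory("elmo", np.tile([0.9, 0.1, 0.1], (8, 8, 1)))}
    camera_frame = np.random.rand(48, 64, 3)              # stand-in camera image
    primed = color_biased_saliency(camera_frame, auditory_to_visual["elmo"].dominant_color())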

4 Research Architecture

This section describes the main components of the proposed research architecture, which stands on three main pillars, based on the theoretical framework outlined above: (a) perception-based memory; (b) action-perception activation networks; and (c) a motivation- and goal-based action selection mechanism. These three concepts are highly interrelated: the perceptual memory model supports top-down simulator- and emulator-type biasing, which is activated — among others — by the action-perception activation network connections. The motivation/goal layer builds on the perception-action layer, both by serving as a recourse for unresolved goals and, conversely, by acting as a trigger for simulator- and emulator-type biasing as part of the action selection mechanism.

Perception-Based Memory

Grounded in perceptual symbol theories of cognition, and in particular that of simulators [Barsalou, 1999], memories and concepts are based on perceptual snapshots and reside in the various perceptual systems rather than in an amodal centralized semantic network (Figure 5). Incoming perceptions are organized in percept trees [Blumberg et al., 2002] for each modality, arranged on a gradient of specificity (Figure 4).

Memories are organized in activation networks, connecting them to each other, as well as to concepts, actions, plans, and intentions (dotted lines in Figure 5). However, it is important to note that connections between perceptual, conceptual, and action memory are not binary. Instead, a single perceptual memory—for example the sound of a cat meowing—can be more strongly connected to one memory—like the proprioceptive memory of tugging at a cat's tail—or more weakly connected to another memory, such as that of one's childhood living room scent.

Figure 4: Percept trees organize sensory experience on a gradient of specificity.

Perceptions are retained in memory, but decay over time. Instead of subscribing to the classic division of short- and long-term memory, the proposed architecture advocates a gradient, information-based decay in which more specific perceptual memory decays faster, and more general perceptual information is retained longer. Decay is governed by the amount of storage needed to keep perceptual memory at various levels of specificity, retaining more specific memories for a shorter period of time, and more abstract—or filtered—perceptions longer.

These retention patterns are also influenced by the relevance of a piece of perceptual information to the current task and its attentive saliency: perceptual elements that are specifically attended to at time of acquisition will be retained longer than ones that were peripheral at that time. In a similar vein, affective states of the system also influence memory retention[4].

[4] Such an approach could explain why certain marginal perceptions are retained for a long time if they occurred in a traumatic setting.

For example, in the visual processing pathway, raw images are retained for a short period of time, while colors, line orientations, blob location, and fully recognized objects remain in memory for a longer time span (Figure 6).
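A minimal sketch of this specificity-dependent retention follows, assuming exponential decay with longer half-lives at more abstract levels and a salience bonus; the levels and constants are illustrative rather than taken from the proposal.

    import time

    class PerceptMemory:
        """One retained percept; more specific levels decay faster (sketch only)."""

        # Illustrative half-lives in seconds per specificity level: the raw buffer
        # fades within seconds, recognized objects persist much longer.
        HALF_LIFE = {"raw": 2.0, "feature": 30.0, "object": 600.0}

        def __init__(self, level, content, salience=0.0):
            self.level = level
            self.content = content
            self.salience = salience        # attended percepts are retained longer
            self.created_at = time.time()

        def strength(self, now=None):
            now = time.time() if now is None else now
            half_life = self.HALF_LIFE[self.level] * (1.0 + self.salience)
            return 0.5 ** ((now - self.created_at) / half_life)

    def sweep(memories, threshold=0.05):
        """Discard memories whose retention strength has decayed below threshold."""
        return [m for m in memories if m.strength() >= threshold]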

Simulators and Production

A key capability of perceptual memory is the production of new perceptual patterns and concepts. Using simulation, the activation of a perceptual symbol can evoke the construction of both a past and a fictional situation. Citing [Barsalou, 1999]: "Productivity in perceptual symbol systems is approximately the symbol formation process run in reverse. During symbol formation, large amounts of information are filtered out of perceptual representations to form a schematic representation of a selected aspect. During productivity, small amounts of the information filtered out are added back."

Figure 5: Perception-based memory: percept trees in each perceptual modality (visual, auditory, motor) hold decaying perceptual memories connected to motor actions through weighted activation connections.

Figure 6: Perceptual memory is filtered up in specificity and remains in memory for a variable amount of time, based—among others—on the information requirements at each specificity level.

One mechanism enabling the top-down processing of perceptual memory is the implicit connection between feature-detecting percepts within a certain modality. As illustrated in Figure 7, an activation of a red detector in a perceptual concept implicitly activates other percepts which have activated the same node (as these percept nodes model the same underlying neural structure, analogous to what Simmons and Barsalou denote Analytic Convergence Zones [Simmons and Barsalou, 2003]). Based on the weight of the connection between the activated feature and the perceptual snapshot, different concepts are activated to a varying extent.

A second activation mechanism operates on a higher level, linking percepts of different modalities using weighted connections (Figure 8). This corresponds roughly to the concept of Cross-modality Convergence Zones [Simmons and Barsalou, 2003]. Weights of connectivity are learned and refined over time. Their value is influenced by a number of factors, such as frequency of co-occurrence, attention while creating the perceptual memory, and affective states during perception. Roy and Pentland's work in multimodal concept learning suggests insight into the potential acquisition of these intermodal connections [Roy and Pentland, 2002].
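A sketch of how such cross-modal connection weights might be acquired and used for activation spreading is given below, assuming a simple co-occurrence update; the attention and affect factors mentioned above are omitted, and the update rule is illustrative.

    from collections import defaultdict

    class CrossModalNetwork:
        """Weighted links between percepts in different modalities (sketch)."""

        def __init__(self, learning_rate=0.1):
            self.weights = defaultdict(float)   # (source, target) -> connection weight
            self.lr = learning_rate

        def observe_together(self, percept_a, percept_b):
            # Strengthen the symmetric link between percepts that co-occur.
            for pair in ((percept_a, percept_b), (percept_b, percept_a)):
                self.weights[pair] += self.lr * (1.0 - self.weights[pair])

        def activate(self, source, strength=1.0):
            # Spread activation from one percept to its cross-modal associates.
            return {target: strength * w
                    for (src, target), w in self.weights.items()
                    if src == source and w > 0.0}

    net = CrossModalNetwork()
    net.observe_together("auditory:elmo", "visual:red-fuzzy-shape")
    print(net.activate("auditory:elmo"))        # {'visual:red-fuzzy-shape': 0.1}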

Figure 7: Simulation activates percepts if an active node was triggered in another percept.

Figure 8: Percept trees in different modalities are interconnected.

In order to resolve a multitude of (possibly contradictory) perceptual activations using limited processing capacity, perceptual memories can inhibit one another. A more dominant activation of a certain perceptual symbol (for example — the sound "nose" activating imagery of a human nose rather than an airplane's nose) will inhibit a less dominant activation. However, to enable less dominant activations after a certain time (based, for example, on the monitoring of higher-level goals), a self-inhibiting process can be used to decay the dominance of active perceptual symbols.
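The competition just described can be sketched as a winner-take-all cycle in which the dominant symbol decays its own activation; the decay rate and activation values below are illustrative only.

    def resolve_competition(activations, steps=6, self_decay=0.25):
        """Sketch of self-inhibitory resolution: the most active perceptual symbol
        wins each cycle and suppresses the rest, but its own activation decays, so
        less dominant interpretations can surface later."""
        levels = dict(activations)
        history = []
        for _ in range(steps):
            dominant = max(levels, key=levels.get)    # winner inhibits competitors
            history.append(dominant)
            levels[dominant] *= (1.0 - self_decay)    # self-inhibitory decay
        return history

    # "nose" first evokes the human-nose imagery; as that symbol self-inhibits,
    # the airplane-nose reading eventually wins a cycle.
    print(resolve_competition({"human-nose": 0.9, "airplane-nose": 0.6}))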

Emulators

Based on behavioral and neurological findings in humans — and on their apparent importance to joint action — perceptual activation and simulation do not only occur in retrospect, but also prospectively, in the form of so-called "emulators" [Wilson and Knoblich, 2005]. In the proposed system, emulators are modeled as a parallel perceptual activation system projecting expectations of perceptual experiences at an atomic time scale.
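One way to read the emulator idea computationally is as a forward model running one step ahead of the incoming percept stream; the sketch below assumes a hypothetical lookup-table forward model and a binary surprise signal.

    class Emulator:
        """Forward model projecting the expected next percept (sketch only)."""

        def __init__(self, forward_model):
            self.forward_model = forward_model      # percept -> expected next percept
            self.expected = None

        def step(self, observed):
            # Compare the incoming percept with the expectation projected earlier.
            surprise = 0.0 if observed == self.expected else 1.0
            # Project the expectation for the next time step before it is sensed.
            self.expected = self.forward_model.get(observed)
            return self.expected, surprise

    emu = Emulator({"reach-toward-cup": "grasp-cup", "grasp-cup": "lift-cup"})
    print(emu.step("reach-toward-cup"))   # ('grasp-cup', 1.0) -- nothing was expected yet
    print(emu.step("grasp-cup"))          # ('lift-cup', 0.0)  -- expectation confirmed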


Figure 9: Competing perceptual symbols are incrementally resolved using dominant-symbol self-inhibitory decay (example based on Barsalou, 2003).

Note that the concept of anticipatory action and emulators is tightly related to the premise of top-down processing. The benefits of using similar notions of expectation in a simulator-like fashion, driving perceptual processing, have been successfully demonstrated in the gesture-recognition field by several researchers [Bregler, 1997, Wren et al., 1999].

Action-Perception Activation Networks

Not only are perceptions situated in weighted activation networks, but they are also connected to motor actions (see Figure 5). This is in line with the action-perception integration approach laid out earlier. Inspired among others by architectures such as Contention Scheduling [Cooper and Shallice, 2000], activities and perceptual concepts occur in a perpetually updating activation relationship.

To elaborate on the operation of such networks: current perceptions exert a weighted influence on activities, leading both to a potential action selection and to the activation of an attention mechanism which in turn refines the perceptual task and aids its success. Thus, for example, the presentation of a screwdriver may activate a grasping action, as well as the activity of applying a screwdriver to a screw. This will activate a perceptual simulator guiding the visual search towards a screw-like object, and the motor memory related to a clockwise rotation of the wrist.
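The screwdriver example can be sketched as two steps of weighted activation spreading, once from percepts to candidate actions and once from the selected activity back to perceptual search targets; the weights and symbol names below are illustrative.

    # Illustrative weights; not the learned values of the proposed architecture.
    PERCEPT_TO_ACTION = {
        "screwdriver": {"grasp-screwdriver": 0.8, "apply-to-screw": 0.6},
    }
    ACTION_TO_PERCEPT = {
        "apply-to-screw": {"screw-like-object": 0.9, "clockwise-wrist-rotation": 0.7},
    }

    def spread(activations, edges):
        """Propagate weighted activation one step across the given connections."""
        out = {}
        for node, level in activations.items():
            for target, weight in edges.get(node, {}).items():
                out[target] = max(out.get(target, 0.0), level * weight)
        return out

    percepts = {"screwdriver": 1.0}
    candidate_actions = spread(percepts, PERCEPT_TO_ACTION)
    search_targets = spread(candidate_actions, ACTION_TO_PERCEPT)   # primes attention
    print(candidate_actions)   # {'grasp-screwdriver': 0.8, 'apply-to-screw': 0.6}
    print(search_targets)      # {'screw-like-object': 0.54, 'clockwise-wrist-rotation': 0.42}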

Figure 10: Perceptual symbols and motor activity are interconnected with weighted activations.

In accordance with behavioral evidence [McLeod and Posner, 1984, Wilson, 2001], certain modalities of perception are more strongly connected to particular motor subsystems, resulting in so-called privileged loops between perception and action. Setting aside the discussion of the innateness or development of such connections, these "loops" are found to exist in humans, for example, between auditory perception and speech production, as well as between the perception of conspecifics' actions and one's own body motor system. This thesis proposes that these mechanisms play a significant role in practiced activity and are instrumental to the quick response time necessitated by fluent joint action.

Specifically, it is a hypothesis of this thesis that such privileged loops, as well as other highly weighted connections within and between perceptual and motor activation patterns, give rise to some of the nonverbal behavior that is necessary for fluent joint action. The time-shortest path from a perception to an action might "lead through" a certain set of perceptual and motor activations, which may double as a coordination device for the swift success of a joint activity.


Figure 11: Stronger connections between perception and action result in so-called privileged loops.


Prediction and Anticipated Action

Among other factors, successful coordinated action has been linked to the formation of expectations of each partner's actions by the other [Flanagan and Johansson, 2003], and the subsequent acting on these expectations [Knoblich and Jordan, 2003]. We have presented initial work aimed at understanding possible mechanisms to achieve this behavior in a collaborative robot [Hoffman and Breazeal, 2006a]. A further research goal is to evaluate whether more fluid anticipatory action can emerge from an approach that uses perception-action activation as its underlying principle, and what directive role anticipation can play in the formation of perceptual processing and fluent joint action.

In the proposed framework, anticipation appears on the perceptual level through the use of simulators, which bias incoming perceptual data, and emulators, which predictively build perceptual models of temporal sequences. With the introduction of action sequences, anticipation can also be formed by the cross-activation of perception and action, possibly using a Bayesian model of action transitions, and preemptively activating perceptual symbols based on the anticipation of actions in such sequences.
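One possible reading of such a transition model is a simple first-order (Markov) estimate learned from observed human action sequences, as sketched below; the action names and the cart-assembly example are illustrative.

    from collections import Counter, defaultdict

    class ActionTransitionModel:
        """First-order model of the partner's action transitions (sketch only)."""

        def __init__(self):
            self.counts = defaultdict(Counter)   # current action -> counts of next action

        def observe(self, sequence):
            for current, nxt in zip(sequence, sequence[1:]):
                self.counts[current][nxt] += 1

        def anticipate(self, current):
            nxt = self.counts[current]
            total = sum(nxt.values())
            return {a: c / total for a, c in nxt.items()} if total else {}

    model = ActionTransitionModel()
    model.observe(["attach-wheel", "fetch-screwdriver", "tighten-screw"])
    model.observe(["attach-wheel", "fetch-screwdriver", "tighten-screw"])
    model.observe(["attach-wheel", "fetch-glue", "press-parts"])
    # After "attach-wheel", perceptual symbols tied to a screwdriver can be
    # preemptively activated, since that transition is the most probable.
    print(model.anticipate("attach-wheel"))   # roughly {'fetch-screwdriver': 0.67, 'fetch-glue': 0.33}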

By modeling goals and intentions of the human collaborator, using a simulation-theoretic framework (as, for example, described in [Gray, 2004]), anticipation can occur on higher task levels, "trickling down" through action detection mechanisms to the action-perception activation network.

Intentions, Motivations, and Supervision

It should be noted that much of the above architecture is predominantly apt for governing automatic or routine activity. Perceptions precipitate concepts and object memories, which in turn prompt actions governing future perception and motor activation. While this model can be adequate for well-established behavior, a robot acting jointly with a human must also behave in concordance with internal motivations and intentions, and with higher-level supervision. This is particularly crucial because humans naturally assign internal states and intentions to animate and even inanimate objects [Baldwin and Baird, 2001, Dennett, 1987, Malle et al., 2001]. Robots acting with people must behave according to internal drives as well as clearly communicate these drives to their human counterpart. This view is also supported in modern acting literature, which endorses "inner monologue" as a pillar of objective-driven acting (see e.g. [Moore, 1968]).

In the proposed architecture, supervisory intention- and motivation-based control affects the autonomic processing scheme outlined above (Figure 12). At the base of this supervisory system lie core drives, such as hunger, boredom, attention-seeking, and other domain-specific drives (such as task success), modeled as scalar fluents, which the agent seeks to maintain at an optimal level (similar to [Breazeal, 2002]). If any of these fluents rises above or falls below its defined range, it triggers a goal request. The number and identity of the agent’s motivational drives are fixed.


Figure 12: Motivational drives are used to form plans overriding or swaying action selection in the Action-Perception Activation Network. (The diagram’s labeled components include: Perception, the Action-Perception Activation Network, Drives/Motivation, Goal Arbitration, a Goal-Action Planner with Goal Recipes and Learned Recipes, an Action Space with Learned Actions, a Pose Graph, a Motor Graph, Learned Motion Trajectories, and Motor Control.)

Goals are represented as vectors indexed by the various motivational drives, encoding the effect that achieving the goal has on each drive. This representation can be used by the Goal Arbitration module to decide which goal to pursue. Under the assumption that goals are mutually exclusive at any point, the relative deviance of each motivational drive from its optimal value, combined with the expected effect, can be used to compute a utility for each goal. The goal with the highest utility according to this function is then selected.
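
A minimal sketch of this arbitration scheme follows, under the assumption that a goal’s utility is the dot product of the per-drive deviances with the goal’s expected effect vector. The drive names, optimal levels, and goal effects below are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Drive:
    name: str
    value: float      # current level of the scalar fluent
    optimum: float    # level the agent tries to maintain

    def deviance(self):
        return self.optimum - self.value   # signed: positive means the drive is depleted

@dataclass
class Goal:
    name: str
    effect: dict      # expected change to each drive if the goal is achieved

def arbitrate(drives, goals):
    """Pick the goal whose expected effect best reduces current drive deviances."""
    def utility(goal):
        return sum(d.deviance() * goal.effect.get(d.name, 0.0) for d in drives)
    return max(goals, key=utility)

drives = [Drive("task-success", value=0.2, optimum=1.0),
          Drive("attention",    value=0.8, optimum=0.7)]
goals = [Goal("illuminate-workpiece", {"task-success": 0.6}),
         Goal("seek-eye-contact",     {"attention": 0.3})]

print(arbitrate(drives, goals).name)   # -> illuminate-workpiece
```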

Once a goal is selected, a planner is used to construct a plan that satisfies the goal. A plan can then activate or override the automatic action selection mechanisms in the action-perception activation network, activating simulators and guiding perception and motor activity, according to the principles laid out in [Cooper and Shallice, 2000]. The outcome of actions is fed back through perceptual acquisition to the motivational system.

Note that this model can also be used as the basis for an experience-based intention-reading framework, as outlined in [Hoffman, 2005a].


Practice

According to the model put forth here, practice operates on two levels: it alters the weights of connections within the perception and motor activation networks, and thus transfers deliberate action selection procedures (stemming from the motivational layer) to a faster action activation mechanism; and it helps shape the anticipatory action selection mechanism by reinforcing transition probabilities and time estimates, which in turn are used by emulators to predict future perceptual and motor operations.
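
These two effects of practice could be prototyped roughly as below, reusing the illustrative network and transition model from the earlier sketches. The Hebbian-style weight update and the learning rate are assumptions made only for this example, not a specification of the learning mechanism.

```python
def practice_step(net, transition_model, episode):
    """Update the toy architecture after one practiced episode.

    episode: ordered list of (percept, action) pairs recorded during the run.
    """
    learning_rate = 0.1   # assumed value, for illustration only

    # 1. Strengthen perception-to-action connections that co-occurred, shifting
    #    selection from deliberate planning toward direct activation.
    for percept, action in episode:
        old = net.weights[percept].get(action, 0.0)
        net.weights[percept][action] = old + learning_rate * (1.0 - old)

    # 2. Reinforce the action-transition statistics used for anticipation.
    actions = [a for _, a in episode]
    for prev_action, next_action in zip(actions, actions[1:]):
        transition_model.observe(prev_action, next_action)
```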

Theatrical Scene Analysis

A final component of the proposed architecture is specific to the theatrical human-robot dialog intended in this thesis, based on insights from the Stanislavsky system for actor preparation. I propose an authoring framework for human-robot theater which will enable a robot using the above cognitive architecture to engage in a theatrical performance with a human scene partner.

In preparation for a performance with a human, a scene is analyzed in terms of its scene objective and broken down into the specific actions5 attempted by the robot. Each of these actions is tied to a set of perceptual symbols, which in turn are connected to motor actions as described above. When a scene is played out, these actions are invoked, intended to lead to appropriate perceptual and motor behavior.

In addition, cues (tied to script points) and impulses (in terms of the scene partner’s actions) are marked in the scene representation. Impulses are represented as changes in the motivational system triggering a perceivable change in the robot’s behavior. Note that impulses are a variant of the anticipatory action behavior described above, and will be subject to temporal refinement with rehearsal.
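
One possible authoring representation for such an analyzed scene is sketched below. The field names and the example scene content are invented for illustration; the actual representation will be determined by the scene analysis procedure.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneAction:
    """A Stanislavskian action: what the character tries to do to the partner."""
    name: str
    perceptual_symbols: List[str]      # symbols whose activation grounds this action
    motor_actions: List[str]           # motor programs those symbols connect to

@dataclass
class Impulse:
    """Partner behavior that perturbs the robot's motivational system."""
    trigger_percept: str
    drive_changes: Dict[str, float]    # e.g. {"attention": +0.4}

@dataclass
class SceneScript:
    objective: str
    actions: List[SceneAction]
    cues: Dict[str, str]               # script point -> robot response
    impulses: List[Impulse] = field(default_factory=list)

scene = SceneScript(
    objective="get the partner to confess",
    actions=[SceneAction("press", ["percept:partner-averts-gaze"], ["motor:lean-in"])],
    cues={"line:'you were there'": "motor:freeze"},
    impulses=[Impulse("percept:partner-raises-voice", {"attention": 0.4})],
)
```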

Action accomplishments and failures are fed back to the motivational system, as described in the previous section, modifying subsequent performances.

5The use of the term action here is in the Stanislavskian Method sense, not in the sense used above in the description of action-perception networks. The above-mentioned actions would correspond to activities in Method acting.


5 Research Derivatives

The approach presented in this proposal involves the consideration of a number of important cognitive mechanisms that underlie successful human-robot joint action. It is a secondary goal of my thesis to investigate their operation under the proposed approach:

Attention The generic “perception problem” is quite intractable. Humans (as well as other animals) use attention systems to filter incoming data, and attention has also proved useful for robotic agents [Breazeal and Scassellati, 1999]. In the framework advocated herein, attention emerges from the interaction between actions and perceptual concepts, both in simulation and in anticipation. As described above, attention conversely also determines the relationships between elements in perceptual memory.

Routine Action Performing a well-known task is likely to involve non-conscious and sub-planning action selection mechanisms. In humans, routine action performance is often found to be faster than conscious or planned activity. In this thesis I would like to explore the relationship between routine and supervised action. I am particularly interested in investigating the connection between embodied cognition and routine activity, and the effect of these mechanisms on human-robot interaction.

Practice In direct relation to routine activity, practice is an important facet of joint action, and of any kind of performance art. The mechanisms underlying repeated practice of a task, and in particular joint practice to achieve fluid collaboration, are not well understood. I would like to examine the use of the proposed model for understanding how practice improves joint action, beyond the aspects that have been investigated in my prior work [Hoffman and Breazeal, 2006a].

Flexible Resolution Cognition often appears to conduct itself on a ‘need-to-know’ or ‘need-to-do’ basis. It can be argued that planning and adjustment of trained concepts and action schemas only occur when well-practiced routine activation models fail. Similarly, planning, while informed by lower-level perceptual phenomena, does not need to delve into percept-level resolution unless more information than is available is needed. Perception and action can rely on existing patterns and routine simulations unless there is a significant mismatch between the expected and the observed. In this work I would like to examine the role of a simulator architecture as part of an agent operating at a dynamic and flexible resolution, moving between low-level processing and high-level planning as needed.

Role of Nonverbal Behavior Nonverbal behavior plays a significant role in the coordination of joint action. However, some research indicates that much of it is non-voluntary and distinct from the usual communicative function of verbal communication. As part of this thesis I would like to investigate what role a physically based memory, evoking nonverbal behavior as part of perception, planning, and decision making, could play in human-robot joint action.


6 Demonstrations

Two demonstrations will serve as platforms to evaluate the approach and investigate the concepts laid out in this thesis. The first will be a human-robot team working together on a shared-location task; the second is in the entertainment realm, staging a theatrical scene involving a human and a robot.

Human-robot Teamwork

One implementation of the proposed architecture is shared-location human-robot collaboration. A number of scenarios, from personal applications through military use to space exploration, envision robots working together with humans on shared tasks. Most of the proposed systems and existing implementations are still a long way from what we mean by teamwork when talking about humans working with other humans. It is my belief that the fluid meshing of actions is key to the success of such future robotic team members, and that this can be achieved through the use of an embodied cognitive architecture as laid out in this thesis.

Platform

The platform for the teamwork demonstration will be a robotic lighting assistant, collaboratively illuminating a scene for a human working on a bi-manual construction task (Figure 13).

In human collaboration it is often the case that an assistant to a person engaged in a task aids the performance by illuminating the task-relevant area. Appropriately directing a lighting beam towards a shared task location is a highly embodied mission: it must take into account the actions and goals of the human team member and their perspective on the task, and it provides an interesting space for improvement through practice, as well as for the employment of anticipatory actions. Much like in the case of a human lighting assistant, the robot’s light beam can also be used as a deictic gesturing device, pointing out aspects of the task that the operator needs to be aware of.
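
As a purely geometric illustration (separate from the cognitive architecture itself), aiming such a beam at a shared task location reduces to converting a target point in the robot’s frame into pan and tilt angles for the lamp head. The frame convention and lamp mounting height below are assumptions made for the example.

```python
import math

def aim_lamp(target, lamp_origin=(0.0, 0.0, 0.4)):
    """Compute pan/tilt angles (radians) that point a lamp at a 3D target.

    Assumes a right-handed frame with x forward, y left, z up, a lamp whose pan
    rotates about z, and a tilt measured from the horizontal plane.
    """
    dx = target[0] - lamp_origin[0]
    dy = target[1] - lamp_origin[1]
    dz = target[2] - lamp_origin[2]
    pan = math.atan2(dy, dx)
    tilt = math.atan2(dz, math.hypot(dx, dy))
    return pan, tilt

# Point the beam at a workpiece 60 cm ahead and 20 cm to the left, on the table surface.
print(aim_lamp((0.6, 0.2, 0.0)))
```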

The prospect of a robotic illuminating assistant also addresses a recurring real-world problem which, at the moment, does not have a satisfactory automated solution. Advances towards this goal could be of interest to a number of real-world applications, such as space station maintenance, deep-sea exploration, surgery, machine repair, and others.

In an effort to focus this research on the behavioral aspects of joint action, the proposed robotic lighting assistant will be built upon an existing robotic platform in our lab, RoCo, a robotic computer monitor (Figure 14). RoCo has a morphology that supports the task at hand, as well as a versatile mount allowing for the attachment of a robotic lighting head. At the time of writing, the software infrastructure supporting RoCo’s motor control is complete.


Figure 13: Collaborative robotic lighting assistant (concept sketch).

A new head will be designed as part of this thesis to support two additional degrees of freedom. One DOF will change the beam radius, as both a task-oriented and a deictic gesture, while the other will alter the intensity (and possibly color) of the light beam. I believe that it is possible to achieve a substantial level of expressiveness with these degrees of freedom.

Cognitive Implementation

The collaborative lighting implementation will employ two modalities, visual and auditory. The visual modality will consist of either a two-camera system, utilizing gesture recognition techniques in the vein of [Bobick and Davis, 2001], or a marker-based infrared articulated tracking system for more accurate skeleton recognition. The auditory modality will operate on two levels: the audio waveform, for analysis of onset, termination, and level of speech segments, and a natural language recognition engine utilizing a limited grammar.

Perceptual concepts that are appropriate for the proposed task fall into three broad categories: activities, objects, and relations. Human activities that can be learned and detected are visually based on skeletal joint trajectories and possibly on features of optical flow. To enable the envisioned fluency, the proposed cognitive implementation needs to operate on a sub-gesture level, allowing for perceptual activation at the onsets of recognizable gestures and human activities. Such activities could include pointing, picking up, fastening, pushing, etc. In addition, higher-level task activities and plans can be modeled based on atomic actions.

Objects include a hierarchical taxonomy of parts, components, and systems comprising the target of the joint task, as well as the tools needed to complete the task. In an engine repair scenario these concepts could include parts such as bolts, washers, and machine parts; components such as the exhaust or alternator; tools; and the engine as a whole (although the latter might not be necessary in a single-task setting).

Figure 14: RoCo, a robotic computer monitor.

Finally, relationships between parts and tools; between tools, activities, and components; and between objects can also be represented as perceptual concepts. This might well be the most challenging aspect of a perceptual symbol system, as these concepts are likely to comprise perceptually diverse prototypes. Motor activity representations could prove to be highly correlated with the detection and recognition of relationship concepts.

Evaluation

Due to the novelty of this work, there are few known metrics for joint action fluency. Defining these is therefore part and parcel of the work proposed. An indirect measure of this quality is the efficiency of the interaction, which we have previously assessed through overall task time, number of errors, and error recovery timing [Breazeal et al., 2005]. I plan to use these and similar measures as indirect metrics in the proposed human-robot teamwork experiment.

The various experimental conditions will be controlled with respect to the different elements of the proposed system, for example:

1. Anticipatory vs. reactive action — In the anticipatory condition, the agent will make action decisions based on both past events and projected expectations. This will be compared to a condition in which the current action is derived only from past states.


2. Information-gradient vs. constant memory decay — In the information-gradient decay condition, short-term memory will decay at an information-dependent rate: high-information representations will decay faster than more abstract representations. This will be compared to a condition in which memory decays at a common rate, or falls off at a threshold (a sketch of this condition follows the list).

3. Integration of motor system in perceptual processing vs. bottom-up action selection — This manipulation can be viewed as a “lesion”-type study in which top-down action-based bias on perceptual classification will be active in one condition, and inactive in the second condition.

4. Self-inhibitory decay — Self-inhibitory decay, used to allow weaker perceptual links to be activated after a certain amount of time, will be active in one condition and inactive in the second condition. Alternatively, the second condition could instead use a non-temporal mechanism of perceptual link enumeration.
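
To illustrate the information-gradient decay condition named in item 2 above, a toy sketch is given below. How “information content” is scored, and the decay constants, are assumptions made only for this example.

```python
import math

def decay_memory(memory, dt, mode="information-gradient", base_rate=0.1):
    """Decay short-term memory items in place.

    memory: dict mapping item name -> {"strength": float, "info": float},
    where "info" stands in for the item's information content (e.g. the
    specificity of the stored representation).
    """
    for item in memory.values():
        if mode == "information-gradient":
            rate = base_rate * (1.0 + item["info"])   # richer items fade faster
        else:                                          # constant-rate control condition
            rate = base_rate
        item["strength"] *= math.exp(-rate * dt)

memory = {
    "exact-hand-trajectory": {"strength": 1.0, "info": 3.0},   # high-information
    "partner-is-assembling": {"strength": 1.0, "info": 0.5},   # abstract
}
decay_memory(memory, dt=2.0)
print({k: round(v["strength"], 2) for k, v in memory.items()})
```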

In addition, self-report can be used to evaluate the sense of fluency in human subjects. Derived judgments of the robotic team member, such as trust, perceived commitment to the task, and perceived intelligence, can also be evaluated through a self-report questionnaire.

I will also attempt to distill relevant metrics of fluency from a comprehensive interaction log. In previous work, a number of novel fluency metrics have been found to differ significantly between joint-action experimental conditions [Hoffman and Breazeal, 2006a]. These included the time delay between human and robot action, and the rate of concurrent motion. The evaluation of novel metrics and their relationship to various aspects of the proposed architecture is crucial to the understanding of human-robot fluency.
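
As a sketch of how such metrics could be distilled from a timestamped interaction log, the function below computes one plausible reading of the two measures named above. The log format and the exact metric definitions are assumptions for illustration, not the definitions used in the prior work.

```python
def fluency_metrics(log):
    """Compute two candidate fluency metrics from an interaction log.

    log: list of (start_time, end_time, agent) tuples, agent in {"human", "robot"},
    each describing one action interval in seconds.
    """
    human = sorted(iv for iv in log if iv[2] == "human")
    robot = sorted(iv for iv in log if iv[2] == "robot")

    # Mean delay between the end of each human action and the start of the next
    # robot action (one reading of "time delay between human and robot action").
    delays = []
    for _, h_end, _ in human:
        starts = [r_start for r_start, _, _ in robot if r_start >= h_end]
        if starts:
            delays.append(min(starts) - h_end)
    mean_delay = sum(delays) / len(delays) if delays else 0.0

    # Fraction of total task time during which both agents were acting at once.
    total = max(iv[1] for iv in log) - min(iv[0] for iv in log)
    overlap = sum(max(0.0, min(h[1], r[1]) - max(h[0], r[0]))
                  for h in human for r in robot)
    return {"mean_delay": mean_delay, "concurrent_motion_rate": overlap / total}

log = [(0.0, 2.0, "human"), (1.5, 3.0, "robot"), (3.0, 5.0, "human"), (5.2, 6.0, "robot")]
print(fluency_metrics(log))
```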

Finally, in recent years a number of computational methods have been developed to evaluate human-human interaction based on nonverbal behavior. These methods, inspired in part by a tradition of computational and Markovian analysis of human interaction in behavioral psychology (e.g. [Thomas and Martin, 1976]), rely on pattern recognition of human dynamics such as prosody, timing, and turn-taking, and manage to capture qualitative aspects of interaction with a high degree of accuracy [Pentland, 2004, Pentland et al., 2005]. A possible metric for the success of the proposed research architecture could be a comparison between human-human and human-robot interaction with respect to such computational measures.

Since the effect of practice in repetitive tasks is a cornerstone of this work, I propose to measure the above metrics both across subjects and between repetitions of the same task within a single subject set.

Human-robot Theatrical Performance

A joint performance between a human and a robotic actor will be staged as part of this thesis. The robot will implement the embodied architecture described above, and may include


additional synthetic theater methods, such as those proposed by [Perlin and Goldberg, 1996] and [Pinhanez, 1999].

While there have been some examples of robotic actors, most have been teleoperated or have been designed according to very simple interaction schemata. More complex systems have been used for two-dimensional virtual theatrical characters, but have not been extended to the physical realm. A truly responsive robotic actor is therefore still a worthy and unattained research goal.

I also argue that robotic actors serve as a good platform for the investigation of synchronized human-robot behavior. Much like future consumer robots, robotic actors are meant to be used in a non-repetitive fashion, and need to operate in close synchronization with untrained humans. This calls for robots that are easy to use, flexible, and adaptive. However, a theatrical performance affords many assumptions that can be made at the engineering level, leaving more room to seriously investigate the core cognitive challenges laid out in this work.

Cognitive Implementation

The cognitive implementation of a theatrical performance system poses a unique challenge, not unfamiliar to a human dramatic actor: on the one hand, the performance must follow a prescribed script, and on the other it should appear internally motivated as well as realistically timed and expressed. As a result, the relationship between the outcome of a formal theatrical scene analysis and a cognitive architecture that engages in a mutually responsive, autonomous action selection mechanism promises to be an important facet of this research.

The perceptual concept set in this implementation shares some of the core features of that outlined in the previous section. Object, action, and relation concepts will be implemented to achieve a realistic and fluent performance. That said, it is likely that object concepts will play a less significant role, as the task, in contrast to the shared-location teamwork implementation, is not necessarily bound to a physically present object. However, a representation of non-present objects will be included to ground the actions and intentions of scene members.

Since the performance will be scripted, additional concepts must be introduced and their relationship to the immediate perceptual systems investigated. Based on the Stanislavskian Method, the characters’ objectives, attitudes (affective states), and actions will be part of the perceptual symbol network, as will the actual activities and dialog lines as written in the script, which can be modeled as motor actions.

In terms of the goals and motivations modeled in the theatrical performance implementation, I propose to represent two kinds of achievement goals. The character’s goals are set as a result of the scene analysis, and are continually monitored by the system for activation of appropriate actions that are related to their fulfillment. In addition, a goal of the robotic performer is related to the conclusion of each dialog segment and the progression of the


performed scene. This approach lends itself to an analysis common in collaborative dialog systems, which emphasize the duality of intentional and linguistic goals (e.g. [Grosz and Sidner, 1986]).

Evaluation

To evaluate the effectiveness of the proposed architecture for a human-robot performance, two actors will serve as scene partners in two separate performances, using the same script implemented on the machine. Through the process of repeated rehearsal, I will attempt to demonstrate the increasing adaptation of the robotic performance to each scene partner, resulting in tightly meshed, but clearly different, performances between the two runs.

The evaluation of this demonstration will be through self-report, both of the participating scene partners and of audience members who will individually view a video of the two final performances.

The system’s final performance will be compared to the first run of the scene, before rehearsal comes into play (using a controlled study). Audience members will evaluate whether the performance is notably different between the two ensembles, as well as between the first and last run of the scene, and will judge each performance according to a number of subjective metrics.

These measures will then be compared to a non-adaptive condition with an equal number of rehearsals. In this condition, the robot employs the same perceptual, memory, and action selection mechanisms, but does not adapt to the scene partner.

Additionally, the proposed system can be compared to both a purely reactive system, in which the robot simply responds to scene cues, and a statically timed system, in which only the human adapts to the robot’s timing.

Interim Demonstrations

Some or all of the following interim demonstrations can serve as steps on the path to the proposed implementations.

1. To further evaluate the concept of anticipatory action, the framework described in [Hoffman and Breazeal, 2006a] can be implemented on a physical robot, demonstrating increasingly meshed action timing achieved by cost-based anticipation.

2. To demonstrate the value of an embodied cognition architecture for the timing of joint action, a human-robot musical duet is proposed. This experiment integrates three modalities of perception with action, and could be implemented as an extension of interim demonstration 1.

3. In addition, to demonstrate the value of a perceptual-simulator-based cognitive architecture for human-robot joint action, the following short demonstrations have been


proposed by Breazeal and Barsalou (private communication, 2006) and are of value to this thesis:

• The robot hears a linguistic description of a novel object never seen before, simulates the object based on compositional linguistic abilities (combining simulations of individual word referents seen previously), and then uses the simulation to locate the object.

• A human instructor performs a procedure while describing it. While watching and listening, the robot simulates doing the procedure itself, and later carries it out on verbal command.

• On perceiving a relevant cue that initiates an action sequence, the robot simulates the sequence and carries it out.

• A human is building something and doesn’t realize that a key component is missing. The robot simulates what the agent is trying to do, realizes that the component is missing, and provides it.


7 Resources

Equipment

Two robots and their supporting computers will be used in the demonstrations laid out above. The Leonardo robot is an existing machine in our group, and all of its support hardware is in place. The proposed robotic lamp will be built upon RoCo’s mechanical infrastructure. Additional fabrication (mechanics and electronics) will be necessary for the lamp head.

For development, the computers currently in the lab are sufficient for the software development and the mechanical and electronic design necessary for this thesis.

Human Subject Study

For the evaluation portion of this thesis, one or more human subject studies will be run, each with an expected data set of under 50 participants. Compensation of between $5 and $10 per study subject will incur additional costs on this project.

UROP

The design of the robotic lamp head will require one full-time mechanical engineering UROP and one part-time UROP majoring in electrical engineering for the fall semester of 2006.

An additional full-time UROP will be hired for set-up and data analysis of the human subject studies in the spring semester of 2007.


8 Timeline

The following timeline lays out the temporal position and extent of the various parts of the proposed thesis, for both myself and the UROPs involved in the project.

[Gantt chart, September 2006 through May 2007, omitted. Its rows cover the robotic lamp head (mechanical and electrical design, Prototypes I and II, revisions, final board manufacturing, and assembly and debugging, staffed by the Mech. E. and part-time E.E. UROPs), the cognitive architecture (perceptual memory, action-perception activation, motivations/drives, practice/rehearsal, acting method, and supporting software), the interim demonstrations, the teamwork and theater studies (setup, studies, and analysis, supported by a study UROP), and writing.]


References

[Argall et al., 2006] Argall, B., Gu, Y., Browning, B., and Veloso, M. (2006). The first segway soccer experience: towards peer-to-peer human-robot teams. In HRI ’06: Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction, pages 321–322, New York, NY, USA. ACM Press.

[Baldwin and Baird, 2001] Baldwin, D. and Baird, J. (2001). Discerning intentions in dynamic human action. Trends in Cognitive Sciences, 5(4):171–178.

[Barsalou, 1999] Barsalou, L. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22:577–660.

[Barsalou et al., 2003] Barsalou, L. W., Niedenthal, P. M., Barbey, A., and Ruppert, J. (2003). Social embodiment. The Psychology of Learning and Motivation, 43:43–92.

[Bates et al., 1994] Bates, J., Loyall, A. B., and Reilly, W. S. (1994). An architecture for action, emotion, and social behavior. In MAAMAW ’92: Selected papers from the 4th European Workshop on Modelling Autonomous Agents in a Multi-Agent World, Artificial Social Systems, pages 55–68. Springer-Verlag.

[Becker and Pentland, 1996] Becker, D. A. and Pentland, A. (1996). Using a virtual environment to teach cancer patients t’ai chi, relaxation and self-imagery. Technical Report VisMod 390, MIT Media Laboratory.

[Bischoff and Graefe, 2004] Bischoff, R. and Graefe, V. (2004). Hermes – a versatile personal assistant robot. Proc. IEEE – Special Issue on Human Interactive Robots for Psychological Enrichment, pages 1759–1779.

[Blumberg, 1999] Blumberg, B. (1999). (void*): a cast of characters. In SIGGRAPH ’99: ACM SIGGRAPH 99 Conference abstracts and applications, page 169. ACM Press.

[Blumberg et al., 2002] Blumberg, B., Downie, M., Ivanov, Y., Berlin, M., Johnson, M. P., and Tomlinson, B. (2002). Integrated learning for interactive synthetic characters. In SIGGRAPH ’02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 417–426. ACM Press.

[Boal, 2002] Boal, A. (2002). Games for Actors and Non-Actors. Routledge, 2nd edition.

[Bobick and Davis, 2001] Bobick, A. F. and Davis, J. W. (2001). The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell., 23(3):257–267.

[Bratman, 1992] Bratman, M. (1992). Shared cooperative activity. The Philosophical Review, 101(2):327–341.

[Breazeal, 2002] Breazeal, C. (2002). Designing Sociable Robots. MIT Press.


[Breazeal et al., 2004] Breazeal, C., Brooks, A., Chilongo, D., Gray, J., Hoffman, G., Kidd, C., Lee, H., Lieberman, J., and Lockerd, A. (2004). Working collaboratively with humanoid robots. In Proceedings of the IEEE-RAS/RSJ International Conference on Humanoid Robots (Humanoids 2004), Santa Monica, CA.

[Breazeal et al., 2005] Breazeal, C., Kidd, C. D., Thomaz, A. L., Hoffman, G., and Berlin, M. (2005). Effects of nonverbal communication on efficiency and robustness in human-robot teamwork. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Edmonton, Alberta, Canada.

[Breazeal and Scassellati, 1999] Breazeal, C. and Scassellati, B. (1999). A context-dependent attention system for a social robot. In IJCAI ’99: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1146–1153. Morgan Kaufmann Publishers Inc.

[Bregler, 1997] Bregler, C. (1997). Learning and recognizing human dynamics in video sequences. In CVPR ’97: Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition, page 568. IEEE Computer Society.

[Brooks, 2006] Brooks, A. (2006). Coordinating Human-Robot Communication. PhD thesis, MIT, Cambridge, MA.

[Bruce et al., 2000] Bruce, A., Knight, J., Listopad, S., Magerko, B., and Nourbakhsh, I. (2000). Robot improv: Using drama to create believable agents. In Proceedings of ICRA 2000, volume 4, pages 4002–4008.

[Burke et al., 2001] Burke, R., Isla, D., Downie, M., Ivanov, Y., and Blumberg, B. (2001). CreatureSmarts: The art and architecture of a virtual brain. In Proceedings of the Game Developers Conference, pages 147–166.

[Chovil, 1992] Chovil, N. (1992). Discourse-oriented facial displays in conversation. Research on Language and Social Interaction, 25:163–194.

[Clark, 1996] Clark, H. H. (1996). Using Language. Cambridge University Press, Cambridge, UK.

[Cohen and Levesque, 1991] Cohen, P. R. and Levesque, H. J. (1991). Teamwork. NOUS, 35:487–512.

[Cooper and Shallice, 2000] Cooper, R. and Shallice, T. (2000). Contention scheduling and the control of routine activities. Cognitive Neuropsychology, 17:297–338.

[Dennett, 1987] Dennett, D. C. (1987). Three kinds of intentional psychology. In The Intentional Stance, chapter 3. MIT Press, Cambridge, MA.

[Dixon, 2004] Dixon, S. (2004). Metal Performance: Humanizing Robots, Returning to Nature, and Camping About. The Drama Review, 48(4).


[Dontcheva et al., 2003] Dontcheva, M., Yngve, G., and Popovic, Z. (2003). Layered acting for character animation. ACM Trans. Graph., 22(3):409–416.

[Downie, 2005] Downie, M. (2005). Choreographing the Extended Agent: Performance Graphics for Dance Theater. PhD thesis, MIT Media Lab, Cambridge, MA.

[Flanagan and Johansson, 2003] Flanagan, J. R. and Johansson, R. S. (2003). Action plans used in action observation. Nature, 424(6950):769–771.

[Fong et al., 2003] Fong, T. W., Thorpe, C., and Baur, C. (2003). Multi-robot remote driving with collaborative control. IEEE Transactions on Industrial Electronics.

[Gallese and Goldman, 1996] Gallese, V. and Goldman, A. (1996). Mirror neurons and the simulation theory of mind-reading. Brain, 2(12):493–501.

[Geigel and Schweppe, 2004] Geigel, J. and Schweppe, M. (2004). Theatrical storytelling in a virtual space. In SRMC ’04: Proceedings of the 1st ACM workshop on Story representation, mechanism and context, pages 39–46. ACM Press.

[Gleissner et al., 2000] Gleissner, B., Meltzoff, A. N., and Bekkering, H. (2000). Children’s coding of human action: cognitive factors influencing imitation in 3-year-olds. Developmental Science, 3(4):405–414.

[Goodrich et al., 2001] Goodrich, M., Olsen, D., Crandall, J., and Palmer, T. (2001). Experiments in adjustable autonomy. In Proceedings of the IJCAI Workshop on Autonomy, Delegation and Control: Interacting with Intelligent Agents.

[Gray, 2004] Gray, J. (2004). Goal and action inference for helpful robots using self as simulator. Master’s thesis, Massachusetts Institute of Technology.

[Grosz, 1996] Grosz, B. J. (1996). Collaborative systems. AI Magazine, 17(2):67–85.

[Grosz and Sidner, 1986] Grosz, B. J. and Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3).

[Hamdan et al., 1999] Hamdan, R., Heitz, F., and Thoraval, L. (1999). Gesture localization and recognition using probabilistic visual learning. In Proceedings of the 1999 Conference on Computer Vision and Pattern Recognition (CVPR ’99), pages 2098–2103, Ft. Collins, CO, USA.

[Hoffman, 2005a] Hoffman, G. (2005a). An experience-based framework for intention-reading. Technical report, MIT Media Laboratory, Cambridge, MA, USA.

[Hoffman, 2005b] Hoffman, G. (2005b). HRI: Four lessons from acting method. Technical report, MIT Media Laboratory, Cambridge, MA, USA.

[Hoffman and Breazeal, 2004] Hoffman, G. and Breazeal, C. (2004). Collaboration in human-robot teams. In Proc. of the AIAA 1st Intelligent Systems Technical Conference, Chicago, IL, USA. AIAA.


[Hoffman and Breazeal, 2006a] Hoffman, G. and Breazeal, C. (2006a). Cost-based anticipatory action-selection for human-robot fluency. Unpublished.

[Hoffman and Breazeal, 2006b] Hoffman, G. and Breazeal, C. (2006b). What lies ahead? Expectation management in human-robot collaboration. In Working notes of the AAAI Spring Symposium, Stanford University, Palo Alto, CA.

[Kimura et al., 1999] Kimura, H., Horiuchi, T., and Ikeuchi, K. (1999). Task-model based human robot cooperation using vision. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS’99), pages 701–706.

[Knoblich and Jordan, 2003] Knoblich, G. and Jordan, J. S. (2003). Action coordination in groups and individuals: learning anticipatory control. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(5):1006–1016.

[Kosslyn, 1995] Kosslyn, S. (1995). Mental imagery. In Kosslyn, S. and Osherson, D. N., editors, Invitation to Cognitive Science: Visual Cognition, volume 2, chapter 7, pages 276–296. MIT Press, Cambridge, MA, 2nd edition.

[Krauss et al., 1996] Krauss, R. M., Chen, Y., and Chawla, P. (1996). Nonverbal behavior and nonverbal communication: What do conversational hand gestures tell us? In Zanna, M., editor, Advances in experimental social psychology, pages 389–450. Tampa: Academic Press.

[Lakoff and Johnson, 1999] Lakoff, G. and Johnson, M. (1999). Philosophy in the flesh: the embodied mind and its challenge to Western thought. Basic Books, New York.

[Les Freres Corbusier, 2006] Les Freres Corbusier (2006). Heddatron. http://www.lesfreres.org/heddatron/.

[Locke and Kutz, 1975] Locke, J. L. and Kutz, K. J. (1975). Memory for speech and speech for memory. Journal of Speech and Hearing Research, 18:176–191.

[Loyall et al., 2004] Loyall, A. B., Reilly, W. S. N., Bates, J., and Weyhrauch, P. (2004). System for authoring highly interactive, personality-rich interactive characters. In SCA ’04: Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 59–68. ACM Press.

[Maes et al., 1995] Maes, P., Darrell, T., Blumberg, B., and Pentland, A. (1995). The ALIVE system: full-body interaction with autonomous agents. In CA ’95: Proceedings of the Computer Animation, page 11. IEEE Computer Society.

[Magerko and Laird, 2003] Magerko, B. and Laird, J. (2003). Building an interactive drama architecture with a high degree of interactivity. In 1st International Conference on Technologies for Interactive Digital Storytelling and Entertainment, Darmstadt, Germany.

[Malle et al., 2001] Malle, B., Moses, L., and Baldwin, D., editors (2001). Intentions and Intentionality. MIT Press.


[Martin, 2001] Martin, A. (2001). Functional neuroimaging of semantic memory. In Cabeza, R. and Kingstone, A., editors, Handbook of Functional Neuroimaging of Cognition, pages 153–186. MIT Press.

[Massaro and Cohen, 1983] Massaro, D. and Cohen, M. M. (1983). Evaluation and integration of visual and auditory information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 9(5):753–771.

[Mataric, 2002] Mataric, M. J. (2002). Sensory-motor primitives as a basis for imitation: linking perception to action and biology to robotics. In Dautenhahn, K. and Nehaniv, C. L., editors, Imitation in animals and artifacts, chapter 15, pages 391–422. MIT Press.

[McClave, 2000] McClave, E. (2000). Linguistic functions of head movements in the context of speech. Journal of Pragmatics, 32:855–878.

[McLeod and Posner, 1984] McLeod, P. and Posner, M. I. (1984). Privileged loops from percept to act. In Bouma, H. and Bouwhuis, D. G., editors, Attention and performance, volume 10, pages 55–66. Erlbaum, Hillsdale, NJ.

[Meisner and Longwell, 1987] Meisner, S. and Longwell, D. (1987). Sanford Meisner on Acting. Vintage, 1st edition.

[Meltzoff and Moore, 1997] Meltzoff, A. N. and Moore, M. K. (1997). Explaining facial imitation: a theoretical model. Early Development and Parenting, 6:179–192.

[Moore, 1968] Moore, S. (1968). Training an Actor: The Stanislavski System in Class. Viking Press, New York, NY.

[Pecher and Zwaan, 2005] Pecher, D. and Zwaan, R. A., editors (2005). Grounding cognition: the role of perception and action in memory, language, and thinking. Cambridge Univ. Press, Cambridge, UK.

[Pentland, 2004] Pentland, A. (2004). Social dynamics: Signals and behavior. In Proceedings of the Third International Conference on Development and Learning: ICDL 2004, La Jolla, CA.

[Pentland et al., 2005] Pentland, A., Choudhury, T., Eagle, N., and Singh, P. (2005). Human dynamics: computation for organizations. Pattern Recognition Letters, 26(4):503–511.

[Perlin and Goldberg, 1996] Perlin, K. and Goldberg, A. (1996). Improv: a system for scripting interactive actors in virtual worlds. In SIGGRAPH ’96: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 205–216. ACM Press.

[Pinhanez, 1999] Pinhanez, C. (1999). Representation and recognition of action in interactive spaces. PhD thesis, MIT Media Lab.

[Pinhanez and Bobick, 2002] Pinhanez, C. S. and Bobick, A. F. (2002). ”it/i”: a theater play featuring an autonomous computer character. Presence: Teleoper. Virtual Environ., 11(5):536–548.


[Reilly, 1995] Reilly, W. S. (1995). The art of creating emotional and robust interactive characters. In Working notes of the AAAI Spring Symposium on Interactive Story Systems, Stanford University, Palo Alto, CA.

[Rich et al., 2001] Rich, C., Sidner, C. L., and Lesh, N. (2001). Collagen: Applying collaborative discourse theory to human-computer collaboration. AI Magazine, 22(4):15–25.

[Roy and Pentland, 2002] Roy, D. K. and Pentland, A. P. (2002). Learning words from sights and sounds: a computational model. Cognitive Science, 26(1):113–146.

[Sakita et al., 2004] Sakita, K., Ogawam, K., Murakami, S., Kawamura, K., and Ikeuchi, K. (2004). Flexible cooperation between human and robot by interpreting human intention from gaze information. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pages 846–851.

[Searle, 1990] Searle, J. R. (1990). Collective intentions and actions. In Cohen, P. R., Morgan, J., and Pollack, M. E., editors, Intentions in Communication, chapter 19, pages 401–416. MIT Press, Cambridge, MA.

[Sebanz et al., 2006] Sebanz, N., Bekkering, H., and Knoblich, G. (2006). Joint action: bodies and minds moving together. Trends in Cognitive Sciences, 10(2):70–76.

[Sidner et al., 2003] Sidner, C., Lee, C., and Lesh, N. (2003). Engagement rules for human-robot collaborative interaction. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3957–3962.

[Sidner et al., 2006] Sidner, C. L., Lee, C., Morency, L.-P., and Forlines, C. (2006). The effect of head-nod recognition in human-robot conversation. In HRI ’06: Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction, pages 290–296, New York, NY, USA. ACM Press.

[Simmons and Barsalou, 2003] Simmons, K. and Barsalou, L. W. (2003). The similarity-in-topography principle: Reconciling theories of conceptual deficits. Cognitive Neuropsychology, 20:451–486.

[Solomon and Barsalou, 2001] Solomon, K. O. and Barsalou, L. W. (2001). Representing properties locally. Cognitive Psychology, 43:129–169.

[Somerville et al., 2004] Somerville, J. A., Woodward, A. L., and Needham, A. (2004). Action experience alters 3-month-old infants’ perception of others’ actions. Cognition.

[Spivey et al., 2005] Spivey, M. J., Richardson, D. C., and Gonzalez-Marquez, M. (2005). On the perceptual-motor and image-schematic infrastructure of language. In Pecher, D. and Zwaan, R. A., editors, Grounding cognition: the role of perception and action in memory, language, and thinking. Cambridge Univ. Press, Cambridge, UK.


[Stanfield and Zwaan, 2001] Stanfield, R. and Zwaan, R. (2001). The effect of implied orientation derived from verbal context on picture recognition. Psychological Science, 12:153–156.

[Stebbins, 1887] Stebbins, G. (1887). Delsarte System of Expression. Edgar S. Werner, New York, NY, 2nd edition.

[Swartout et al., 2005] Swartout, W., Gratch, J., Hill, R., Hovy, E., Lindheum, E., Marsella, S., Rickel, J., and Traum, D. (2005). Simulation meets Hollywood: Integrating graphics, sound, story and character for immersive simulation. In Stock, O. and Zancanaro, M., editors, Multimodal Intelligent Information Presentation Series: Text, Speech and Language Technology, volume 27. Springer-Verlag.

[Thomas and Martin, 1976] Thomas, E. A. C. and Martin, J. A. (1976). Analysis of parent-infant interaction. Psychological Review, 83(2):141–156.

[Thornton et al., 1998] Thornton, I., Pinto, J., and Shiffrar, M. (1998). The visual perception of human locomotion. Cognitive Neuropsychology, 15:535–552.

[Vala et al., 2003] Vala, M., Paiva, A., and Gomes, M. (2003). Happy characters don’t feel well in sad bodies! Lecture Notes in Computer Science, 2792:72–79.

[Wilson, 2001] Wilson, M. (2001). Perceiving imitable stimuli: consequences of isomorphism between input and output. Psychological Bulletin, 127(4):543–553.

[Wilson, 2002] Wilson, M. (2002). Six views of embodied cognition. Psychonomic Bulletin & Review, 9(4):625–636.

[Wilson and Knoblich, 2005] Wilson, M. and Knoblich, G. (2005). The case for motor involvement in perceiving conspecifics. Psychological Bulletin, 131:460–473.

[Wilson et al., 2004] Wilson, S., Saygin, A., Sereno, M., and Iacoboni, M. (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7:701–702.

[Woodward et al., 2001] Woodward, A. L., Sommerville, J. A., and Guajardo, J. J. (2001). How infants make sense of intentional actions. In Malle, B., Moses, L., and Baldwin, D., editors, Intentions and Intentionality: Foundations of Social Cognition, chapter 7, pages 149–169. MIT Press, Cambridge, MA.

[Wren et al., 1999] Wren, C., Basu, S., Sparacino, F., and Pentland, A. (1999). Combining audio and video in perceptive spaces. In Proceedings of: Managing Interactions in Smart Environments (MANSE 99), Dublin, Ireland.

[Wren et al., 2000] Wren, C., Clarkson, B., and Pentland, A. (2000). Understanding purposeful human motion. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 378–383.


[Wren and Pentland, 1997] Wren, C. R. and Pentland, A. (1997). Dynaman: A recursive model of human motion. Technical Report VisMod 451, MIT Media Laboratory.

Biography

Guy Hoffman is a Ph.D. candidate in the Robotic Life group at the MIT Media Lab. He holds an M.Sc. in Computer Science from Tel Aviv University, and has studied animation and design at Parsons School of Design in New York.
